A common language for mathematical objects and operations is essential for clear communication in machine learning. This section establishes the notation used consistently throughout the material. Standard conventions from the machine learning and deep learning literature are followed where possible, but clarity and consistency take priority. Familiarity with this notation will help you read the equations that describe model architectures, training algorithms, and evaluation metrics.
General Mathematical Conventions
Scalars: Represented by lowercase italic letters (e.g., a, x, λ, η). These typically denote single numerical values, such as learning rates, regularization parameters, or individual elements of vectors/matrices.
Vectors: Represented by lowercase bold letters (e.g., x, y, w, b). By default, vectors are assumed to be column vectors. We denote the dimensionality as x ∈ ℝ^d, indicating a vector with d real-valued elements. The i-th element of x is x_i.
Matrices: Represented by uppercase bold letters (e.g., X, Y, W). We denote the dimensionality as W ∈ ℝ^(m×n), indicating a matrix with m rows and n columns. The element in the i-th row and j-th column is W_ij or w_ij. An identity matrix is denoted by I.
Tensors: Higher-order arrays (rank > 2) are sometimes represented by uppercase calligraphic letters (e.g., T) or uppercase bold letters if the context makes the dimensionality clear (e.g., a batch of matrices). Dimensions will be specified, for instance, T ∈ ℝ^(d_1×d_2×⋯×d_k).
Indices and Summations: We typically use i, j, k for indexing elements or dimensions; t frequently denotes a specific position or time step in a sequence. Summation is denoted by ∑.
Sets: Represented by uppercase calligraphic letters (e.g., D for a dataset, V for a vocabulary). The size or cardinality of a set S is denoted by |S|.
Functions: Standard mathematical functions use italic lowercase (e.g., f(⋅), g(⋅)). Activation functions are often denoted by Greek letters (e.g., σ(⋅) for the sigmoid, ϕ(⋅) for ReLU variants such as GELU). L(⋅) or J(⋅) typically denotes a loss or objective function.
Derivatives and Gradients: The gradient of a scalar function J with respect to a vector w is denoted ∇_w J(w), or simply ∇J if the variable is clear from context. Partial derivatives are written as ∂f/∂x.
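To make the gradient notation concrete, here is a minimal PyTorch sketch; the quadratic objective J(w) = wᵀw is chosen purely for illustration, and autograd fills in ∇_w J(w):
import torch
# Toy objective J(w) = w^T w, whose gradient is ∇_w J(w) = 2w
w = torch.randn(5, requires_grad=True)
J = (w ** 2).sum()
J.backward()                          # populates w.grad with ∇_w J(w)
print(torch.allclose(w.grad, 2 * w))  # True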
Specific Notations for Language Models
Sequences: An input sequence of length T is often represented as a list or tuple of tokens (x_1, x_2, …, x_T) or of their corresponding embedding vectors (x_1, x_2, …, x_T), with bold symbols used for the vectors. T denotes the sequence length.
Vocabulary and Tokenization: The set of unique tokens (words, subwords) is the vocabulary V. Its size is |V|. x_t often represents the integer index of the token at position t.
Embeddings:
Token embedding matrix: E ∈ ℝ^(|V|×d_model), where d_model is the model's hidden dimension.
Embedding vector for token index i: e_i, which is the i-th row of E.
Positional encoding vector for position t: p_t ∈ ℝ^(d_model).
Input representation at position t: z_t = e_t + p_t (or variations depending on the model).
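The embedding quantities above map to tensors as in the following sketch; the dimensions and the learned positional table are illustrative choices, not a specific model's scheme:
import torch
import torch.nn as nn

vocab_size, d_model, T = 1000, 64, 8    # illustrative |V|, d_model, and sequence length
E = nn.Embedding(vocab_size, d_model)   # token embedding matrix E with shape (|V|, d_model)
P = nn.Embedding(T, d_model)            # one positional vector p_t per position (learned here)

x = torch.randint(0, vocab_size, (T,))  # token indices x_1, ..., x_T
e = E(x)                                # embedding rows for each token, shape (T, d_model)
p = P(torch.arange(T))                  # p_t for each position, shape (T, d_model)
z = e + p                               # z_t = e_t + p_t
print(z.shape)                          # torch.Size([8, 64])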
Transformer Components:
Query, Key, Value matrices for a sequence: Q, K, V ∈ ℝ^(T×d_k) (within a single attention head, or ℝ^(T×d_model) before projection).
Associated weight matrices: W^Q, W^K, W^V ∈ ℝ^(d_model×d_k) (per head) or ℝ^(d_model×d_model) (overall projection).
Hidden states at layer l: H^(l) ∈ ℝ^(T×d_model). The input embeddings are often H^(0).
Feed-Forward Network: FFN(⋅).
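The shapes of these components can be checked with a short sketch; the weights below are random and only a single head is shown, so this is a shape illustration rather than a full multi-head attention implementation:
import torch

T, d_model, h = 8, 64, 4
d_k = d_model // h

H = torch.randn(T, d_model)             # hidden states H^(l) for one sequence
W_Q = torch.randn(d_model, d_k)         # per-head projection W^Q
W_K = torch.randn(d_model, d_k)         # per-head projection W^K
W_V = torch.randn(d_model, d_k)         # per-head projection W^V

Q, K, V = H @ W_Q, H @ W_K, H @ W_V     # each of shape (T, d_k)
scores = Q @ K.T / d_k ** 0.5           # scaled dot-product scores, shape (T, T)
head_out = torch.softmax(scores, dim=-1) @ V  # single-head output, shape (T, d_k)
print(Q.shape, scores.shape, head_out.shape)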
Model Parameters: The set of all trainable parameters (weights and biases) is denoted by θ.
Data and Training:
Dataset: D.
Training example: (x,y) where x is input and y is target.
Batch of data: B. Batch size B = |B|.
Loss function: L(θ) or J(θ). Per-example loss: L(ŷ, y).
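As a concrete toy example of a per-example loss, the sketch below computes the cross-entropy L(ŷ, y) for a single training example; the class count and random logits are illustrative:
import torch
import torch.nn.functional as F

logits = torch.randn(1, 4)              # ŷ: unnormalized scores over 4 toy classes
target = torch.tensor([2])              # y: the true class index
loss = F.cross_entropy(logits, target)  # per-example loss L(ŷ, y)
print(loss.item())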
Probability:
Probability distribution: P(⋅).
Conditional probability: P(Y | X).
Model's predicted probability for token y given context c: P_θ(y | c).
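In code, P_θ(y | c) is typically obtained by applying a softmax to the model's output logits; the sketch below uses random logits and an illustrative vocabulary size:
import torch
import torch.nn.functional as F

vocab_size = 10                         # illustrative |V|
logits = torch.randn(vocab_size)        # scores the model produced from context c
probs = F.softmax(logits, dim=-1)       # P_θ(y | c) for every candidate token y
print(probs.sum())                      # the distribution sums to 1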
Hyperparameters:
Learning rate: η.
Number of layers: L.
Model hidden dimension: d_model.
FFN intermediate dimension: d_ff.
Number of attention heads: h.
Dimension per head: d_k = d_model / h.
Dropout probability: p_drop.
Weight decay coefficient: λ.
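These hyperparameters are often gathered into a single configuration object; the sketch below is hypothetical, and its default values are illustrative rather than recommendations:
from dataclasses import dataclass

@dataclass
class ModelConfig:
    eta: float = 3e-4           # learning rate η
    n_layers: int = 12          # number of layers L
    d_model: int = 768          # model hidden dimension
    d_ff: int = 3072            # FFN intermediate dimension
    n_heads: int = 12           # number of attention heads h
    p_drop: float = 0.1         # dropout probability
    weight_decay: float = 0.01  # weight decay coefficient λ

    @property
    def d_k(self) -> int:       # dimension per head, d_k = d_model / h
        return self.d_model // self.n_heads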
Mapping to Code (PyTorch Example)
Mathematical notation maps directly to tensor operations in frameworks like PyTorch. Understanding this mapping is helpful for implementation.
A vector x ∈ ℝ^d:
import torch
d = 128
x = torch.randn(d) # Typically a 1D tensor
# Or explicitly a column vector (2D tensor)
x_col = torch.randn(d, 1)
print(f"Vector shape: {x.shape}, Column vector shape: {x_col.shape}")
A matrix W ∈ ℝ^(m×n):
m, n = 64, 128
W = torch.randn(m, n) # A 2D tensor
print(f"Matrix shape: {W.shape}")
Matrix-vector multiplication y = Wx + b, where W ∈ ℝ^(m×n), x ∈ ℝ^n, b ∈ ℝ^m, y ∈ ℝ^m:
# Redefine the shapes so this example is self-contained
m, n = 64, 128
W = torch.randn(m, n)
x = torch.randn(n)
b = torch.randn(m)
# Using torch.matmul or the @ operator
y = W @ x + b
# Alternative: torch.nn.functional.linear computes x @ W.T + b,
# so a weight of shape (m, n) gives the same result as W @ x + b
# import torch.nn.functional as F
# y_f = F.linear(x, W, b)
print(f"Input x shape: {x.shape}")
print(f"Weight W shape: {W.shape}")
print(f"Bias b shape: {b.shape}")
print(f"Output y shape: {y.shape}")
Batch processing: Often, the first dimension represents the batch size B. For example, a batch of sequences might have the shape (B, T, d_model).
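A quick sketch of a batched tensor and of how standard layers broadcast over the leading batch and sequence dimensions (the sizes are illustrative):
import torch

B, T, d_model = 32, 128, 768            # illustrative batch size, sequence length, hidden size
batch = torch.randn(B, T, d_model)      # a batch of token representations, shape (B, T, d_model)
proj = torch.nn.Linear(d_model, d_model)
out = proj(batch)                       # Linear applies over the last dimension, preserving (B, T)
print(batch.shape, out.shape)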
This notation forms the basis for our discussions. Any deviations or context-specific symbols introduced in later chapters will be defined locally. Keep this reference handy as you progress through the material.