To navigate the concepts in this course effectively, we need a common language for the mathematical objects and operations involved. This section establishes the notation used consistently across chapters. While we adhere to standard conventions in machine learning and deep learning literature where possible, clarity and consistency within this course are the primary goals. Familiarizing yourself with this notation will help in understanding the equations describing model architectures, training algorithms, and evaluation metrics.
General Mathematical Conventions
Scalars: Represented by lowercase italic letters (e.g., a, x, λ, η). These typically denote single numerical values, such as learning rates, regularization parameters, or individual elements of vectors/matrices.
Vectors: Represented by lowercase bold letters (e.g., x, y, w, b). By default, vectors are assumed to be column vectors. We denote the dimensionality as x ∈ R^d, indicating a vector with d real-valued elements. The i-th element of x is x_i.
Matrices: Represented by uppercase bold letters (e.g., X, Y, W). We denote the dimensionality as W ∈ R^{m×n}, indicating a matrix with m rows and n columns. The element in the i-th row and j-th column is W_ij or w_ij. An identity matrix is denoted by I.
Tensors: Higher-order arrays (rank > 2) are sometimes represented by uppercase calligraphic letters (e.g., T) or uppercase bold letters if the context makes the dimensionality clear (e.g., a batch of matrices). Dimensions will be specified, for instance, T ∈ R^{d_1 × d_2 × ⋯ × d_k}.
Indices and Summations: We typically use i, j, k for indexing elements or dimensions. t frequently denotes a specific position or time step in a sequence. Summation is denoted by ∑.
Sets: Represented by uppercase calligraphic letters (e.g., D for a dataset, V for a vocabulary). The size or cardinality of a set S is denoted by |S|.
Functions: Standard mathematical functions use italic lowercase (e.g., f(⋅), g(⋅)). Activation functions are often denoted by Greek letters (e.g., σ(⋅) for sigmoid, ϕ(⋅) for ReLU variants such as GELU). L(⋅) or J(⋅) typically represents a loss or objective function.
Derivatives and Gradients: The gradient of a scalar function J with respect to a vector w is denoted ∇_w J(w), or simply ∇J if the variable is clear from context. The partial derivative of f with respect to x is written ∂f/∂x.
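Anticipating the code mapping at the end of this section, here is a minimal PyTorch sketch of this notation: for the simple objective J(w) = wᵀw, automatic differentiation fills in ∇_w J(w). The dimension is arbitrary and chosen only for illustration.
import torch
w = torch.randn(4, requires_grad=True)  # w ∈ R^4
J = (w * w).sum()                       # J(w) = wᵀw, a scalar objective
J.backward()                            # populates w.grad with ∇_w J(w)
print(w.grad)                           # equals 2w for this objective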
Specific Notations for Language Models
Sequences: An input sequence of length T is often represented as a tuple of tokens (x_1, x_2, …, x_T) or of their corresponding embedding vectors (x_1, x_2, …, x_T), where the bold x_t denotes the embedding of token x_t. T denotes the sequence length.
Vocabulary and Tokenization: The set of unique tokens (words, subwords) is the vocabulary V. Its size is |V|. x_t often represents the integer index of the token at position t.
Embeddings:
Token embedding matrix: E ∈ R^{|V| × d_model}, where d_model is the model's hidden dimension.
Embedding vector for token index i: e_i, which is the i-th row of E.
Positional encoding vector for position t: p_t ∈ R^{d_model}.
Input representation at position t: z_t = e_t + p_t (or variations depending on the model).
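To make these definitions concrete, here is a minimal PyTorch sketch of the input representation z_t = e_t + p_t. The vocabulary size, hidden dimension, and the use of learned positional embeddings are illustrative assumptions, not a prescription.
import torch
import torch.nn as nn
vocab_size, d_model, T = 1000, 64, 10          # illustrative sizes
token_emb = nn.Embedding(vocab_size, d_model)  # its rows play the role of E ∈ R^{|V| × d_model}
pos_emb = nn.Embedding(T, d_model)             # one learned vector p_t per position
x = torch.randint(0, vocab_size, (T,))         # token indices x_1, ..., x_T
z = token_emb(x) + pos_emb(torch.arange(T))    # z_t = e_t + p_t, shape (T, d_model)
print(z.shape)                                 # torch.Size([10, 64])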
Transformer Components:
Query, Key, Value matrices for a sequence: Q, K, V ∈ R^{T × d_k} (within a single attention head, or R^{T × d_model} before projection).
Associated weight matrices: W_Q, W_K, W_V ∈ R^{d_model × d_k} (per head) or R^{d_model × d_model} (overall projection).
Hidden states at layer l: H^(l) ∈ R^{T × d_model}. The input embeddings are often H^(0).
Feed-Forward Network: FFN(⋅).
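The shapes above can be checked with a short sketch of the per-head projections; a single head and the specific sizes below are assumptions made only for illustration.
import torch
T, d_model, h = 10, 64, 8
d_k = d_model // h                      # d_k = d_model / h
H = torch.randn(T, d_model)             # hidden states H^(l)
W_Q = torch.randn(d_model, d_k)         # per-head projection matrices
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)
Q, K, V = H @ W_Q, H @ W_K, H @ W_V     # Q, K, V ∈ R^{T × d_k}
print(Q.shape, K.shape, V.shape)        # each torch.Size([10, 8])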
Model Parameters: The set of all trainable parameters (weights and biases) is denoted by θ.
Data and Training:
Dataset: D.
Training example: (x, y), where x is the input and y is the target.
Batch of data: B. Batch size B = |B|.
Loss function: L(θ) or J(θ). Per-example loss L(ŷ, y).
Probability:
Probability distribution: P(⋅).
Conditional probability: P(Y | X).
Model's predicted probability for token y given context c: P_θ(y | c).
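In code, P_θ(y | c) usually appears as a softmax over the model's output logits, and the per-example loss L(ŷ, y) as the negative log-probability of the target token. A minimal sketch with made-up logits and an arbitrary target index:
import torch
import torch.nn.functional as F
vocab_size = 1000
logits = torch.randn(vocab_size)       # model outputs for one position, given context c
probs = F.softmax(logits, dim=-1)      # P_θ(y | c) for every y in V
y = 42                                 # index of the target token (illustrative)
loss = -torch.log(probs[y])            # per-example cross-entropy L(ŷ, y)
loss_ce = F.cross_entropy(logits.unsqueeze(0), torch.tensor([y]))  # equivalent, numerically preferable
print(loss.item(), loss_ce.item())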
Hyperparameters:
Learning rate: η.
Number of layers: L.
Model hidden dimension: d_model.
FFN intermediate dimension: d_ff.
Number of attention heads: h.
Dimension per head: d_k = d_model / h.
Dropout probability: p_drop.
Weight decay coefficient: λ.
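When these symbols appear in code, they typically map to plainly named configuration fields. The names and values below are common conventions assumed for illustration, not a fixed standard.
# Hypothetical configuration mirroring the symbols above
config = {
    "learning_rate": 3e-4,   # η
    "num_layers": 12,        # L
    "d_model": 768,          # d_model
    "d_ff": 3072,            # d_ff
    "num_heads": 12,         # h  (so d_k = d_model / h = 64)
    "dropout": 0.1,          # p_drop
    "weight_decay": 0.01,    # λ
}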
Mapping to Code (PyTorch Example)
Mathematical notation maps directly to tensor operations in frameworks like PyTorch. Understanding this mapping is helpful for implementation.
A vector x ∈ R^d:
import torch
d = 128
x = torch.randn(d) # Typically a 1D tensor
# Or explicitly a column vector (2D tensor)
x_col = torch.randn(d, 1)
print(f"Vector shape: {x.shape}, Column vector shape: {x_col.shape}")
A matrix W ∈ R^{m×n}:
m, n = 64, 128
W = torch.randn(m, n) # A 2D tensor
print(f"Matrix shape: {W.shape}")
Matrix-vector multiplication y = Wx + b, where W ∈ R^{m×n}, x ∈ R^n, b ∈ R^m, y ∈ R^m:
# Define W, x, and b with matching dimensions
m, n = 64, 128
W = torch.randn(m, n)
x = torch.randn(n)
b = torch.randn(m)
# Using torch.matmul or @ operator
y = W @ x + b
# Alternative: torch.nn.functional.linear
# import torch.nn.functional as F
# y_f = F.linear(x, W, b)  # computes x @ W.T + b; with W of shape (m, n), this equals W @ x + b
print(f"Input x shape: {x.shape}")
print(f"Weight W shape: {W.shape}")
print(f"Bias b shape: {b.shape}")
print(f"Output y shape: {y.shape}")
Batch processing: Often, the first dimension represents the batch size B. For example, a batch of sequences might have the shape (B, T, d_model).
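For example, a batch of embedded sequences and a linear layer applied position-wise across the batch, with illustrative sizes:
import torch
import torch.nn as nn
B, T, d_model = 32, 128, 768
batch = torch.randn(B, T, d_model)   # shape (B, T, d_model)
proj = nn.Linear(d_model, d_model)   # applied independently at every (batch, position) pair
out = proj(batch)
print(out.shape)                     # torch.Size([32, 128, 768])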
This notation forms the basis for our discussions. Any deviations or context-specific symbols introduced in later chapters will be defined locally. Keep this reference handy as you progress through the material.