Linear algebra provides the mathematical language for manipulating the high-dimensional data prevalent in large language models. Vectors and matrices are the primary objects we use to represent input data, model parameters, and intermediate activations within neural networks. Understanding their properties and operations is fundamental for grasping how information flows and transforms within these models.
At its core, a vector is an ordered list of numbers, often representing a point or direction in a multi-dimensional space. In the context of LLMs, vectors typically represent token embeddings, positional encodings, and the intermediate activations (hidden states) that flow between layers.
A vector $v \in \mathbb{R}^n$ is denoted as:

$$v = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}$$

Basic operations like vector addition and scalar multiplication allow us to combine or scale these representations. For instance, adding an input embedding to a positional encoding vector combines semantic and positional information.
In PyTorch, vectors are represented as 1-dimensional tensors.
import torch
# Example: A 5-dimensional vector
vector_a = torch.tensor([1.0, 2.5, -0.8, 4.0, 0.0])
# Example: Scalar multiplication
scaled_vector = 2.0 * vector_a
# Example: Vector addition
vector_b = torch.tensor([-0.5, 1.0, 1.2, -2.0, 1.5])
summed_vector = vector_a + vector_b
print(f"Original Vector A: {vector_a}")
print(f"Scaled Vector: {scaled_vector}")
print(f"Summed Vector: {summed_vector}")
print(f"Vector Dimension (rank): {vector_a.ndim}")
print(f"Vector Shape: {vector_a.shape}")
Matrices are rectangular arrays of numbers, extending vectors to two dimensions. Their most significant role in deep learning is representing linear transformations between vector spaces.
A matrix $A$ with $m$ rows and $n$ columns ($A \in \mathbb{R}^{m \times n}$) is:

$$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & A_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{m,1} & A_{m,2} & \cdots & A_{m,n} \end{bmatrix}$$

Certain linear algebra operations are ubiquitous in neural network computations.
Matrix-vector multiplication applies the linear transformation defined by a matrix $W$ to a vector $x$. If $W \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^n$, the result $y = Wx$ is a vector in $\mathbb{R}^m$. This is the fundamental calculation within a dense layer (without bias) in a neural network, transforming an $n$-dimensional input representation into an $m$-dimensional output representation.
$$y_i = \sum_{j=1}^{n} W_{i,j} x_j$$

import torch
# Define a weight matrix (e.g., for a layer mapping 4 features to 3)
W = torch.randn(3, 4) # Shape: (output_dim, input_dim)
# Define an input vector (4 features)
x = torch.tensor([1.0, 0.5, -1.0, 2.0]) # Shape: (input_dim,)
# Perform matrix-vector multiplication
# Note: torch.matmul handles dimensions appropriately
y = torch.matmul(W, x) # or W @ x
print(f"Weight Matrix W (shape {W.shape}):\n{W}")
print(f"\nInput Vector x (shape {x.shape}): {x}")
print(f"\nOutput Vector y (shape {y.shape}): {y}")
Here, the matrix W transforms the vector x from a 4-dimensional space to a 3-dimensional space.
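This is exactly what a dense layer computes. The sketch below assumes PyTorch's torch.nn.Linear convention of storing its weight with shape (out_features, in_features); it copies a weight matrix into a bias-free linear layer and checks that the layer reproduces the manual product Wx. The specific dimensions (4 in, 3 out) simply mirror the example above.

import torch

W = torch.randn(3, 4)                    # Weight matrix: maps 4 features to 3
x = torch.tensor([1.0, 0.5, -1.0, 2.0])  # Input vector with 4 features

# A dense layer without bias applies the same transformation y = Wx
linear = torch.nn.Linear(in_features=4, out_features=3, bias=False)
with torch.no_grad():
    linear.weight.copy_(W)  # nn.Linear stores weights as (out_features, in_features)

y_manual = W @ x
y_layer = linear(x)
print(f"Manual Wx:    {y_manual}")
print(f"nn.Linear(x): {y_layer}")
print(f"Match: {torch.allclose(y_manual, y_layer)}")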
Multiplying two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$ results in a matrix $C = AB \in \mathbb{R}^{m \times p}$. This is used extensively when processing data in batches, where the input $X$ might be a matrix in which each row is an input vector, or when composing multiple linear transformations. For instance, the computation within a Transformer's feed-forward network often involves multiple matrix multiplications.
$$C_{i,k} = \sum_{j=1}^{n} A_{i,j} B_{j,k}$$

import torch
# Input batch (e.g., 2 sequences/samples, 4 features each)
X = torch.randn(2, 4) # Shape: (batch_size, input_dim)
# Weight matrix from previous example
W = torch.randn(3, 4) # Shape: (output_dim, input_dim)
# Apply transformation to the batch
# We need W transposed to match dimensions for standard matmul convention
# Y = X @ W.T results in (2, 4) @ (4, 3) -> (2, 3)
Y = torch.matmul(X, W.T) # Shape: (batch_size, output_dim)
print(f"Input Batch X (shape {X.shape}):\n{X}")
print(f"\nWeight Matrix W Transposed (shape {W.T.shape}):\n{W.T}")
print(f"\nOutput Batch Y (shape {Y.shape}):\n{Y}")
Element-wise (Hadamard) multiplication involves multiplying corresponding elements of two matrices (or vectors) of the same shape. Denoted $A \odot B$, the result $C$ has $C_{i,j} = A_{i,j} \times B_{i,j}$. This is distinct from matrix multiplication and appears in various neural network components, such as applying activation functions element-wise or implementing gating mechanisms in LSTMs or GRUs.
import torch
A = torch.tensor([[1., 2.], [3., 4.]])
B = torch.tensor([[0.5, 1.], [-1., 2.]])
# Element-wise multiplication
C = A * B # or torch.multiply(A, B)
print(f"Matrix A:\n{A}")
print(f"Matrix B:\n{B}")
print(f"Element-wise Product C:\n{C}")
A fundamental operation is the dot product (or inner product) between two vectors $v, w \in \mathbb{R}^n$. It's calculated as $v \cdot w = \sum_{i=1}^{n} v_i w_i$. Geometrically, it relates to the projection of one vector onto another ($v \cdot w = \|v\| \|w\| \cos\theta$, where $\theta$ is the angle between them).
The dot product is computationally equivalent to matrix multiplication if the first vector is treated as a row vector and the second as a column vector: $v^T w$.
$$v^T w = \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} = \sum_{i=1}^{n} v_i w_i$$

In LLMs, the dot product is central to the attention mechanism. Scaled dot-product attention computes the relevance between query (Q) and key (K) vectors using dot products to determine how much focus to place on different parts of the input sequence.
import torch
v = torch.tensor([1.0, 2.0, -1.0])
w = torch.tensor([3.0, -1.0, 0.5])
# Calculate dot product
dot_product_val = torch.dot(v, w)
print(f"Vector v: {v}")
print(f"Vector w: {w}")
print(
f"Dot Product: {dot_product_val}"
) # Expected: (1*3) + (2*-1) + (-1*0.5) = 3 - 2 - 0.5 = 0.5
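Building on the attention remark above, the following sketch computes scaled dot-product attention for a single query against a few keys: dot products give the scores, a softmax turns them into weights, and the weights form a weighted sum of the value vectors. The tensors, their values, and the dimension d_k = 4 are made up purely for illustration.

import torch
torch.manual_seed(0)

d_k = 4                  # Key/query dimension (illustrative)
q = torch.randn(1, d_k)  # One query vector
K = torch.randn(3, d_k)  # Three key vectors
V = torch.randn(3, d_k)  # Three value vectors

# Dot products between the query and every key, scaled by sqrt(d_k)
scores = (q @ K.T) / (d_k ** 0.5)  # Shape: (1, 3)

# Softmax turns scores into attention weights that sum to 1
weights = torch.softmax(scores, dim=-1)

# Output is a weighted sum of the value vectors
output = weights @ V               # Shape: (1, d_k)
print(f"Attention weights: {weights}")
print(f"Attention output:  {output}")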
A norm is a function that assigns a strictly positive length or size to each vector in a vector space (except for the zero vector, which has zero length). The most common norms in machine learning are the L1 norm, $\|v\|_1 = \sum_{i} |v_i|$, which sums the absolute values of the components, and the L2 (Euclidean) norm, $\|v\|_2 = \sqrt{\sum_{i} v_i^2}$, which measures straight-line length.
Norms also appear in normalization techniques such as Layer Normalization, where activations are rescaled by a factor tied to their overall magnitude (closely related to their L2 norm).
import torch
v = torch.tensor([3.0, -4.0, 0.0])
l2_norm = torch.linalg.norm(v, ord=2) # or simply torch.linalg.norm(v)
l1_norm = torch.linalg.norm(v, ord=1)
print(f"Vector v: {v}")
# Expected: sqrt(3^2 + (-4)^2 + 0^2) = sqrt(9 + 16) = sqrt(25) = 5.0
print(f"L2 Norm: {l2_norm}")
print(f"L1 Norm: {l1_norm}") # Expected: |3| + |-4| + |0| = 3 + 4 + 0 = 7.0
While we've focused on vectors (1D) and matrices (2D), deep learning relies heavily on tensors, which generalize these objects to higher dimensions. For example, the activations flowing through an LLM are commonly 3D tensors of shape (batch_size, sequence_length, embedding_dim). Keeping track of tensor shapes is essential for ensuring operations are compatible; mismatched dimensions are a common source of errors in deep learning code. PyTorch and other frameworks provide tools to inspect and manipulate tensor shapes (.shape, .reshape(), .permute(), etc.).
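To illustrate the shape-tracking point, the sketch below builds a tensor with a (batch_size, sequence_length, embedding_dim) layout, using small arbitrary dimensions, and then inspects and rearranges it with .shape, .reshape(), and .permute().

import torch

# Arbitrary small dimensions for illustration
batch_size, seq_len, emb_dim = 2, 3, 4
activations = torch.randn(batch_size, seq_len, emb_dim)

print(f"Original shape: {activations.shape}")  # (2, 3, 4)

# Flatten the batch and sequence dimensions into one
flattened = activations.reshape(batch_size * seq_len, emb_dim)
print(f"Reshaped: {flattened.shape}")          # (6, 4)

# Swap the batch and sequence dimensions
permuted = activations.permute(1, 0, 2)
print(f"Permuted: {permuted.shape}")           # (3, 2, 4)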
This review covers the most immediately relevant linear algebra concepts. As we proceed, particularly when discussing the Transformer architecture and attention mechanisms, the roles of matrix multiplication, dot products, and managing tensor dimensions will become increasingly apparent. Having a firm grasp of these operations is indispensable for understanding and implementing large language models effectively.