Linear algebra provides the mathematical language for manipulating the high-dimensional data prevalent in large language models. Vectors and matrices are the primary objects we use to represent input data, model parameters, and intermediate activations within neural networks. Understanding their properties and operations is fundamental for grasping how information flows and transforms within these models.
At its core, a vector is an ordered list of numbers, often representing a point or direction in a multi-dimensional space. In the context of LLMs, vectors typically represent token embeddings, positional encodings, and the intermediate activations (hidden states) that flow between layers.
A vector $v \in \mathbb{R}^n$ is denoted as:

$$v = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}$$

Basic operations like vector addition and scalar multiplication allow us to combine or scale these representations. For instance, adding an input embedding to a positional encoding vector combines semantic and positional information.
In PyTorch, vectors are represented as 1-dimensional tensors.
import torch
# Example: A 5-dimensional vector
vector_a = torch.tensor([1.0, 2.5, -0.8, 4.0, 0.0])
# Example: Scalar multiplication
scaled_vector = 2.0 * vector_a
# Example: Vector addition
vector_b = torch.tensor([-0.5, 1.0, 1.2, -2.0, 1.5])
summed_vector = vector_a + vector_b
print(f"Original Vector A: {vector_a}")
print(f"Scaled Vector: {scaled_vector}")
print(f"Summed Vector: {summed_vector}")
print(f"Vector Dimension (rank): {vector_a.ndim}")
print(f"Vector Shape: {vector_a.shape}")
Matrices are rectangular arrays of numbers, extending vectors to two dimensions. Their most significant role in deep learning is representing linear transformations between vector spaces.
A matrix $A$ with $m$ rows and $n$ columns ($A \in \mathbb{R}^{m \times n}$) is:

$$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & A_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{m,1} & A_{m,2} & \cdots & A_{m,n} \end{bmatrix}$$

Certain linear algebra operations are ubiquitous in neural network computations.
Matrix-vector multiplication applies the linear transformation defined by a matrix $W$ to a vector $x$. If $W \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^n$, the result $y = Wx$ is a vector in $\mathbb{R}^m$. This is the fundamental calculation within a dense layer (without bias) in a neural network, transforming an $n$-dimensional input representation into an $m$-dimensional output representation.
$$y_i = \sum_{j=1}^{n} W_{i,j} x_j$$

import torch
# Define a weight matrix (e.g., for a layer mapping 4 features to 3)
W = torch.randn(3, 4) # Shape: (output_dim, input_dim)
# Define an input vector (4 features)
x = torch.tensor([1.0, 0.5, -1.0, 2.0]) # Shape: (input_dim,)
# Perform matrix-vector multiplication
# Note: torch.matmul handles dimensions appropriately
y = torch.matmul(W, x) # or W @ x
print(f"Weight Matrix W (shape {W.shape}):\n{W}")
print(f"\nInput Vector x (shape {x.shape}): {x}")
print(f"\nOutput Vector y (shape {y.shape}): {y}")
Here, the matrix W transforms the vector x from a 4-dimensional space to a 3-dimensional space.
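This is exactly what a dense layer computes. The sketch below assumes PyTorch's torch.nn.Linear convention of storing its weight with shape (out_features, in_features); it copies a weight matrix into a bias-free linear layer and checks that the layer reproduces the manual product Wx. The specific dimensions (4 in, 3 out) simply mirror the example above.

import torch

W = torch.randn(3, 4)                    # Weight matrix: maps 4 features to 3
x = torch.tensor([1.0, 0.5, -1.0, 2.0])  # Input vector with 4 features

# A dense layer without bias applies the same transformation y = Wx
linear = torch.nn.Linear(in_features=4, out_features=3, bias=False)
with torch.no_grad():
    linear.weight.copy_(W)  # nn.Linear stores weights as (out_features, in_features)

y_manual = W @ x
y_layer = linear(x)
print(f"Manual Wx:    {y_manual}")
print(f"nn.Linear(x): {y_layer}")
print(f"Match: {torch.allclose(y_manual, y_layer)}")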
Multiplying two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$ results in a matrix $C = AB \in \mathbb{R}^{m \times p}$. This is used extensively when processing data in batches, where the input $X$ might be a matrix in which each row is an input vector, or when composing multiple linear transformations. For instance, the computation within a Transformer's feed-forward network often involves multiple matrix multiplications.
$$C_{i,k} = \sum_{j=1}^{n} A_{i,j} B_{j,k}$$

import torch
# Input batch (e.g., 2 sequences/samples, 4 features each)
X = torch.randn(2, 4) # Shape: (batch_size, input_dim)
# Weight matrix from previous example
W = torch.randn(3, 4) # Shape: (output_dim, input_dim)
# Apply transformation to the batch
# We need W transposed to match dimensions for standard matmul convention
# Y = X @ W.T results in (2, 4) @ (4, 3) -> (2, 3)
Y = torch.matmul(X, W.T) # Shape: (batch_size, output_dim)
print(f"Input Batch X (shape {X.shape}):\n{X}")
print(f"\nWeight Matrix W Transposed (shape {W.T.shape}):\n{W.T}")
print(f"\nOutput Batch Y (shape {Y.shape}):\n{Y}")
Element-wise (Hadamard) multiplication involves multiplying corresponding elements of two matrices (or vectors) of the same shape. Denoted $A \odot B$, the result $C$ has $C_{i,j} = A_{i,j} \times B_{i,j}$. This is distinct from matrix multiplication and appears in various neural network components, such as applying activation functions element-wise or implementing gating mechanisms in LSTMs or GRUs.
import torch
A = torch.tensor([[1., 2.], [3., 4.]])
B = torch.tensor([[0.5, 1.], [-1., 2.]])
# Element-wise multiplication
C = A * B # or torch.multiply(A, B)
print(f"Matrix A:\n{A}")
print(f"Matrix B:\n{B}")
print(f"Element-wise Product C:\n{C}")
A fundamental operation is the dot product (or inner product) between two vectors $v, w \in \mathbb{R}^n$. It's calculated as $v \cdot w = \sum_{i=1}^{n} v_i w_i$. Geometrically, it relates to the projection of one vector onto another ($v \cdot w = \|v\| \|w\| \cos\theta$, where $\theta$ is the angle between them).
The dot product is computationally equivalent to matrix multiplication if the first vector is treated as a row vector and the second as a column vector: $v^T w$.
$$v^T w = \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} = \sum_{i=1}^{n} v_i w_i$$

In LLMs, the dot product is central to the attention mechanism. Scaled dot-product attention computes the relevance between query (Q) and key (K) vectors using dot products to determine how much focus to place on different parts of the input sequence.
import torch
v = torch.tensor([1.0, 2.0, -1.0])
w = torch.tensor([3.0, -1.0, 0.5])
# Calculate dot product
dot_product_val = torch.dot(v, w)
print(f"Vector v: {v}")
print(f"Vector w: {w}")
print(
f"Dot Product: {dot_product_val}"
) # Expected: (1*3) + (2*-1) + (-1*0.5) = 3 - 2 - 0.5 = 0.5
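Building on the attention remark above, the following sketch computes scaled dot-product attention for a single query against a few keys: dot products give the scores, a softmax turns them into weights, and the weights form a weighted sum of the value vectors. The tensors, their values, and the dimension d_k = 4 are made up purely for illustration.

import torch
torch.manual_seed(0)

d_k = 4                  # Key/query dimension (illustrative)
q = torch.randn(1, d_k)  # One query vector
K = torch.randn(3, d_k)  # Three key vectors
V = torch.randn(3, d_k)  # Three value vectors

# Dot products between the query and every key, scaled by sqrt(d_k)
scores = (q @ K.T) / (d_k ** 0.5)  # Shape: (1, 3)

# Softmax turns scores into attention weights that sum to 1
weights = torch.softmax(scores, dim=-1)

# Output is a weighted sum of the value vectors
output = weights @ V               # Shape: (1, d_k)
print(f"Attention weights: {weights}")
print(f"Attention output:  {output}")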
A norm is a function that assigns a strictly positive length or size to each vector in a vector space (except for the zero vector, which has zero length). The most common norms in machine learning are the L1 norm, $\|v\|_1 = \sum_{i} |v_i|$, which sums the absolute values of the components, and the L2 (Euclidean) norm, $\|v\|_2 = \sqrt{\sum_{i} v_i^2}$, which measures straight-line length.
Norms also appear in normalization techniques such as Layer Normalization, where activations are rescaled by a factor tied to their overall magnitude (closely related to their L2 norm).
import torch
v = torch.tensor([3.0, -4.0, 0.0])
l2_norm = torch.linalg.norm(v, ord=2) # or simply torch.linalg.norm(v)
l1_norm = torch.linalg.norm(v, ord=1)
print(f"Vector v: {v}")
# Expected: sqrt(3^2 + (-4)^2 + 0^2) = sqrt(9 + 16) = sqrt(25) = 5.0
print(f"L2 Norm: {l2_norm}")
print(f"L1 Norm: {l1_norm}") # Expected: |3| + |-4| + |0| = 3 + 4 + 0 = 7.0
While we've focused on vectors (1D) and matrices (2D), deep learning relies heavily on tensors, which generalize these objects to higher dimensions. For example, the activations flowing through an LLM are commonly 3D tensors of shape (batch_size, sequence_length, embedding_dim). Keeping track of tensor shapes is essential for ensuring operations are compatible; mismatched dimensions are a common source of errors in deep learning code. PyTorch and other frameworks provide tools to inspect and manipulate tensor shapes (.shape, .reshape(), .permute(), etc.).
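To illustrate the shape-tracking point, the sketch below builds a tensor with a (batch_size, sequence_length, embedding_dim) layout, using small arbitrary dimensions, and then inspects and rearranges it with .shape, .reshape(), and .permute().

import torch

# Arbitrary small dimensions for illustration
batch_size, seq_len, emb_dim = 2, 3, 4
activations = torch.randn(batch_size, seq_len, emb_dim)

print(f"Original shape: {activations.shape}")  # (2, 3, 4)

# Flatten the batch and sequence dimensions into one
flattened = activations.reshape(batch_size * seq_len, emb_dim)
print(f"Reshaped: {flattened.shape}")          # (6, 4)

# Swap the batch and sequence dimensions
permuted = activations.permute(1, 0, 2)
print(f"Permuted: {permuted.shape}")           # (3, 2, 4)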
This review covers the most immediately relevant linear algebra concepts. As we proceed, particularly when discussing the Transformer architecture and attention mechanisms, the roles of matrix multiplication, dot products, and managing tensor dimensions will become increasingly apparent. Having a firm grasp of these operations is indispensable for understanding and implementing large language models effectively.