Model-Agnostic Meta-Learning (MAML) stands as a prominent gradient-based meta-learning algorithm. Its central premise, as introduced in the chapter overview, is to find a set of initial model parameters θ that are highly amenable to rapid adaptation. Instead of learning parameters that perform well on average across tasks, MAML learns parameters that require only a few gradient updates on a small amount of data from a new task to achieve good performance on that specific task.
Let's formalize this. Consider a distribution of tasks $p(T)$. During meta-training, we sample a batch of tasks $\{T_i\}_{i=1}^{B}$. Each task $T_i$ is associated with a loss function $L_{T_i}$ and typically provides a support set $D_{T_i}^{supp}$ for adaptation and a query set $D_{T_i}^{query}$ for evaluating the adapted parameters.
The core idea involves a two-step optimization process:
Inner Loop (Task-Specific Adaptation): For each task $T_i$, starting from the shared initial parameters $\theta$, we perform one or more gradient descent steps using the task's support set $D_{T_i}^{supp}$. For a single gradient step with learning rate $\alpha$:

$$\theta_i' = \theta - \alpha \nabla_\theta L_{T_i}(\theta, D_{T_i}^{supp})$$

These adapted parameters $\theta_i'$ are specific to task $T_i$.
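As a concrete sketch, the inner update can be written functionally so that the adapted parameters keep their dependence on $\theta$. The linear model, parameter layout, and learning rate below are illustrative assumptions, not part of MAML itself; in PyTorch, passing create_graph=True is what later allows the outer loop to differentiate through this step.

```python
import torch

def forward(params, x):
    # Hypothetical single-layer linear model; params is a list [W, b].
    W, b = params
    return x @ W + b

def task_loss(params, x, y):
    # Mean squared error on one task's data.
    return ((forward(params, x) - y) ** 2).mean()

def inner_update(params, x_supp, y_supp, alpha=0.01):
    # One inner-loop gradient step on the support set.
    # create_graph=True retains the graph of this update so the outer loop
    # can backpropagate through it (needed for second-order MAML).
    loss = task_loss(params, x_supp, y_supp)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    return [p - alpha * g for p, g in zip(params, grads)]
```

Here params would be a list of tensors created with requires_grad=True; the returned adapted parameters are functions of $\theta$ rather than detached copies.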
Outer Loop (Meta-Optimization): The goal is to update the initial parameters $\theta$ to minimize the expected loss across tasks after adaptation. The meta-objective is the sum (or average) of the losses computed with the adapted parameters $\theta_i'$ on their respective query sets $D_{T_i}^{query}$:

$$\min_\theta \sum_{T_i \sim p(T)} L_{T_i}(\theta_i', D_{T_i}^{query})$$

Substituting the expression for $\theta_i'$ from the inner loop, the objective becomes:

$$\min_\theta \sum_{T_i \sim p(T)} L_{T_i}\big(\theta - \alpha \nabla_\theta L_{T_i}(\theta, D_{T_i}^{supp}),\, D_{T_i}^{query}\big)$$

The meta-parameters $\theta$ are updated by gradient descent on this meta-objective, typically with a meta-learning rate $\beta$:

$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_{T_i \sim p(T)} L_{T_i}(\theta_i', D_{T_i}^{query})$$
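Continuing the sketch above, one meta-update over a batch of tasks can be expressed by summing the query-set losses of the adapted parameters and differentiating with respect to the initial parameters. The task_batch structure and step sizes here are assumptions made for illustration.

```python
import torch

def meta_step(params, task_batch, alpha=0.01, beta=0.001):
    # task_batch is assumed to be a list of (x_supp, y_supp, x_query, y_query)
    # tuples; params is a list of tensors with requires_grad=True.
    meta_loss = 0.0
    for x_supp, y_supp, x_query, y_query in task_batch:
        adapted = inner_update(params, x_supp, y_supp, alpha)   # inner loop
        meta_loss = meta_loss + task_loss(adapted, x_query, y_query)

    # Differentiating the summed query loss w.r.t. the *initial* parameters
    # backpropagates through the inner update, which is where the
    # second-order terms enter.
    meta_grads = torch.autograd.grad(meta_loss, params)
    with torch.no_grad():
        for p, g in zip(params, meta_grads):
            p -= beta * g
    return meta_loss.item()
```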
Calculating the meta-gradient $\nabla_\theta \sum_{T_i} L_{T_i}(\theta_i', D_{T_i}^{query})$ is the most intricate part of MAML. Since $\theta_i'$ depends on $\theta$ through a gradient update step, applying the chain rule involves differentiating through the inner loop's gradient.

Let's consider a single task $T$ and simplify notation: $L_{supp}(\theta) = L_T(\theta, D_T^{supp})$ and $L_{query}(\theta') = L_T(\theta', D_T^{query})$. The adapted parameters are $\theta' = \theta - \alpha \nabla_\theta L_{supp}(\theta)$. The meta-gradient for this task is:

$$\nabla_\theta L_{query}(\theta') = \nabla_\theta L_{query}\big(\theta - \alpha \nabla_\theta L_{supp}(\theta)\big)$$

Applying the chain rule yields:

$$\nabla_\theta L_{query}(\theta') = \nabla_{\theta'} L_{query}(\theta') \cdot \nabla_\theta\big(\theta - \alpha \nabla_\theta L_{supp}(\theta)\big)$$

$$\nabla_\theta L_{query}(\theta') = \nabla_{\theta'} L_{query}(\theta') \cdot \big(I - \alpha \nabla_\theta^2 L_{supp}(\theta)\big)$$

Here, $\nabla_{\theta'} L_{query}(\theta')$ is the gradient of the query loss with respect to the adapted parameters, evaluated at $\theta'$. The term $\nabla_\theta^2 L_{supp}(\theta)$ is the Hessian matrix of the support set loss with respect to the initial parameters $\theta$.
This calculation requires computing the gradient $\nabla_{\theta'} L_{query}(\theta')$ and, in effect, the Hessian $\nabla_\theta^2 L_{supp}(\theta)$: the query gradient must be multiplied by the matrix $\big(I - \alpha \nabla_\theta^2 L_{supp}(\theta)\big)$. Because this involves second-order derivatives, standard MAML is computationally demanding.
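The chain-rule expression can be checked numerically on a toy problem. The sketch below uses a small quadratic loss and a single parameter vector (all shapes, data, and the value of $\alpha$ are arbitrary assumptions) and compares differentiating through the inner step against explicitly forming $\big(I - \alpha \nabla_\theta^2 L_{supp}(\theta)\big)$ times the query gradient.

```python
import torch

torch.manual_seed(0)
d, alpha = 3, 0.1
theta = torch.randn(d, requires_grad=True)
A_supp, b_supp = torch.randn(d, d), torch.randn(d)     # toy support data
A_query, b_query = torch.randn(d, d), torch.randn(d)   # toy query data

def loss_supp(p):
    return ((A_supp @ p - b_supp) ** 2).mean()

def loss_query(p):
    return ((A_query @ p - b_query) ** 2).mean()

# (1) Meta-gradient by differentiating through the inner update.
g_supp = torch.autograd.grad(loss_supp(theta), theta, create_graph=True)[0]
theta_adapted = theta - alpha * g_supp
meta_grad_autograd = torch.autograd.grad(loss_query(theta_adapted), theta)[0]

# (2) Meta-gradient from the explicit chain rule:
#     (I - alpha * Hessian of the support loss) @ query gradient at theta'.
theta_prime = theta_adapted.detach().requires_grad_(True)
grad_query = torch.autograd.grad(loss_query(theta_prime), theta_prime)[0]
H_supp = torch.autograd.functional.hessian(loss_supp, theta.detach())
meta_grad_explicit = (torch.eye(d) - alpha * H_supp) @ grad_query

print(torch.allclose(meta_grad_autograd, meta_grad_explicit, atol=1e-5))  # should print True
```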
Here is a simplified representation of the MAML algorithm:
Algorithm: MAML
Require: Distribution over tasks $p(T)$
Require: Step sizes $\alpha, \beta$
1: Randomly initialize $\theta$
2: while not done do
3:   Sample a batch of tasks $T_i \sim p(T)$
4:   for each sampled task $T_i$ do
5:     Evaluate $\nabla_\theta L_{T_i}(\theta, D_{T_i}^{supp})$ on the support set
6:     Compute adapted parameters: $\theta_i' = \theta - \alpha \nabla_\theta L_{T_i}(\theta, D_{T_i}^{supp})$
7:   end for
8:   Update $\theta \leftarrow \theta - \beta \nabla_\theta \sum_{T_i} L_{T_i}(\theta_i', D_{T_i}^{query})$ using the query sets
9: end while
The primary computational challenge in MAML lies in the outer loop update (step 8), specifically the calculation of the meta-gradient involving second-order derivatives.
Hessian Computation/Hessian-Vector Products: Explicitly forming the Hessian matrix $\nabla_\theta^2 L_{supp}(\theta)$ is computationally infeasible for deep learning models with millions or billions of parameters, as its size is $d \times d$, where $d$ is the number of parameters. Fortunately, the meta-gradient calculation only requires the product of the Hessian and a vector ($\nabla_{\theta'} L_{query}(\theta')$). This Hessian-vector product (HVP) can often be computed efficiently without forming the full Hessian, typically using finite differences or automatic differentiation techniques (e.g., Pearlmutter's trick, which involves a second backward pass). However, even computing HVPs adds significant computational overhead compared to standard first-order gradient calculations.
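As a sketch of the autodiff approach, an HVP can be obtained with a double backward pass: compute the gradient with the graph retained, take its dot product with the fixed vector, and differentiate again. The function below assumes params and vector are matching lists of tensors and is illustrative rather than a reference implementation.

```python
import torch

def hessian_vector_product(loss_fn, params, vector):
    # Gradient of the loss with the graph retained, so it can be
    # differentiated a second time.
    grads = torch.autograd.grad(loss_fn(params), params, create_graph=True)
    # Dot product of the gradient with the (constant) vector ...
    grad_dot_v = sum((g * v).sum() for g, v in zip(grads, vector))
    # ... and one more backward pass gives H @ vector without forming H.
    return torch.autograd.grad(grad_dot_v, params)
```

The cost is roughly that of one additional backward pass, which is still noticeably more than a plain first-order gradient.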
Memory Usage: Standard implementations using automatic differentiation frameworks require storing the computation graph of the inner loop update(s) to perform the backward pass for the outer loop gradient. This graph includes intermediate activations and gradients, substantially increasing memory requirements, especially when multiple inner loop steps are used or when dealing with large foundation models.
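To make the memory trade-off concrete, the sketch below (reusing the hypothetical task_loss and params names from the earlier snippets) contrasts retaining the inner-loop graph with detaching the adapted parameters, which is essentially the first-order shortcut discussed later.

```python
import torch

def adapt(params, x_supp, y_supp, alpha=0.01, second_order=True):
    # Reuses the hypothetical task_loss from the earlier sketch.
    loss = task_loss(params, x_supp, y_supp)
    if second_order:
        # Keep the inner-loop graph: intermediate activations and gradients
        # stay in memory until the outer backward pass runs.
        grads = torch.autograd.grad(loss, params, create_graph=True)
        return [p - alpha * g for p, g in zip(params, grads)]
    # First-order variant: the inner graph is freed immediately, so memory
    # stays close to that of ordinary training, at the cost of dropping the
    # second-order terms.
    grads = torch.autograd.grad(loss, params)
    return [(p - alpha * g).detach().requires_grad_(True)
            for p, g in zip(params, grads)]
```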
Computational Graph: The overall computation involves a forward and backward pass for the inner loop (per task), a forward pass with the adapted parameters on the query set, and finally a backward pass for the meta-gradient that must itself backpropagate through the inner loop's gradient computation. This nested structure contributes substantially to the overall computational cost.
Figure: Relationship between the meta-parameters $\theta$, the task-adapted parameters $\theta_i'$, the support/query losses, and the gradient flow in MAML. The outer loop optimizes $\theta$ based on query-set performance after adaptation, which requires backpropagating through the inner loop's gradient update step and therefore involves second-order derivatives.
These computational demands motivate the development of approximations like First-Order MAML (FOMAML) and alternative approaches like Implicit MAML (iMAML), which we will examine next. Understanding the exact mechanism and cost of MAML provides the necessary foundation for appreciating these more scalable variants.