Model-Agnostic Meta-Learning (MAML) stands as a prominent gradient-based meta-learning algorithm. Its central premise is to find a set of initial model parameters that are highly amenable to rapid adaptation. Instead of learning parameters that perform well on average across tasks, MAML learns parameters that require only a few gradient updates on a small amount of data from a new task to achieve good performance on that specific task.
Let's formalize this. Consider a distribution of tasks $p(\mathcal{T})$. During meta-training, we sample a batch of tasks $\mathcal{T}_i \sim p(\mathcal{T})$. Each task $\mathcal{T}_i$ is associated with a loss function $\mathcal{L}_{\mathcal{T}_i}$ and typically has a support set $D_i^{support}$ for adaptation and a query set $D_i^{query}$ for evaluating the adapted parameters.
The core idea involves a two-step optimization process:
Inner Loop (Task-Specific Adaptation): For each task $\mathcal{T}_i$, starting from the shared initial parameters $\theta$, we perform one or more gradient descent steps using the task's support set $D_i^{support}$. For a single gradient step with learning rate $\alpha$:

$$\theta'_i = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(\theta; D_i^{support})$$

These adapted parameters $\theta'_i$ are specific to task $\mathcal{T}_i$.
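To make the inner loop concrete, here is a minimal PyTorch sketch of a single adaptation step. The linear model, the adapt function, and the random data are illustrative stand-ins, not part of MAML itself; the important detail is create_graph=True, which keeps the adaptation step differentiable so the outer loop can later backpropagate through it.

```python
import torch

def adapt(theta, support_x, support_y, alpha=0.01):
    """One inner-loop step: theta_prime = theta - alpha * grad of the support loss."""
    # Toy linear model f(x) = x @ theta, used purely for illustration.
    support_loss = torch.nn.functional.mse_loss(support_x @ theta, support_y)
    # create_graph=True keeps this gradient differentiable, so the outer loop
    # can later backpropagate through the adaptation step (second-order MAML).
    (grad,) = torch.autograd.grad(support_loss, theta, create_graph=True)
    return theta - alpha * grad

# Random data standing in for one task's support set.
theta = torch.randn(5, 1, requires_grad=True)      # shared initialization
support_x, support_y = torch.randn(10, 5), torch.randn(10, 1)
theta_prime = adapt(theta, support_x, support_y)   # task-specific parameters
```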
Outer Loop (Meta-Optimization): The goal is to update the initial parameters $\theta$ to minimize the expected loss across tasks after adaptation. The meta-objective function is the sum (or average) of the losses computed using the adapted parameters $\theta'_i$ on their respective query sets $D_i^{query}$:

$$\min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(\theta'_i; D_i^{query})$$

Substituting the expression for $\theta'_i$ from the inner loop, the objective becomes:

$$\min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\left(\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(\theta; D_i^{support});\, D_i^{query}\right)$$

The meta-parameters $\theta$ are updated using gradient descent on this meta-objective, typically with a meta-learning rate $\beta$:

$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(\theta'_i; D_i^{query})$$
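The outer loop can be sketched in the same style: accumulate the query losses of the adapted parameters over a batch of tasks, then take one gradient step on $\theta$. As before, the toy linear model and random data are placeholders; alpha and beta play the roles of the inner and meta learning rates above.

```python
import torch

alpha, beta = 0.01, 0.001                        # inner and meta learning rates
theta = torch.randn(5, 1, requires_grad=True)    # shared initialization

def mse(params, x, y):
    # Toy linear model; a real implementation would use an arbitrary network.
    return torch.nn.functional.mse_loss(x @ params, y)

# A batch of toy tasks: (support_x, support_y, query_x, query_y).
tasks = [(torch.randn(10, 5), torch.randn(10, 1),
          torch.randn(10, 5), torch.randn(10, 1)) for _ in range(4)]

meta_loss = 0.0
for sx, sy, qx, qy in tasks:
    (grad,) = torch.autograd.grad(mse(theta, sx, sy), theta, create_graph=True)
    theta_prime = theta - alpha * grad                 # inner-loop adaptation
    meta_loss = meta_loss + mse(theta_prime, qx, qy)   # query loss at adapted params

# The meta-gradient flows back through each inner update into theta.
(meta_grad,) = torch.autograd.grad(meta_loss, theta)
with torch.no_grad():
    theta -= beta * meta_grad                          # one outer-loop update
```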
Calculating the meta-gradient $\nabla_\theta \sum_{\mathcal{T}_i} \mathcal{L}_{\mathcal{T}_i}(\theta'_i; D_i^{query})$ is the most intricate part of MAML. Since $\theta'_i$ depends on $\theta$ through a gradient update step, applying the chain rule involves differentiating the inner loop's gradient.
Let's consider a single task and simplify notation: $\mathcal{L}_S(\theta) = \mathcal{L}_{\mathcal{T}_i}(\theta; D_i^{support})$ and $\mathcal{L}_Q(\theta'_i) = \mathcal{L}_{\mathcal{T}_i}(\theta'_i; D_i^{query})$. The adapted parameters are $\theta'_i = \theta - \alpha \nabla_\theta \mathcal{L}_S(\theta)$. The meta-gradient for this task is:

$$\nabla_\theta \mathcal{L}_Q(\theta'_i)$$

Applying the chain rule yields:

$$\nabla_\theta \mathcal{L}_Q(\theta'_i) = \left(\frac{\partial \theta'_i}{\partial \theta}\right)^{\top} \nabla_{\theta'_i} \mathcal{L}_Q(\theta'_i) = \left(I - \alpha \nabla^2_\theta \mathcal{L}_S(\theta)\right) \nabla_{\theta'_i} \mathcal{L}_Q(\theta'_i)$$

Here, $\nabla_{\theta'_i} \mathcal{L}_Q(\theta'_i)$ is the gradient of the query loss with respect to the adapted parameters, evaluated at $\theta'_i$. The term $\nabla^2_\theta \mathcal{L}_S(\theta)$ is the Hessian matrix of the support set loss with respect to the initial parameters $\theta$; because the Hessian is symmetric, the transpose of the Jacobian $\partial \theta'_i / \partial \theta = I - \alpha \nabla^2_\theta \mathcal{L}_S(\theta)$ can be dropped in the expression above.
This calculation requires computing the query gradient $\nabla_{\theta'_i} \mathcal{L}_Q(\theta'_i)$, the Hessian $\nabla^2_\theta \mathcal{L}_S(\theta)$, and the product of the matrix $\left(I - \alpha \nabla^2_\theta \mathcal{L}_S(\theta)\right)$ with that gradient vector. Because it involves second-order derivatives, standard MAML is computationally demanding.
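For a model small enough to materialize the Hessian, the chain-rule expression can be checked directly against automatic differentiation. The sketch below uses a toy linear model with random data (all names are illustrative): it computes the meta-gradient once by backpropagating through the inner update and once by explicitly forming $\left(I - \alpha \nabla^2_\theta \mathcal{L}_S(\theta)\right) \nabla_{\theta'_i} \mathcal{L}_Q(\theta'_i)$; the two should agree up to floating-point tolerance.

```python
import torch

torch.manual_seed(0)
alpha = 0.01
theta = torch.randn(5, requires_grad=True)
sx, sy = torch.randn(10, 5), torch.randn(10)       # toy support set
qx, qy = torch.randn(10, 5), torch.randn(10)       # toy query set

def support_loss(p):
    return torch.nn.functional.mse_loss(sx @ p, sy)

def query_loss(p):
    return torch.nn.functional.mse_loss(qx @ p, qy)

# Meta-gradient obtained by differentiating through the inner update.
(g,) = torch.autograd.grad(support_loss(theta), theta, create_graph=True)
theta_prime = theta - alpha * g
(auto_meta_grad,) = torch.autograd.grad(query_loss(theta_prime), theta)

# The same quantity assembled from the chain-rule expression:
# (I - alpha * Hessian_support(theta)) @ grad_query(theta_prime).
theta_prime_ind = theta_prime.detach().requires_grad_(True)
(v,) = torch.autograd.grad(query_loss(theta_prime_ind), theta_prime_ind)
H = torch.autograd.functional.hessian(support_loss, theta)
manual_meta_grad = (torch.eye(5) - alpha * H) @ v

print(torch.allclose(auto_meta_grad, manual_meta_grad, atol=1e-5))  # True
```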
Here is a simplified representation of the MAML algorithm:
Algorithm: MAML
Require: distribution over tasks $p(\mathcal{T})$
Require: step sizes $\alpha$, $\beta$
1: Randomly initialize $\theta$
2: while not done do
3:   Sample a batch of tasks $\mathcal{T}_i \sim p(\mathcal{T})$
4:   for each task $\mathcal{T}_i$ do
5:     Evaluate $\nabla_\theta \mathcal{L}_{\mathcal{T}_i}(\theta; D_i^{support})$ on the support set
6:     Compute adapted parameters: $\theta'_i = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(\theta; D_i^{support})$
7:   end for
8:   Update $\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i} \mathcal{L}_{\mathcal{T}_i}(\theta'_i; D_i^{query})$
9: end while
The primary computational challenge in MAML lies in the outer loop update (step 8), specifically the calculation of the meta-gradient involving second-order derivatives.
Hessian Computation/Hessian-Vector Products: Explicitly forming the Hessian matrix $\nabla^2_\theta \mathcal{L}_S(\theta)$ is computationally infeasible for deep learning models with millions or billions of parameters, as its size is $N \times N$, where $N$ is the number of parameters. Fortunately, the meta-gradient calculation only requires the product of the Hessian and a vector ($\nabla^2_\theta \mathcal{L}_S(\theta)\, \nabla_{\theta'_i} \mathcal{L}_Q(\theta'_i)$). This Hessian-vector product (HVP) can often be computed efficiently without forming the full Hessian, typically using finite differences or automatic differentiation techniques (e.g., Pearlmutter's trick involving a second backward pass). However, even computing HVPs adds significant computational overhead compared to standard first-order gradient calculations.
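As a concrete illustration of the automatic-differentiation route, the sketch below multiplies the Hessian of a toy support loss by an arbitrary vector via a double backward pass, without ever materializing the $N \times N$ matrix. The loss, dimensions, and vector here are placeholders; in MAML the vector would be the query gradient $\nabla_{\theta'_i} \mathcal{L}_Q(\theta'_i)$.

```python
import torch

torch.manual_seed(0)
n = 1000                                       # number of parameters (kept small here)
theta = torch.randn(n, requires_grad=True)
A, b = torch.randn(200, n), torch.randn(200)   # placeholder data for a toy loss

def support_loss(p):
    return torch.nn.functional.mse_loss(A @ p, b)

v = torch.randn(n)   # vector to multiply by the Hessian (in MAML: the query gradient)

# Hessian-vector product via double backward: differentiate the scalar g @ v
# instead of materializing the n x n Hessian.
(g,) = torch.autograd.grad(support_loss(theta), theta, create_graph=True)
(hvp,) = torch.autograd.grad(g @ v, theta)

print(hvp.shape)     # torch.Size([1000]); memory stays O(n) rather than O(n^2)
```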
Memory Usage: Standard implementations using automatic differentiation frameworks require storing the computation graph of the inner loop update(s) to perform the backward pass for the outer loop gradient. This graph includes intermediate activations and gradients, substantially increasing memory requirements, especially when multiple inner loop steps are used or when dealing with large foundation models.
Computational Graph: The overall computation involves a forward pass and backward pass for the inner loop (per task), followed by a forward pass using the adapted parameters on the query set, and finally a backward pass for the meta-gradient calculation which itself involves computation related to the inner loop's gradient. This nested structure contributes to the overall computational cost.
Figure: Relationship between meta-parameters (θ), task-adapted parameters (θ'_i), support/query losses, and gradient flow in MAML. The outer loop optimizes θ based on query set performance after adaptation, requiring backpropagation through the inner loop's gradient update step (red arrow), which involves second-order derivatives.
These computational demands motivate the development of approximations like First-Order MAML (FOMAML) and alternative approaches like Implicit MAML (iMAML), which we will examine next. Understanding the exact mechanism and cost of MAML provides the necessary foundation for appreciating these more scalable variants.