Understanding the meta-learning problem structure, with its tasks, support sets (Si), and query sets (Qi), is the first step. Now we need a way to organize the diverse array of algorithms designed to solve this problem. Meta-learning algorithms are typically categorized into three main families based on their core mechanism for enabling fast adaptation. These categories provide a useful framework, although some algorithms may incorporate elements from more than one. The primary perspectives are:

- Gradient-based methods (learning initializations): find initial parameters from which a few gradient steps on new task data yield strong performance.
- Metric-based methods (learning embeddings): learn an embedding space in which query examples are classified by comparison to support examples.
- Optimization-based methods (learning the optimization process): learn the update rule, learning rates, or other aspects of the optimization procedure itself.
Let's examine each category in more detail.
Figure: A categorization of common meta-learning approaches based on their primary mechanism for achieving fast adaptation.
Algorithms in this family aim to find a set of initial model parameters θ that are highly sensitive to task-specific updates, so that a small number of gradient steps on new task data produces a large improvement. The meta-objective is typically to minimize the loss on the query sets Qi after a small number of gradient updates have been performed using the corresponding support sets Si.
The most prominent example is Model-Agnostic Meta-Learning (MAML). In MAML, for each task Ti sampled during meta-training, an inner loop performs one or more standard gradient descent steps on the support set Si to obtain task-adapted parameters θi′.
$$\theta_i' = \theta - \alpha \nabla_{\theta} \mathcal{L}_{S_i}(\theta)$$

The outer loop then updates the initial parameters θ by differentiating through this inner update step, using the performance on the query set Qi. This requires computing second-order derivatives (gradients of gradients), often referred to as the meta-gradient.
$$\theta \leftarrow \theta - \beta \, \nabla_{\theta} \sum_{T_i} \mathcal{L}_{Q_i}(\theta_i') = \theta - \beta \sum_{T_i} \nabla_{\theta} \mathcal{L}_{Q_i}\big(\theta - \alpha \nabla_{\theta} \mathcal{L}_{S_i}(\theta)\big)$$

Variants like First-Order MAML (FOMAML) and Reptile simplify this by ignoring second-order terms, significantly reducing computational cost but potentially altering the optimization dynamics. Implicit MAML (iMAML) uses implicit differentiation to compute meta-gradients more stably and efficiently, particularly for many inner steps.
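To make the inner/outer loop concrete, here is a minimal numeric sketch of one MAML meta-update on toy 1-D quadratic tasks. Everything here (the `maml_step` helper, the task targets, the quadratic loss) is an illustrative assumption, not a reference implementation; for this loss the second-order meta-gradient happens to have a simple closed form, which real implementations obtain by differentiating through the inner update.

```python
def maml_step(theta, task_targets, alpha=0.1, beta=0.01):
    """One MAML meta-update on toy 1-D tasks with loss
    L_i(theta) = 0.5 * (theta - t_i)**2, whose gradient is (theta - t_i).

    For this quadratic loss, differentiating the query loss through the
    inner update gives the closed-form meta-gradient used below.
    """
    meta_grad = 0.0
    for t_i in task_targets:
        inner_grad = theta - t_i                    # gradient on support set
        theta_adapted = theta - alpha * inner_grad  # inner-loop step
        # chain rule through the inner update: d(theta_adapted)/d(theta) = 1 - alpha
        meta_grad += (1.0 - alpha) * (theta_adapted - t_i)
    return theta - beta * meta_grad                 # outer-loop step

theta = 0.0
task_targets = [1.0, -1.0, 3.0]  # each task's optimal parameter value
for _ in range(500):
    theta = maml_step(theta, task_targets)
# theta converges toward 1.0, the initialization closest on average
# to all task optima after one inner adaptation step.
```

Because every task loss is quadratic, the meta-objective is also quadratic and the learned initialization settles at the mean of the task optima; with non-quadratic losses the initialization can differ substantially from that mean.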
For foundation models, the high dimensionality of θ makes computing and storing full second-order gradients extremely expensive. This motivates the use of first-order approximations or techniques specifically designed for scale, which we will examine in Chapter 2.
Instead of optimizing parameters for gradient updates, metric-based methods learn an embedding function fϕ that maps inputs into a space where similarity corresponds to class membership or task relevance. Adaptation typically involves comparing query examples to support examples in this learned space using a distance metric (e.g., Euclidean distance) or a learned similarity function.
Prototypical Networks are a well-known example. They compute a single "prototype" representation ck for each class k present in the support set Si by averaging the embeddings of its examples:

$$c_k = \frac{1}{|S_{i,k}|} \sum_{x_j \in S_{i,k}} f_{\phi}(x_j)$$

Classification of a query point xq is then done via a softmax over distances to the prototypes:
$$p(y = k \mid x_q) = \frac{\exp\big(-d(f_{\phi}(x_q), c_k)\big)}{\sum_{k'} \exp\big(-d(f_{\phi}(x_q), c_{k'})\big)}$$

Other notable approaches include Matching Networks, which use attention mechanisms to compute weighted combinations of support set examples for query prediction, and Relation Networks, which employ a separate neural network module to learn a non-linear similarity score between query and support example embeddings.
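The prototype computation and softmax-over-distances classification can be sketched in a few lines of NumPy. This is a toy 2-way, 2-shot episode; the identity embedding f_phi(x) = x and the data values are assumptions for illustration only — in practice f_phi is a trained (or pre-trained) network.

```python
import numpy as np

def prototypes(support_emb, support_labels, num_classes):
    """Class prototype c_k = mean embedding of class k's support examples."""
    return np.stack([support_emb[support_labels == k].mean(axis=0)
                     for k in range(num_classes)])

def classify(query_emb, protos):
    """Softmax over negative squared Euclidean distances to prototypes."""
    d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# Toy episode with the identity embedding f_phi(x) = x (an assumption
# for illustration; normally these are network outputs).
support = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [1.8, 2.1]])
labels = np.array([0, 0, 1, 1])
query = np.array([[0.1, 0.0], [1.9, 2.0]])

probs = classify(query, prototypes(support, labels, 2))
print(probs.argmax(axis=1))  # [0 1]
```

Note that adaptation here involves no gradient steps at all: a new task is handled simply by computing fresh prototypes from its support set, which is why metric-based methods are cheap at test time.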
These methods often rely heavily on the quality of the learned embedding function fϕ. When working with foundation models, one might leverage the powerful pre-trained representations directly or fine-tune the embedding function during meta-training. Adapting these techniques to extremely high-dimensional embeddings from foundation models presents unique challenges, explored further in Chapter 3.
This perspective views meta-learning more broadly as learning aspects of the optimization process itself. Instead of just learning an initial parameter set θ, these methods might learn the update rule, the learning rates, or other optimization hyperparameters.
One line of work is Learning to Optimize (L2O), where a separate neural network (the meta-learner, often an RNN such as an LSTM) is trained to output the parameter updates for the base model (the learner). The meta-learner observes the gradients and state of the learner and proposes updates Δθt intended to minimize the task loss rapidly.
$$\theta_{t+1} = \theta_t + \Delta\theta_t, \quad \text{where} \quad \Delta\theta_t = g_{\psi}\big(\nabla_{\theta_t} \mathcal{L}(\theta_t), h_t\big)$$

Here, gψ is the meta-learner parameterized by ψ, and ht is its internal state. The parameters ψ are trained across many tasks to produce efficient optimization trajectories.
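The following sketch collapses gψ down to a single learned scalar step size (Δθt = −ψ·∇L) and trains ψ by unrolling the learner on toy quadratic tasks; real L2O methods use an RNN with state ht and backpropagate through the unrolled trajectory, whereas this toy uses finite-difference meta-gradients. All names and values here are illustrative assumptions.

```python
def unrolled_loss(psi, target, theta0=0.0, steps=3):
    """Run the learner for a few steps with the learned update rule
    delta = -psi * grad, then return the final task loss."""
    theta = theta0
    for _ in range(steps):
        grad = theta - target           # gradient of 0.5 * (theta - target)**2
        theta = theta + (-psi * grad)   # update proposed by the "meta-learner"
    return 0.5 * (theta - target) ** 2

# Meta-train psi across tasks with finite-difference meta-gradients,
# a crude stand-in for backpropagating through the unrolled trajectory.
psi, lr, eps = 0.05, 0.05, 1e-4
targets = [1.0, -2.0, 0.5]  # each task's optimal parameter value
for _ in range(200):
    g = sum((unrolled_loss(psi + eps, t) - unrolled_loss(psi - eps, t))
            / (2 * eps) for t in targets)
    psi -= lr * g
# psi grows toward 1.0, the step size that solves these quadratic
# tasks in a single update.
```

The point of the exercise is that the meta-objective scores the *optimizer* (here, ψ) by the task loss reached after adaptation, exactly the bilevel pattern discussed next.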
Furthermore, many meta-learning algorithms, including MAML, can be formally understood through the lens of Bilevel Optimization. The outer optimization adjusts meta-parameters (like initial weights θ in MAML, or the meta-learner ψ in L2O) to minimize an outer objective (e.g., query set loss after adaptation). The inner optimization adapts parameters to a specific task (e.g., minimizing support set loss). This viewpoint, discussed in Chapter 4, provides a powerful framework for analyzing and developing meta-learning algorithms. Learning optimal initializations, while often associated with gradient-based methods, is also fundamentally an optimization problem addressed within this framework.
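The bilevel structure described above can be written generically as follows, where φ denotes the meta-parameters and θi*(φ) the inner solution for task Ti (notation chosen here for illustration):

```latex
\min_{\phi} \; \sum_{T_i} \mathcal{L}_{Q_i}\big(\theta_i^*(\phi)\big)
\quad \text{subject to} \quad
\theta_i^*(\phi) = \arg\min_{\theta} \; \mathcal{L}_{S_i}(\theta; \phi)
```

In MAML, φ is the initialization θ and the inner arg min is approximated by a few gradient steps; in L2O, φ corresponds to ψ and parameterizes the update rule that produces θi*.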
Each category has strengths and weaknesses. Gradient-based methods directly optimize for adaptation performance via gradients but can be computationally intensive and sensitive to optimization details. Metric-based methods are often simpler to implement and computationally cheaper at test time but rely heavily on the quality of the embedding space and the chosen metric. Optimization-based methods offer flexibility in tailoring the learning process but can be complex to train and analyze.
The choice often depends on the specific problem, the nature of the tasks, computational budget, and whether the goal is primarily fast adaptation of existing parameters or learning effective representations from scratch. In the context of foundation models, scalability and efficient use of pre-trained knowledge become primary considerations, influencing the suitability of different approaches and motivating hybrid strategies or specialized adaptation techniques, as we will see in later chapters. Understanding these fundamental categories is essential for navigating the advanced techniques required for adapting large-scale models effectively.
© 2025 ApX Machine Learning