Meta-learning fundamentally shifts the learning objective from mastering a single task to acquiring the ability to learn new tasks rapidly and efficiently. Unlike traditional supervised learning where a model is trained on a large dataset for one specific job, meta-learning operates on a distribution of tasks. The objective is to extract transferable knowledge or learning strategies that accelerate adaptation when faced with a novel task, particularly when data for that new task is scarce. This is often framed as "learning to learn."
At the heart of the meta-learning problem lies the concept of a task. A task T represents a specific learning problem, such as classifying a new set of images, translating between a novel pair of languages, or adapting a language model to a unique writing style. In the meta-learning framework, we assume tasks are drawn from an underlying probability distribution p(T). This distribution defines the universe of problems the meta-learning algorithm is expected to handle.
The meta-learning process typically involves two phases:

1. Meta-training: the model is exposed to many tasks sampled from p(T), updating its meta-parameters so that adaptation to each individual task becomes fast and data-efficient.
2. Meta-testing: the trained meta-learner is evaluated on held-out tasks, also drawn from p(T), that were never seen during meta-training.
Crucially, each individual task Ti within the meta-training or meta-testing set is itself structured as a small learning problem. It comprises two distinct subsets of data:

- A support set Si: a small collection of labeled examples used to adapt the model to the task.
- A query set Qi: held-out examples from the same task, used to evaluate how well the adaptation worked.
This division into support and query sets within each task is fundamental. It simulates the real-world few-shot scenario during meta-training: the model must learn from the support set how to perform well on the query set for that specific task.
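The support/query structure of a task can be made concrete with a short sketch. The `Task` container and `make_task` helper below are illustrative names, not from the text; the function builds an N-way K-shot task by taking the first K examples of each class as support and the remainder as query.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# An example is a (features, label) pair; the representation is an assumption.
Example = Tuple[list, int]

@dataclass
class Task:
    support: List[Example]  # Si: few labeled examples used for adaptation
    query: List[Example]    # Qi: held-out examples used to evaluate adaptation

def make_task(examples_by_class: Dict[int, list],
              n_way: int = 5, k_shot: int = 1) -> Task:
    """Build an N-way K-shot task: K support examples per class,
    with each class's remaining examples going to the query set."""
    support, query = [], []
    for label in list(examples_by_class)[:n_way]:
        examples = examples_by_class[label]
        support.extend((x, label) for x in examples[:k_shot])
        query.extend((x, label) for x in examples[k_shot:])
    return Task(support=support, query=query)
```

A 5-way 1-shot task built from classes with three examples each thus has five support examples (one per class) and ten query examples.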
Figure: Data flow within a single step of the meta-training phase. A task Ti is sampled and split into support (Si) and query (Qi) sets. The meta-model parameters (θ) are adapted using Si to yield task-specific parameters (ϕi). Performance is then evaluated on Qi, and the resulting loss informs the update to the meta-parameters θ.
Let θ represent the parameters of our meta-learned model or the parameters defining our learning procedure (e.g., initial weights of a neural network, parameters of an optimizer). The process of adapting these general parameters θ to task-specific parameters ϕi using the support set Si can be denoted by a function or algorithm Adapt. So, ϕi=Adapt(θ,Si).
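One concrete choice of Adapt is a few steps of gradient descent on the support set. The sketch below uses linear regression with a mean-squared-error loss purely for illustration; the model, learning rate, and step count are assumptions, not prescribed by the text. Note that it returns the task-specific parameters ϕi while leaving the meta-parameters θ untouched.

```python
import numpy as np

def adapt(theta: np.ndarray, support, lr: float = 0.1, steps: int = 5) -> np.ndarray:
    """One possible Adapt(θ, Si): a few gradient-descent steps on the
    support set, here for linear regression under an MSE loss
    (an illustrative sketch, not a specific published algorithm)."""
    X, y = support                     # X: (n, d) features, y: (n,) targets
    phi = theta.copy()                 # start from the meta-parameters θ
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ phi - y) / len(y)  # gradient of MSE w.r.t. phi
        phi -= lr * grad               # move toward task-specific ϕi
    return phi
```

Because `adapt` copies θ before updating, the same meta-parameters can be re-used as the starting point for every task in a batch.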
The ultimate goal of meta-training is to find the optimal meta-parameters θ∗ that minimize the expected loss on the query sets Qi across the distribution of tasks p(T), after adaptation using the corresponding support sets Si. If L(Qi,ϕi) represents the loss (e.g., cross-entropy, mean squared error) of the adapted model ϕi on the query set Qi, the meta-objective can be formally stated as:
θ∗ = arg min_θ E_{Ti∼p(T)} [L(Qi, ϕi)] = arg min_θ E_{Ti∼p(T)} [L(Qi, Adapt(θ, Si))]

In practice, this expectation is approximated by averaging the query-set loss over a batch of tasks sampled during each meta-training iteration.
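The batch approximation of this expectation is straightforward to express in code. In the sketch below, `adapt_fn` and `loss_fn` are caller-supplied placeholders standing in for Adapt and L; the function name itself is an assumption for illustration.

```python
import numpy as np

def meta_objective_estimate(theta, task_batch, adapt_fn, loss_fn) -> float:
    """Monte Carlo estimate of E_{Ti∼p(T)}[L(Qi, Adapt(θ, Si))]:
    adapt on each support set, evaluate on the matching query set,
    and average the resulting losses over the batch of tasks."""
    query_losses = []
    for support, query in task_batch:         # each task arrives split as (Si, Qi)
        phi = adapt_fn(theta, support)        # ϕi = Adapt(θ, Si)
        query_losses.append(loss_fn(phi, query))  # L(Qi, ϕi)
    return float(np.mean(query_losses))
```

Minimizing this estimate with respect to θ (e.g., by backpropagating through `adapt_fn`, or with a first-order approximation) is what a single meta-training iteration does.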
This formulation applies directly when working with large foundation models. Here, θ represents the potentially massive set of parameters of the foundation model (e.g., a Transformer). The Adapt function could be, for example:

- A few steps of gradient-based fine-tuning of the model's weights, or of a small parameter-efficient subset of them, on the support set Si.
- In-context learning, where the support examples are placed in the model's prompt and θ is not modified at all.
- A learned module that maps the support set directly to task-specific parameters ϕi.

Regardless of the specific Adapt mechanism, the meta-learning goal remains consistent: find initial parameters θ (or a way to generate them) such that the model performs well on the query set Qi after seeing only the small support set Si. The challenge lies in efficiently performing the meta-optimization (finding θ∗) and the adaptation (Adapt(θ,Si)) given the enormous scale of θ in foundation models, a central theme explored throughout this course. Understanding this core problem structure is essential before examining specific algorithms designed to solve it.
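For foundation models specifically, the cheapest Adapt mechanism modifies no parameters at all: the support set is serialized into the model's prompt. The prompt format and function name below are illustrative assumptions, and the actual model call is omitted.

```python
from typing import List, Tuple

def adapt_in_context(support: List[Tuple[str, str]], query_input: str) -> str:
    """Adapt(θ, Si) via in-context learning: θ is never updated; instead,
    the support examples condition the model through its prompt.
    (The "Input/Output" template here is an assumed, illustrative format.)"""
    lines = [f"Input: {x} -> Output: {y}" for x, y in support]
    lines.append(f"Input: {query_input} -> Output:")
    return "\n".join(lines)
```

Feeding the returned string to a frozen language model and reading its completion plays the role of evaluating the adapted model on a query example.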
© 2025 ApX Machine Learning