While static data mixtures provide a fixed recipe for training, large language models might benefit from a more dynamic approach to data presentation, akin to how a curriculum guides learning. Instead of feeding the model the same proportion of data sources throughout the entire training run, we can dynamically adjust the mixture or the sampling process itself over time. This is the core idea behind data pacing and annealing schedules. These techniques aim to optimize the learning process by controlling when and how different parts of the dataset are emphasized.
Data pacing refers to the strategy of controlling the rate at which the model is exposed to different subsets or types of data during training. Think of it as managing the flow of information based on the model's learning progress. For example, you might start training primarily on a high-quality, curated corpus to establish strong foundational language understanding. As training progresses and the model becomes more capable, you can gradually increase the proportion of data from noisier sources like web crawls or specialized domains like code repositories.
The pacing can be based on various criteria, such as the current training step or epoch, the model's measured learning progress, or the quality and difficulty of the data itself.
Implementing data pacing often involves creating a schedule function that modifies the sampling weights or probabilities associated with different data sources based on the current training step or epoch.
# Example data sources and their initial sampling weights
data_sources = {
    "curated_corpus": {"weight": 0.7, "path": "/path/to/curated"},
    "web_crawl": {"weight": 0.3, "path": "/path/to/web"},
    # Add more sources as needed
}

# Total training steps
total_steps = 1_000_000

def get_pacing_weights(current_step, total_steps):
    """Calculate dynamic sampling weights from training progress.

    Linearly shifts probability mass from the curated corpus to the web
    crawl, never letting the curated share fall below 10%.
    """
    progress = current_step / total_steps
    curated_weight = max(0.1, 0.7 - 0.6 * progress)  # 0.7 -> 0.1 linearly
    web_weight = 1.0 - curated_weight                # only two sources here
    new_weights = {
        "curated_corpus": curated_weight,
        "web_crawl": web_weight,
    }
    # Normalize so the weights always sum to 1; this matters once more
    # sources are added to the mixture.
    total_weight = sum(new_weights.values())
    return {k: v / total_weight for k, v in new_weights.items()}

# --- Inside the training loop ---
# current_training_step = ... get current step ...
# current_weights = get_pacing_weights(current_training_step, total_steps)
# Reconfigure the data sampler/loader with these new weights
# sampler.set_weights(current_weights)
# batch = next(data_loader)
# ... rest of training step ...
This snippet illustrates a simple linear pacing schedule. More complex functions (e.g., exponential, step-based) can be designed depending on the desired learning trajectory. The important part is dynamically adjusting the likelihood of sampling from each source as training advances.
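For instance, a step-based schedule could hold the mixture fixed within a training phase and switch it at predefined boundaries. The sketch below illustrates this idea; the phase boundaries and weight values are arbitrary placeholders, not recommendations.

def get_step_pacing_weights(current_step, total_steps):
    """Step-based pacing: switch the mixture at fixed fractions of training.

    The phase boundaries and weights are illustrative placeholders.
    """
    progress = current_step / total_steps
    if progress < 0.5:
        # Early phase: lean heavily on the curated corpus.
        weights = {"curated_corpus": 0.8, "web_crawl": 0.2}
    elif progress < 0.9:
        # Middle phase: shift emphasis towards web data.
        weights = {"curated_corpus": 0.4, "web_crawl": 0.6}
    else:
        # Late phase: settle on a roughly even mixture.
        weights = {"curated_corpus": 0.5, "web_crawl": 0.5}
    # Normalize so the result is always a valid probability distribution.
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}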
Annealing, in the context of data sampling, typically refers to gradually changing a parameter that controls the shape or randomness of the sampling distribution over the data sources. This is closely related to the temperature-based sampling discussed previously.
Recall that the temperature T modifies the probabilities p_i derived from the weights w_i:

$$
p_i = \frac{\exp(w_i / T)}{\sum_j \exp(w_j / T)}
$$

An annealing schedule might involve starting with a higher temperature T > 1 and gradually reducing it towards T = 1 (or even slightly lower) over the course of training.
Annealing schedules can be applied to the temperature parameter or directly to the weights themselves (e.g., gradually increasing the difference between high and low weights). Common schedules include linear decay, cosine decay, or exponential decay of the temperature parameter or a similar modulation factor applied to the weights.
Example temperature annealing schedules over 1 million training steps, starting from T=2.0 and decaying towards T=1.0 using linear and cosine functions.
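As a concrete sketch of these schedules, the helpers below implement the linear and cosine temperature decays just described (from T = 2.0 towards T = 1.0 over one million steps) and apply the softmax formula from above to turn source weights into sampling probabilities. The function names and default values are illustrative assumptions, not part of any particular library.

import math

def linear_temperature(step, total_steps, t_start=2.0, t_end=1.0):
    """Linear decay of the sampling temperature from t_start to t_end."""
    progress = min(step / total_steps, 1.0)
    return t_start + (t_end - t_start) * progress

def cosine_temperature(step, total_steps, t_start=2.0, t_end=1.0):
    """Cosine decay of the sampling temperature from t_start to t_end."""
    progress = min(step / total_steps, 1.0)
    return t_end + 0.5 * (t_start - t_end) * (1 + math.cos(math.pi * progress))

def temperature_probs(weights, temperature):
    """Softmax over source weights: p_i = exp(w_i / T) / sum_j exp(w_j / T)."""
    exps = {k: math.exp(w / temperature) for k, w in weights.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

# Example: probabilities become more peaked as T anneals from 2.0 to 1.0.
weights = {"curated_corpus": 0.7, "web_crawl": 0.3}
for step in (0, 500_000, 1_000_000):
    t = cosine_temperature(step, total_steps=1_000_000)
    print(step, round(t, 3), temperature_probs(weights, t))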
Data pacing and annealing are not mutually exclusive; you can combine them to create sophisticated data feeding strategies. For instance, you might pace the mixture weights towards more specialized sources over time while simultaneously annealing the sampling temperature from a high value towards T = 1, so that early training samples broadly across sources and later training follows the target mixture more closely.
This combination allows fine-grained control over both what data the model sees and how strictly it adheres to the prescribed mixture at different training stages.
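A minimal sketch of such a combination, reusing the hypothetical get_pacing_weights, cosine_temperature, and temperature_probs helpers defined in the earlier snippets:

def get_sampling_probs(current_step, total_steps):
    """Combine pacing (which mixture to target) with annealing (how strictly
    to follow it) into a single set of sampling probabilities."""
    # Pacing: shift the target mixture as training progresses.
    paced_weights = get_pacing_weights(current_step, total_steps)
    # Annealing: start near-uniform (high T) and tighten towards the paced
    # mixture as T decays towards 1.
    temperature = cosine_temperature(current_step, total_steps,
                                     t_start=2.0, t_end=1.0)
    return temperature_probs(paced_weights, temperature)

# --- Inside the training loop (illustrative) ---
# probs = get_sampling_probs(current_training_step, total_steps)
# sampler.set_weights(probs)  # hypothetical sampler API, as before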
Implementing dynamic sampling schedules adds complexity to the training pipeline. Samplers such as torch.utils.data.WeightedRandomSampler can be adapted, or custom iterators managing multiple datasets might be needed.
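A minimal sketch of such a custom iterator is shown below. It assumes each source already exposes an infinite (cycling) batch iterator; the class name and methods are hypothetical, not a standard PyTorch API.

import random

class PacedMultiSourceLoader:
    """Draw each batch from one source, chosen according to the current
    sampling probabilities. Each value in `loaders` is assumed to be an
    infinite (cycling) iterator of batches for that source."""

    def __init__(self, loaders, initial_probs):
        self.loaders = loaders            # e.g. {"curated_corpus": iter(...), ...}
        self.probs = dict(initial_probs)  # e.g. {"curated_corpus": 0.7, ...}

    def set_probs(self, new_probs):
        """Called from the training loop whenever the schedule updates."""
        self.probs = dict(new_probs)

    def next_batch(self):
        names = list(self.probs.keys())
        weights = [self.probs[n] for n in names]
        source = random.choices(names, weights=weights, k=1)[0]
        return next(self.loaders[source])

# --- Usage sketch inside the training loop ---
# loader = PacedMultiSourceLoader(per_source_iterators, initial_probs)
# for step in range(total_steps):
#     loader.set_probs(get_sampling_probs(step, total_steps))
#     batch = loader.next_batch()
#     ... forward pass, backward pass, optimizer step ...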
While more complex than static sampling, data pacing and annealing schedules offer powerful tools for guiding the learning process of large language models. By thoughtfully controlling the data diet over time, you can potentially accelerate convergence, improve robustness, and better shape the final capabilities of the model to meet specific requirements.