Full parameter fine-tuning, while powerful, is sensitive to its configuration. Selecting appropriate hyperparameters is fundamental for achieving good performance without wasting significant computational resources or time. Unlike pre-training, where hyperparameter regimes are often established through massive trial-and-error, fine-tuning requires careful adjustment based on the specific task, dataset size, and model architecture. Let's examine the most influential hyperparameters and strategies for tuning them effectively.
The learning rate is arguably the single most important hyperparameter. It determines the step size taken during gradient descent to update the model weights $\theta$:

$$\theta_{\text{new}} = \theta_{\text{old}} - \eta \nabla L(\theta_{\text{old}})$$

where $\nabla L(\theta_{\text{old}})$ is the gradient of the loss function $L$ with respect to the old weights and $\eta$ is the learning rate.
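As a minimal, self-contained sketch of this update rule in PyTorch (the model, data, and learning rate here are toy placeholders, not a recommended setup):

```python
import torch

# Toy model and batch; stand-ins for a real pre-trained model and task data.
model = torch.nn.Linear(10, 2)
inputs, targets = torch.randn(8, 10), torch.randint(0, 2, (8,))
loss_fn = torch.nn.CrossEntropyLoss()
eta = 2e-5  # learning rate, within the typical fine-tuning range

loss = loss_fn(model(inputs), targets)
loss.backward()  # populates p.grad with the gradient of the loss

with torch.no_grad():
    for p in model.parameters():
        p -= eta * p.grad  # theta_new = theta_old - eta * grad
        p.grad = None      # clear gradients for the next step
```

In practice an optimizer object performs this step (adding momentum and adaptive scaling), but the core update is exactly this subtraction.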
For full fine-tuning of large pre-trained models, the learning rate is typically set much lower than rates used during pre-training. This is because we want to adapt the existing knowledge gently, preserving the powerful representations learned during pre-training while specializing the model for the new task. Common starting points for fine-tuning LLMs often fall within the range of $1 \times 10^{-5}$ to $5 \times 10^{-5}$. It's rare to exceed $1 \times 10^{-4}$ for full fine-tuning.
Learning Rate Schedulers: Using a constant learning rate throughout training is usually suboptimal. Learning rate schedulers dynamically adjust the learning rate during training, often improving convergence speed and final performance. Common strategies include:
Linear Warmup with Decay: This is a very common schedule for fine-tuning transformers. Training starts with a very small learning rate, which gradually increases linearly over a set number of "warmup" steps. After the warmup phase, the learning rate typically decays linearly (or sometimes polynomially) towards zero over the remaining training steps. The warmup phase helps stabilize training early on, especially when gradients might be large or noisy, preventing the model from diverging. The subsequent decay allows for finer adjustments as the model approaches convergence.
Example learning rate schedule: linear warmup over steps 0 to 100, followed by linear decay.
Cosine Annealing: Here, the learning rate starts at its maximum value (after any warmup) and decays following a cosine curve towards a minimum value (often zero). This provides a smoother decay compared to linear decay and can sometimes lead to better exploration of the loss surface.
Constant Warmup: Similar to linear warmup, except the learning rate is held at a small constant value during the warmup phase; it then jumps to its maximum value and decays according to some schedule (e.g., linear, cosine).
The choice of scheduler and its parameters (warmup steps, decay function) are themselves hyperparameters that often require tuning; the sketch below shows how these schedules are typically constructed in code.
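As a sketch of how these schedules are commonly set up, assuming the Hugging Face transformers library (the model, step counts, and warmup fraction are placeholders):

```python
import torch
from transformers import get_linear_schedule_with_warmup, get_cosine_schedule_with_warmup

model = torch.nn.Linear(10, 2)  # stand-in for a pre-trained model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

num_training_steps = 1000  # total weight updates (placeholder)
num_warmup_steps = 100     # e.g., roughly 10% of the total

# Linear warmup to the peak learning rate, then linear decay to zero.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps
)
# Cosine annealing alternative:
# scheduler = get_cosine_schedule_with_warmup(
#     optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps
# )

for step in range(num_training_steps):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 10)).sum()  # dummy loss for illustration
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the learning rate once per weight update
```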
The batch size specifies the number of training examples used in a single forward and backward pass to compute the gradient before updating the model weights.
The maximum feasible batch size is often constrained by available GPU memory. Techniques like gradient accumulation (discussed in Chapter 7) allow simulating larger batch sizes by computing gradients over several smaller batches before performing a weight update, mitigating memory constraints at the cost of slightly increased computation time. Typical batch sizes for full fine-tuning range from 4 to 64, heavily depending on the model size and GPU memory.
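A minimal sketch of gradient accumulation (sizes and data are illustrative): gradients from several micro-batches are summed before a single weight update, so the effective batch size is the micro-batch size times the number of accumulation steps.

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in for the model being fine-tuned
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = torch.nn.CrossEntropyLoss()
accumulation_steps = 4  # 4 micro-batches of 8 -> effective batch size 32

# Stand-in for a DataLoader yielding (inputs, targets) micro-batches.
dataloader = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(16)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(dataloader):
    loss = loss_fn(model(inputs), targets)
    # Scale so the accumulated gradient matches a large-batch average.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # one weight update per accumulation window
        optimizer.zero_grad()
```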
An epoch represents one complete pass through the entire training dataset.
Determining the optimal number of epochs is usually done empirically using a validation set. Monitor the performance (e.g., loss, task-specific metrics) on the validation set after each epoch (or fraction of an epoch). Stop training when performance on the validation set ceases to improve or starts to degrade, a technique called early stopping. Fine-tuning often requires only a few epochs (e.g., 1-5) due to the strong initialization provided by the pre-trained model, especially with larger datasets. Smaller datasets might appear to benefit from more epochs, but this increases the risk of overfitting, making regularization more important.
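A self-contained sketch of early stopping on a validation set (the data, model, and patience value are toy placeholders):

```python
import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = torch.nn.CrossEntropyLoss()

# Toy train/validation splits; stand-ins for real task data.
train_x, train_y = torch.randn(64, 10), torch.randint(0, 2, (64,))
val_x, val_y = torch.randn(32, 10), torch.randint(0, 2, (32,))

best_val_loss, best_state = float("inf"), None
patience, bad_epochs = 2, 0  # stop after 2 epochs without improvement

for epoch in range(10):  # upper bound on epochs
    model.train()
    optimizer.zero_grad()
    loss_fn(model(train_x), train_y).backward()
    optimizer.step()  # one "epoch" here is one pass over the toy batch

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(val_x), val_y).item()

    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # validation stopped improving: early stop

if best_state is not None:
    model.load_state_dict(best_state)  # restore the best checkpoint
```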
While various optimizers exist, AdamW (Adam with decoupled weight decay) is the standard and generally recommended optimizer for fine-tuning transformer models.
Other optimizers like Adafactor might be considered in extremely memory-constrained scenarios, but AdamW is the common starting point. Hyperparameters specific to the optimizer, like the betas ($\beta_1$, $\beta_2$) and epsilon ($\epsilon$) in AdamW, are usually left at their default values (e.g., $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 1 \times 10^{-8}$), although tuning them can occasionally yield marginal gains.
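Instantiating AdamW with these defaults in PyTorch looks like the following (the learning rate and weight decay values are illustrative, not prescriptive):

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in for a pre-trained model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-5,             # typical full fine-tuning learning rate
    betas=(0.9, 0.999),  # default moment-estimate decay rates
    eps=1e-8,            # default numerical-stability constant
    weight_decay=0.01,   # decoupled weight decay, discussed next
)
```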
Weight decay is a regularization technique that keeps the model weights small, discouraging overly complex patterns and thus reducing overfitting. With plain SGD it is equivalent to L2 regularization, i.e., adding a penalty to the loss proportional to the squared magnitude of the weights; with adaptive optimizers such as Adam the two differ, which is why AdamW applies weight decay in decoupled form directly in the update step.
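In symbols, with $\lambda$ the weight decay coefficient and $\eta$ the learning rate, the L2-regularized loss and AdamW's decoupled update can be written as:

$$L_{\text{total}}(\theta) = L(\theta) + \frac{\lambda}{2}\lVert\theta\rVert_2^2 \qquad\qquad \theta_{\text{new}} = \theta_{\text{old}} - \eta\left(\frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon} + \lambda\,\theta_{\text{old}}\right)$$

where $\hat{m}$ and $\hat{v}$ are Adam's bias-corrected first and second moment estimates.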
The weight decay coefficient is another hyperparameter to tune. Typical values for fine-tuning often range from 0.01 to 0.1. Setting it to 0 disables weight decay. As mentioned, its effectiveness is often tied to using an optimizer like AdamW that handles it properly.
Finding the optimal combination of hyperparameters can be complex. Instead of manual trial-and-error, more systematic approaches are often employed, especially when computational budget allows:
Grid Search: Define a discrete set of values for each hyperparameter you want to tune. Train the model for every possible combination of these values. While exhaustive, it becomes computationally infeasible very quickly as the number of hyperparameters and values per hyperparameter increases (curse of dimensionality).
Random Search: Define a range or distribution for each hyperparameter. Randomly sample combinations of hyperparameters from these ranges/distributions and train the model for each sample. Surprisingly, random search is often more efficient than grid search because it explores the hyperparameter space more broadly, potentially finding good combinations faster, especially when only a few hyperparameters significantly impact performance.
Comparison of exploration patterns for Grid Search versus Random Search over two hyperparameters (Learning Rate and Batch Size). Grid Search covers points systematically, while Random Search samples across the space.
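A compact sketch of both strategies over these two hyperparameters; run_trial is a hypothetical helper that would fine-tune the model with the given settings and return a validation loss (here it returns a random placeholder):

```python
import itertools
import random

def run_trial(lr, batch_size):
    """Hypothetical stand-in: fine-tune with these settings, return val loss."""
    return random.random()

# Grid search: every combination of the discrete candidate values (9 trials).
lrs = [1e-5, 2e-5, 5e-5]
batch_sizes = [8, 16, 32]
grid_results = {
    (lr, bs): run_trial(lr, bs) for lr, bs in itertools.product(lrs, batch_sizes)
}

# Random search: the same budget of 9 trials, sampled from ranges instead.
random_results = {}
for _ in range(9):
    lr = 10 ** random.uniform(-5, -4)   # log-uniform between 1e-5 and 1e-4
    bs = random.choice([4, 8, 16, 32, 64])
    random_results[(lr, bs)] = run_trial(lr, bs)

all_results = {**grid_results, **random_results}
best_config = min(all_results, key=all_results.get)
```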
Bayesian Optimization: A more sophisticated approach where the results from previous training runs inform the selection of the next hyperparameter combination to try. It builds a probabilistic model (often using Gaussian Processes) mapping hyperparameters to performance metrics (e.g., validation loss). It uses this model to balance exploration (trying hyperparameters in regions with high uncertainty) and exploitation (trying hyperparameters near the best-performing points found so far). Tools like Optuna, Hyperopt, or Ray Tune provide implementations of Bayesian optimization and other advanced tuning algorithms.
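A minimal Optuna sketch, relying on its default TPE sampler (the objective below returns a dummy score; in practice it would run a full fine-tuning and return the validation loss):

```python
import optuna

def objective(trial):
    # Sample hyperparameters; ranges mirror the guidance above.
    lr = trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True)
    weight_decay = trial.suggest_float("weight_decay", 0.0, 0.1)
    batch_size = trial.suggest_categorical("batch_size", [4, 8, 16, 32])

    # Placeholder score; a real objective would fine-tune and evaluate here.
    return (lr * 1e4 - 0.2) ** 2 + weight_decay + abs(batch_size - 16) * 1e-3

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)  # each trial informed by previous ones
print(study.best_params)
```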
Mastering hyperparameter tuning for full fine-tuning is a blend of understanding the underlying principles, applying systematic search strategies, and incorporating practical considerations based on the specific task and available resources. It remains a somewhat empirical process, but a structured approach significantly increases the chances of finding a configuration that yields optimal performance.