Selecting a base model for full fine-tuning involves more than picking the one with the highest benchmark scores. The model's underlying architecture dictates its suitability for your specific task and, more practically, whether you have the computational resources to update all of its weights. Since full fine-tuning modifies every parameter, the initial architectural choice has significant and direct consequences on memory usage, training time, and the final performance of your specialized model.
Language models are not a monolith. Their design is often optimized for specific types of problems. The three primary architectural families you will encounter are encoder-only, decoder-only, and encoder-decoder.
Decoder-only (e.g., GPT, Llama, Mistral): These models are the standard for generative tasks. They process text sequentially and are designed to predict the next token in a sequence, making them ideal for text generation, chatbots, and instruction-following. Their autoregressive nature means they excel at open-ended content creation. When your goal is to teach a model a new conversational style or a specific generative skill, a decoder-only model is almost always the correct choice.
Encoder-only (e.g., BERT, RoBERTa): These models are designed to build a deep understanding of an entire text input by considering both left and right context simultaneously. This makes them unsuitable for text generation but highly effective for natural language understanding (NLU) tasks. If your objective is classification, sentiment analysis, or named entity recognition, fine-tuning an encoder-only model will yield better results with fewer resources than a much larger decoder-only model.
Encoder-Decoder (e.g., T5, BART): These models combine both an encoder and a decoder, making them a powerful choice for sequence-to-sequence tasks. The encoder processes the source text to create a rich representation, and the decoder uses that representation to generate a new target text. This architecture is a natural fit for machine translation, text summarization, and question answering where the input needs to be transformed into a new output format.
In summary, when choosing an architecture: use a decoder-only model for open-ended generation, chat, and instruction following; an encoder-only model for classification, sentiment analysis, and other NLU tasks; and an encoder-decoder model for translation, summarization, and other tasks that transform an input sequence into a new output sequence.
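In practice, this choice maps directly onto the Auto classes in the Hugging Face transformers library. The sketch below is illustrative only; the checkpoint names are examples, and you would substitute whichever base model you are evaluating.

```python
from transformers import (
    AutoModelForCausalLM,                # decoder-only: generation, chat, instruction following
    AutoModelForSequenceClassification,  # encoder-only: classification, sentiment analysis
    AutoModelForSeq2SeqLM,               # encoder-decoder: translation, summarization
)

# Decoder-only model for generative fine-tuning
generator = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Encoder-only model with a fresh classification head (e.g., 3 sentiment classes)
classifier = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=3)

# Encoder-decoder model for sequence-to-sequence tasks such as summarization
summarizer = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
```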
The number of parameters in a model is the most direct factor influencing the hardware requirements for full fine-tuning. A 7-billion parameter model requires substantially less GPU memory than a 70-billion parameter model. However, size is not the only factor. Modern architectures incorporate features that can significantly reduce the computational overhead.
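If you want to check a candidate's parameter count before downloading its full weights, one approach is a meta-device instantiation with the accelerate library. This is a sketch under that assumption; the checkpoint name is only an example.

```python
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

# Fetch only the configuration (a few kilobytes), then instantiate the model
# on the "meta" device so no weight memory is actually allocated.
config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

num_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {num_params / 1e9:.2f}B")
```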
One of the most important innovations is in the attention mechanism. The original Transformer architecture used Multi-Head Attention (MHA), where each attention head has its own set of Query, Key, and Value projection matrices. Newer models often use more efficient variants:
Grouped-Query Attention (GQA): Groups of attention heads share a single Key and Value projection, shrinking the memory footprint of the attention layers and the key-value cache with little loss in quality.
Multi-Query Attention (MQA): All attention heads share one Key and Value projection, reducing memory requirements even further.
Models like Llama 2 and Mistral employ these optimized attention mechanisms. Choosing a model with GQA or MQA can make full fine-tuning more manageable on your available hardware, as it reduces the memory footprint of one of the model's most demanding components.
Diagram of attention mechanisms. MHA uses unique Key (K) and Value (V) projections per head, while GQA groups heads to share projections, and MQA uses a single K/V projection for all heads, reducing memory requirements.
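One way to check which variant a checkpoint uses is to compare head counts in its configuration. The snippet below is a sketch; the checkpoint name is an example, and the num_key_value_heads field is only present for architectures that support GQA or MQA.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")

n_heads = config.num_attention_heads                            # query heads
n_kv_heads = getattr(config, "num_key_value_heads", n_heads)    # K/V heads (falls back to MHA)

if n_kv_heads == n_heads:
    attention = "MHA (one K/V projection per head)"
elif n_kv_heads == 1:
    attention = "MQA (a single shared K/V projection)"
else:
    attention = f"GQA ({n_heads // n_kv_heads} query heads per K/V group)"

print(f"{n_heads} query heads, {n_kv_heads} K/V heads -> {attention}")
```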
Another architectural pattern to be aware of is the Mixture of Experts (MoE). In an MoE model like Mixtral 8x7B, the model contains multiple "expert" sub-networks. For any given input token, a gating network routes the computation to a small subset of these experts. While the total parameter count is high (e.g., ~47B for Mixtral), only a fraction of those parameters (e.g., ~13B) are used for any forward pass. This makes inference very fast. However, for full fine-tuning, you must load all experts into memory, meaning the VRAM requirement corresponds to the total parameter count, not the active parameter count.
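A rough back-of-the-envelope comparison makes this concrete, using the approximate Mixtral figures above and an assumed ~16 bytes per parameter for fp32 AdamW training (weights, gradients, and two moment estimates), ignoring activations.

```python
# Approximate Mixtral 8x7B parameter counts from the discussion above
total_params = 47e9     # all experts must be resident in GPU memory
active_params = 13e9    # parameters actually used per token

# Rough full fine-tuning footprint: ~16 bytes/parameter with fp32 AdamW
bytes_per_param = 16

print(f"Budget for the full model:  ~{total_params * bytes_per_param / 1e9:,.0f} GB")
print(f"Naive active-only estimate: ~{active_params * bytes_per_param / 1e9:,.0f} GB")
```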
Before committing to a base model for full fine-tuning, consider the following points:
Task and Architecture Alignment: Does the model's architecture (decoder, encoder, etc.) match your end goal? Using the wrong type of model will lead to poor performance, no matter how much you tune it.
Computational Budget: Full fine-tuning is memory-intensive. As a rough estimate, training in full precision (32-bit) with the standard AdamW optimizer requires about 16 bytes of GPU VRAM per parameter: 4 bytes for the weights, 4 for the gradients, and 8 for the optimizer's two moment estimates. A 7B model therefore needs on the order of 112 GB of VRAM before activation memory is counted. Mixed-precision training reduces activation memory and speeds up computation, but the gradient and optimizer memory remain a significant constraint. Choose a model that fits within your hardware limits; a short estimator sketch follows this list.
Model License: Always check the model's license to ensure it permits your intended use case, especially for commercial applications. Models like Llama 2 have specific use restrictions, while others like Falcon and Mistral use more permissive licenses like Apache 2.0.
Community and Ecosystem Support: Is the model well-supported in libraries like Hugging Face transformers and datasets? A strong community provides valuable resources, pre-existing fine-tuning scripts, and a support network for troubleshooting issues that may arise during your project. Choosing a popular, well-supported model can save you a great deal of time and effort.
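A minimal sketch of the memory estimate mentioned above, assuming fp32 weights, fp32 gradients, and two fp32 AdamW moment estimates per parameter, and ignoring activations and temporary buffers:

```python
def estimate_full_finetune_vram_gb(num_params: float,
                                   weight_bytes: int = 4,
                                   grad_bytes: int = 4,
                                   optimizer_bytes: int = 8) -> float:
    """Rough VRAM estimate for full fine-tuning with AdamW in full precision.

    Defaults assume fp32 weights (4 B), fp32 gradients (4 B), and two fp32
    AdamW moment estimates (8 B) per parameter. Activations are not included.
    """
    bytes_per_param = weight_bytes + grad_bytes + optimizer_bytes
    return num_params * bytes_per_param / 1e9

for size in (7e9, 13e9, 70e9):
    print(f"{size / 1e9:.0f}B model: ~{estimate_full_finetune_vram_gb(size):.0f} GB")
```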