When preparing a fine-tuned model for production, running the base architecture alongside separate adapter layers introduces unnecessary computational overhead. Keeping the low-rank matrices separate is standard during training to reduce memory footprints and update parameters efficiently. However, for inference, dynamically adding the outputs of the adapter layers to the base model layers at every forward pass slows down text generation. To optimize performance, trained LoRA adapters can be fused directly into the original model weights.
Recall the mathematical operation of Low-Rank Adaptation. For a given base weight matrix $W$ and low-rank adapter matrices $A$ and $B$, the merged weight matrix is calculated as:

$$W_{\text{merged}} = W + \frac{\alpha}{r} B A$$
Here, $\alpha$ is the scaling factor and $r$ is the rank parameter configured before training. Once this addition is performed, the matrices $A$ and $B$ are no longer needed. The new weight matrix $W_{\text{merged}}$ operates exactly like the original base model weight but contains the specialized behavior acquired during fine-tuning.
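This identity can be verified numerically. The sketch below uses NumPy with hypothetical dimensions and an assumed scaling value of 16; it confirms that a forward pass through the merged matrix produces the same output as the base-plus-adapter forward pass:

```python
import numpy as np

# Hypothetical small dimensions for illustration.
d_out, d_in, r = 8, 8, 2
alpha = 16  # assumed LoRA scaling hyperparameter

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))  # base weight matrix
A = rng.normal(size=(r, d_in))      # low-rank "down" projection
B = rng.normal(size=(d_out, r))     # low-rank "up" projection

# Adapter-style forward pass: base output plus scaled low-rank update.
x = rng.normal(size=(d_in,))
y_adapter = W @ x + (alpha / r) * (B @ (A @ x))

# Merged forward pass: fold the update into a single weight matrix.
W_merged = W + (alpha / r) * (B @ A)
y_merged = W_merged @ x

print(np.allclose(y_adapter, y_merged))  # True
```

Because the merged matrix has the same shape as the original, the merged model's architecture is unchanged; only the values of the weights differ.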
Process of fusing low-rank adapter matrices into the base model weights to create a single deployment-ready model.
In practice, the Hugging Face peft library wraps this operation in a single method call: merge_and_unload(). This method computes the scaled low-rank product, adds it to each base weight matrix, and then removes the separate adapter modules from memory. The result is a standard Hugging Face Transformers model object.
To perform the merge, you must first load the base model and then attach the adapter weights.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel
base_model_id = "your-small-base-model"
adapter_path = "./lora-adapters"
# Load the base model in 16-bit precision
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)
# Load the PEFT model combining base weights and adapters
model = PeftModel.from_pretrained(base_model, adapter_path)
# Merge the weights and unload the adapters
merged_model = model.merge_and_unload()
Merging requires loading the base model and adapters in full or half precision, such as float16 or bfloat16. If you trained with QLoRA on a 4-bit quantized base model, you cannot merge the adapters directly into the 4-bit weights: the merge arithmetic requires consistent, non-quantized tensor types, and folding the update into quantized weights would cause severe precision degradation.
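A toy example makes the degradation concrete: when weights live on a coarse quantization grid, a small adapter update can be rounded away entirely after the merge. The grid step and values below are hypothetical stand-ins for low-bit quantization, not the actual 4-bit scheme:

```python
import numpy as np

# Coarse grid step stands in for low-bit quantization (hypothetical).
step = 0.5

# Quantized base weights: snapped to the nearest grid point.
W_q = np.round(np.array([0.1, -0.3, 0.7]) / step) * step

# Small LoRA update, much finer than the grid resolution.
delta = np.array([0.05, 0.04, -0.03])

# Merging in float precision preserves the update exactly.
merged_float = W_q + delta

# Re-quantizing after the merge rounds the update away.
merged_requant = np.round(merged_float / step) * step
print(np.allclose(merged_requant, W_q))  # True: the update vanished
```

This is why the standard workflow dequantizes to float16 or bfloat16 before merging, rather than writing the merged values back into a low-bit format.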
You must load the base model in float16, apply the adapters, and perform the merge. This temporarily demands more system RAM or GPU VRAM than the training phase did. If your local machine lacks sufficient VRAM to hold the unquantized model, you can run the merge on the system CPU instead by setting device_map="cpu" during the initial model loading. The operation is slower there, but it avoids out-of-memory errors on limited hardware.
Once merge_and_unload() completes, the resulting model is structurally identical to a standard, non-fine-tuned model. It no longer relies on the peft library to function, making it highly compatible with optimized inference engines and serving frameworks.