Up to this point, you have trained and evaluated your parameter-efficient adapters. The result of the fine-tuning process is a set of LoRA weights, which act as modifiers to the original model. To use the model efficiently in a production environment, you need to combine these updated weights with the base architecture.
This chapter focuses on the steps required to transition your model from a training artifact to a deployed application. You will begin by fusing the trained LoRA adapters back into the base model layers. For a base weight matrix $W_0$ and low-rank adapter matrices $A$ and $B$, the merged weight matrix is calculated as:

$$W_{\text{merged}} = W_0 + \frac{\alpha}{r} \, BA$$

where $r$ is the adapter rank and $\alpha$ is the LoRA scaling factor.
This operation allows the model to run independently without the computational overhead of dynamically applying adapter layers during inference.
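The equivalence between the adapter path and the merged path can be checked numerically. The following sketch uses NumPy with made-up dimensions (`d_out`, `d_in`, rank `r`, and scaling `alpha` are illustrative values, not taken from any particular model) to show that applying the merged matrix gives the same output as running the base weight plus the scaled low-rank update:

```python
import numpy as np

# Hypothetical dimensions and LoRA hyperparameters (illustrative only).
d_out, d_in, r, alpha = 8, 8, 2, 16
rng = np.random.default_rng(0)

W0 = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((r, d_in))        # LoRA "down" projection (r x d_in)
B = rng.standard_normal((d_out, r))       # LoRA "up" projection (d_out x r)

# Merged weight: W0 + (alpha / r) * B @ A
W_merged = W0 + (alpha / r) * (B @ A)

x = rng.standard_normal(d_in)
# Adapter path: base output plus the scaled low-rank correction,
# computed as two small matmuls at inference time.
y_adapter = W0 @ x + (alpha / r) * (B @ (A @ x))
# Merged path: a single matmul with no adapter overhead.
y_merged = W_merged @ x

print(np.allclose(y_adapter, y_merged))
```

Because the two paths are algebraically identical, the merged model needs no adapter machinery at inference time, which is exactly why merging is done before deployment.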
After merging the weights, you will export the final model in the Safetensors format, which provides a safe and fast method for loading tensors directly into memory. You will then set up vLLM to host the model for high-throughput local inference. Finally, you will wrap the served model in a RESTful API so external applications can programmatically send text generation requests. By the end of this chapter, you will have a fully functioning, task-specific language model deployed on a local server and ready to process incoming queries.
7.1 Merging LoRA Adapters with Base Models
7.2 Exporting Models to Safetensors
7.3 Serving SLMs with vLLM
7.4 API Integration Strategies
7.5 Practice: Deploying the Custom Model Locally