Transitioning a model from training to deployment shifts the focus from training performance to inference efficiency. Serving models to users imposes different constraints than training does: in production, the primary goals are low latency, high throughput, and a minimal memory footprint. The final engineering steps convert a fine-tuned model from a training artifact into a high-performance asset suitable for deployment.
One of the most effective methods for optimizing a model is quantization. This process involves reducing the numerical precision of the model's weights. Most models are trained using 32-bit floating-point numbers (FP32). Quantization converts these weights to a lower-precision format, such as 8-bit integers (INT8) or even 4-bit floating-point numbers (FP4).
This conversion has two main benefits: the model's memory footprint shrinks roughly in proportion to the reduction in bit width, and inference often runs faster because lower-precision arithmetic moves less data and can use faster low-bit kernels.
While this process can result in a minor drop in model accuracy, the performance gains are often substantial, making it a standard practice for deployment. The Hugging Face ecosystem, integrated with libraries like bitsandbytes, makes this straightforward.
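To get a sense of the savings, the back-of-the-envelope calculation below estimates the weight storage of a hypothetical 7-billion-parameter model at different precisions (activations and per-layer overhead are ignored):

# Rough weight-memory estimate for a 7B-parameter model (illustrative only)
num_params = 7_000_000_000
bytes_per_param = {"FP32": 4, "FP16": 2, "INT8": 1, "4-bit": 0.5}

for precision, nbytes in bytes_per_param.items():
    gib = num_params * nbytes / 1024**3
    print(f"{precision}: ~{gib:.1f} GiB of weights")

At FP32 this works out to roughly 26 GiB of weights alone, while 4-bit storage drops to about 3.3 GiB, which is the difference between needing multiple accelerators and fitting comfortably on a single GPU.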
Here is how you can load a model with 8-bit quantization:
from transformers import AutoModelForCausalLM

# Load the model with 8-bit quantization enabled
model_name = "mistralai/Mistral-7B-v0.1"

model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_8bit=True
)

print(model_8bit.get_memory_footprint())
Loading in 4-bit precision, as popularized by QLoRA, offers even greater memory savings and is also easily accessible.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Configure 4-bit quantization
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

# Load the model with the 4-bit configuration
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config
)

print(model_4bit.get_memory_footprint())
The diagram below illustrates this reduction in data precision.
The quantization process converts high-precision floating-point weights into lower-precision formats, reducing memory usage and often speeding up computation.
Another powerful optimization is model compilation. When you run a standard PyTorch model, the Python interpreter introduces overhead. A model compiler transforms the dynamic graph of operations into a static, optimized version that is tailored for the specific hardware it will run on. This process can involve fusing multiple operations into a single kernel, which minimizes memory movement and computational overhead.
PyTorch 2.0 introduced torch.compile(), a function that provides access to state-of-the-art compilation technologies. Applying it is often a one-line change that can yield significant speedups.
import torch
from transformers import AutoModelForCausalLM
# Load your fine-tuned model
model = AutoModelForCausalLM.from_pretrained("./my-finetuned-model")
model.to("cuda")
model.eval() # Set the model to evaluation mode
# Compile the model
# 'max-autotune' mode makes the compiler spend more time
# looking for the fastest possible kernels.
compiled_model = torch.compile(model, mode="max-autotune")
# Use the compiled_model for inference as you normally would
# The first run will be slow due to the compilation process
# Subsequent runs will be much faster.
When using torch.compile(), the first inference pass will be slower because the model is being optimized and compiled in the background. Subsequent calls will use the cached, optimized code and run much faster.
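One rough way to observe this effect is to time two forward passes through the compiled model, as in the sketch below; the prompt is an arbitrary placeholder, the tokenizer path is assumed to match the model loaded above, and a CUDA device is assumed.

import time

import torch
from transformers import AutoTokenizer

# Tokenizer path is assumed to match the fine-tuned model compiled above
tokenizer = AutoTokenizer.from_pretrained("./my-finetuned-model")
inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")

with torch.no_grad():
    # First pass: triggers compilation, so it is noticeably slower
    start = time.perf_counter()
    compiled_model(**inputs)
    torch.cuda.synchronize()
    print(f"first pass (includes compilation): {time.perf_counter() - start:.2f}s")

    # Second pass: reuses the cached, compiled kernels
    start = time.perf_counter()
    compiled_model(**inputs)
    torch.cuda.synchronize()
    print(f"cached pass: {time.perf_counter() - start:.2f}s")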
The way you generate text has a large impact on performance. The generate() method in transformers is highly configurable, and a few parameters are especially important for inference speed.
One of the most significant parameters is use_cache. In autoregressive models, each new token is generated based on all previously generated tokens. The attention mechanism needs to process this growing sequence at each step. The use_cache argument (which is True by default) allows the model to store the internal states (the key-value pairs) of the attention layers. For the next token, it only needs to compute attention for the newest token instead of re-computing for the entire sequence. Disabling this cache would dramatically slow down generation for anything but the shortest outputs.
Additionally, using greedy decoding (do_sample=False) is faster than sampling methods like top-k or nucleus sampling, as it simply takes the highest-probability token at each step and skips the extra work of filtering and sampling from the probability distribution.
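A minimal sketch of these settings, assuming the model and tokenizer loaded earlier are already on the GPU (both arguments match the defaults but are shown explicitly, and the prompt is an arbitrary placeholder):

inputs = tokenizer("The three main benefits of model quantization are", return_tensors="pt").to("cuda")

output_ids = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,  # greedy decoding: always take the highest-probability token
    use_cache=True,   # reuse cached key-value pairs instead of recomputing attention
)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))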
While a compiled PyTorch model is fast, it is still tied to the Python and PyTorch ecosystem. For maximum portability and to use dedicated, high-performance inference engines, you can export your model to the Open Neural Network Exchange (ONNX) format.
ONNX provides a standardized format for machine learning models. Once a model is in ONNX format, you can run it using an ONNX-compatible runtime, such as ONNX Runtime. These runtimes are written in C++ and are highly optimized for a wide range of hardware, including CPUs, GPUs, and specialized AI accelerators.
The transformers.onnx module helps with this conversion.
from pathlib import Path

from transformers.onnx import export, OnnxConfig
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "./my-finetuned-model"
onnx_path = Path("./my-finetuned-model-onnx/model.onnx")
onnx_path.parent.mkdir(parents=True, exist_ok=True)
# 1. Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
# 2. Define the ONNX configuration
class MyOnnxConfig(OnnxConfig):
    @property
    def inputs(self):
        # Defines the model's inputs for the ONNX graph
        return {
            "input_ids": {0: "batch_size", 1: "sequence_length"},
            "attention_mask": {0: "batch_size", 1: "sequence_length"},
        }
# 3. Export the model
onnx_config = MyOnnxConfig.from_model_config(model.config)
# export expects (preprocessor, model, onnx_config, opset_version, output_path)
export(tokenizer, model, onnx_config, onnx_config.default_onnx_opset, onnx_path)
This process writes a .onnx file (large models may also store their weights in accompanying external data files) that represents your model's computational graph. You can then load it with an ONNX-compatible runtime such as ONNX Runtime, or serve it with an inference server like NVIDIA Triton Inference Server for production deployment.
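As a quick local check, the sketch below loads the exported graph with the onnxruntime Python package; the file path matches the export above, while the execution provider and the output shape comment are assumptions that depend on your environment and export configuration.

import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./my-finetuned-model")
session = ort.InferenceSession(
    "./my-finetuned-model-onnx/model.onnx",
    providers=["CPUExecutionProvider"],  # swap in CUDAExecutionProvider if available
)

encoded = tokenizer("Hello, world!", return_tensors="np")
ort_inputs = {
    "input_ids": encoded["input_ids"].astype("int64"),
    "attention_mask": encoded["attention_mask"].astype("int64"),
}

outputs = session.run(None, ort_inputs)  # None requests every graph output
print(outputs[0].shape)  # e.g. (batch_size, sequence_length, hidden_size) for the default export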
The following diagram outlines a complete inference preparation pipeline, starting from a fine-tuned model and ending with a deployable ONNX artifact.
A typical pipeline for preparing a model for deployment involves optional quantization followed by conversion to a standard format like ONNX, which can then be served by a dedicated inference engine.
By applying these techniques, you can transform your fine-tuned model into an efficient and scalable service capable of handling production workloads. Each step offers a trade-off between performance, size, and accuracy, allowing you to tailor the final model to your specific application needs.