In previous sections, we discussed strategies for optimizing large language models for deployment, focusing on techniques like quantization to reduce model size and potentially accelerate inference. Now, let's put that theory into practice. This hands-on exercise guides you through applying post-training quantization to a smaller transformer model and deploying it with a simple web server. While we use a manageable model size for this practice, the workflow illustrates the fundamental steps involved in deploying optimized LLMs.

## Prerequisites

Before starting, ensure you have Python installed along with the necessary libraries. You can install them using pip:

```bash
pip install torch transformers optimum[onnxruntime] fastapi uvicorn[standard] psutil
```

We'll use the Hugging Face transformers library to load a pre-trained model, optimum to handle the ONNX conversion and quantization process, torch as the backend, fastapi to create a simple web service, uvicorn to run the server, and psutil to check memory usage (as a proxy for model size).

## Step 1: Load the Base Model

First, let's load a standard pre-trained transformer model. We'll use distilbert-base-uncased-finetuned-sst-2-english, a smaller, faster DistilBERT model fine-tuned for sentiment classification, which is suitable for this demonstration. In a practical LLMOps scenario, you would replace this with your specific large model checkpoint.

```python
import os

import psutil  # used to inspect process memory as a rough proxy for model footprint
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Define model name and task
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
task = "text-classification"  # Optimum needs the task for ONNX export

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model_fp32 = AutoModelForSequenceClassification.from_pretrained(model_name)

# Function to get model size in memory (approximate)
def get_model_size_mb(model):
    mem_params = sum(param.nelement() * param.element_size() for param in model.parameters())
    mem_bufs = sum(buf.nelement() * buf.element_size() for buf in model.buffers())
    total_mem_bytes = mem_params + mem_bufs
    return total_mem_bytes / (1024**2)  # Convert bytes to megabytes

# Get size of the original FP32 model
model_fp32_size = get_model_size_mb(model_fp32)
print(f"Original FP32 model size: {model_fp32_size:.2f} MB")

# Save the tokenizer for later use with the quantized model
tokenizer.save_pretrained("./model_fp32")
model_fp32.save_pretrained("./model_fp32")  # Save FP32 model if needed for direct comparison later

# Optional: Clean up memory if running in a constrained environment
# del model_fp32
# torch.cuda.empty_cache()  # If using GPU
```

This code loads the tokenizer and the standard floating-point precision (FP32) model. We also define a helper function to estimate the model's size in memory.
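Before converting anything, it can be helpful to record a rough latency baseline for the unquantized model so you have a point of comparison in Step 4. The snippet below is an optional, minimal sketch that assumes model_fp32 and tokenizer from Step 1 are still in memory; a single-shot timing like this is noisy and only meant as a ballpark figure.

```python
import time

# Optional FP32 baseline: one warm-up pass, then a single timed prediction.
sample_text = "This movie was absolutely fantastic!"
inputs = tokenizer(sample_text, return_tensors="pt")

model_fp32.eval()
with torch.no_grad():
    _ = model_fp32(**inputs)  # warm-up run
    start = time.time()
    logits = model_fp32(**inputs).logits
    elapsed = time.time() - start

predicted_id = torch.argmax(logits, dim=1).item()
print(f"FP32 prediction: {model_fp32.config.id2label[predicted_id]}")
print(f"FP32 PyTorch inference time: {elapsed:.4f} seconds")
```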
## Step 2: Apply Post-Training Quantization using Optimum

Now, we'll use the optimum library, which simplifies the process of optimizing models, including quantization via ONNX Runtime. We'll convert the PyTorch model to ONNX format and then apply dynamic quantization to INT8.

```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Define output directories
onnx_fp32_path = "./onnx_fp32"
onnx_int8_path = "./onnx_int8"
os.makedirs(onnx_fp32_path, exist_ok=True)
os.makedirs(onnx_int8_path, exist_ok=True)

# 1. Export the model to ONNX FP32 format
model_fp32_onnx = ORTModelForSequenceClassification.from_pretrained(model_name, export=True, task=task)
model_fp32_onnx.save_pretrained(onnx_fp32_path)
tokenizer.save_pretrained(onnx_fp32_path)  # Save tokenizer with the ONNX model
print(f"FP32 ONNX model saved to: {onnx_fp32_path}")

# 2. Create a quantizer from the FP32 ONNX model
quantizer = ORTQuantizer.from_pretrained(onnx_fp32_path, file_name="model.onnx")

# 3. Define the quantization configuration (dynamic INT8)
# AVX2 or VNNI instruction sets are often beneficial here if available
qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)  # Dynamic quantization

# 4. Apply quantization
quantizer.quantize(save_dir=onnx_int8_path, quantization_config=qconfig)
print(f"INT8 ONNX model saved to: {onnx_int8_path}")

# Save the tokenizer with the quantized model as well
tokenizer.save_pretrained(onnx_int8_path)

# Optional: Clean up the intermediate ONNX model
# del model_fp32_onnx
# del quantizer
# torch.cuda.empty_cache()  # If using GPU
```

This process involves:

1. Exporting the original PyTorch model to the ONNX format with FP32 precision. optimum handles this conversion.
2. Loading the exported ONNX model into an ORTQuantizer object.
3. Specifying the quantization configuration. We choose dynamic INT8 quantization suitable for AVX2 CPU instruction sets (common on modern processors). With dynamic quantization, weights are quantized once when the model is loaded, while activations are quantized on the fly during inference.
4. Running the quantize method, which applies the configuration and saves the INT8 quantized model to the specified directory.

## Step 3: Compare Model Sizes

A primary benefit of quantization is model size reduction. Let's compare the disk footprint of the FP32 ONNX model and the INT8 quantized ONNX model.

```python
import os

def get_dir_size_bytes(path="."):
    """Recursively sum the size of all files under a directory, in bytes."""
    total = 0
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_file():
                total += entry.stat().st_size
            elif entry.is_dir():
                total += get_dir_size_bytes(entry.path)
    return total

def get_dir_size_mb(path="."):
    return get_dir_size_bytes(path) / (1024**2)  # Convert bytes to megabytes

# Calculate sizes
fp32_onnx_size = get_dir_size_mb(onnx_fp32_path)
int8_onnx_size = get_dir_size_mb(onnx_int8_path)

print(f"FP32 ONNX model directory size: {fp32_onnx_size:.2f} MB")
print(f"INT8 ONNX model directory size: {int8_onnx_size:.2f} MB")
print(f"Size reduction: {(1 - int8_onnx_size / fp32_onnx_size) * 100:.2f}%")

# Data for Plotly chart
size_data = {
    "models": ["FP32 ONNX", "INT8 ONNX"],
    "sizes_mb": [round(fp32_onnx_size, 2), round(int8_onnx_size, 2)],
}
```

[Bar chart: model size on disk, FP32 ONNX ≈ 252.94 MB vs. INT8 ONNX ≈ 66.36 MB]

*Comparison of the approximate disk size for the FP32 and INT8 quantized ONNX models. Note: exact sizes may vary slightly based on dependencies and file structure.*
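If you want to confirm that quantization actually rewrote the graph rather than just shrinking the file, you can inspect the operators in the exported INT8 model. The snippet below is a small sketch that assumes the onnx package is available (it is typically installed alongside optimum[onnxruntime]); the exact operator names you see (for example DynamicQuantizeLinear or MatMulInteger for dynamic quantization) depend on your optimum and ONNX Runtime versions.

```python
from collections import Counter

import onnx

# Pick whatever .onnx file the quantizer wrote into the INT8 output directory.
onnx_files = [f for f in os.listdir(onnx_int8_path) if f.endswith(".onnx")]
int8_graph = onnx.load(os.path.join(onnx_int8_path, onnx_files[0]))

# Count operator types; integer/quantize ops indicate the graph was rewritten.
op_counts = Counter(node.op_type for node in int8_graph.graph.node)
print(op_counts.most_common(10))
```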
You should observe a significant reduction in size, often close to 4x when moving from FP32 to INT8, since each weight now requires only 8 bits instead of 32.

## Step 4: Load and Test the Quantized Model (Optional)

Before deploying, let's quickly load the quantized model using optimum and run a sample inference to verify that it works.

```python
import time

import torch
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

# Load the quantized model and tokenizer
quantized_model = ORTModelForSequenceClassification.from_pretrained(onnx_int8_path)
quantized_tokenizer = AutoTokenizer.from_pretrained(onnx_int8_path)

# Sample text
text = "This movie was absolutely fantastic!"
inputs = quantized_tokenizer(text, return_tensors="pt")  # PyTorch tensors expected by default

# Warm-up run (optional, improves timing accuracy)
_ = quantized_model(**inputs)

# Time the inference
start_time = time.time()
outputs = quantized_model(**inputs)
end_time = time.time()

# Process the output
logits = outputs.logits
predicted_class_id = torch.argmax(logits, dim=1).item()
prediction = quantized_model.config.id2label[predicted_class_id]

print(f"Input Text: '{text}'")
print(f"Predicted Class: {prediction} (ID: {predicted_class_id})")
print(f"Inference time (INT8 ONNX): {end_time - start_time:.4f} seconds")

# Optional: Compare with the FP32 ONNX model's inference time
# try:
#     fp32_model_onnx = ORTModelForSequenceClassification.from_pretrained(onnx_fp32_path)
#     fp32_tokenizer = AutoTokenizer.from_pretrained(onnx_fp32_path)
#     inputs_fp32 = fp32_tokenizer(text, return_tensors="pt")
#     _ = fp32_model_onnx(**inputs_fp32)  # Warm-up
#     start_time_fp32 = time.time()
#     outputs_fp32 = fp32_model_onnx(**inputs_fp32)
#     end_time_fp32 = time.time()
#     print(f"Inference time (FP32 ONNX): {end_time_fp32 - start_time_fp32:.4f} seconds")
# except Exception as e:
#     print(f"Could not run FP32 ONNX comparison: {e}")

# Clean up models from memory
# del quantized_model
# if 'fp32_model_onnx' in locals(): del fp32_model_onnx
# torch.cuda.empty_cache()  # If using GPU
```

While performance gains depend heavily on the hardware (CPU features like AVX2/VNNI, or GPU capabilities), quantization generally leads to faster inference due to reduced memory bandwidth requirements and potentially faster computation on integer units. The accuracy impact of post-training quantization should also be evaluated on a relevant dataset (a quick sanity check is sketched below), although for many models and tasks, INT8 quantization maintains acceptable accuracy.
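As a rough accuracy sanity check, you can score the INT8 model on a small slice of the SST-2 validation split. The sketch below assumes the Hugging Face datasets library is installed (pip install datasets) and reuses quantized_model and quantized_tokenizer from Step 4; 200 examples keep it fast, but this is not a rigorous evaluation.

```python
# Rough INT8 accuracy check on a small SST-2 validation slice (not a full evaluation).
from datasets import load_dataset

eval_data = load_dataset("glue", "sst2", split="validation[:200]")

correct = 0
for example in eval_data:
    enc = quantized_tokenizer(
        example["sentence"], return_tensors="pt", truncation=True, max_length=512
    )
    pred = torch.argmax(quantized_model(**enc).logits, dim=1).item()
    correct += int(pred == example["label"])  # SST-2 labels: 0 = negative, 1 = positive

print(f"INT8 accuracy on {len(eval_data)} SST-2 validation examples: {correct / len(eval_data):.3f}")
```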
## Step 5: Deploy the Quantized Model with FastAPI

Now, let's create a simple web service using FastAPI to serve our quantized INT8 model. This server will load the model and tokenizer, accept text input via HTTP requests, and return the classification prediction.

Create a file named serve_quantized.py:

```python
import os
import time

import torch
from fastapi import FastAPI
from optimum.onnxruntime import ORTModelForSequenceClassification
from pydantic import BaseModel
from transformers import AutoTokenizer

# Define the path to the quantized model
MODEL_DIR = "./onnx_int8"

# Check that the model directory exists
if not os.path.exists(MODEL_DIR):
    raise RuntimeError(f"Model directory not found: {MODEL_DIR}. Please run the quantization steps first.")

# Load the quantized model and tokenizer when the server starts
try:
    print(f"Loading quantized model from {MODEL_DIR}...")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = ORTModelForSequenceClassification.from_pretrained(MODEL_DIR)
    # Perform a dummy inference to warm up the runtime (JIT compilation, etc.)
    _ = model(**tokenizer("warmup", return_tensors="pt"))
    print("Model loaded successfully.")
except Exception as e:
    print(f"Error loading model: {e}")
    # Optionally exit or handle the error appropriately
    raise RuntimeError("Failed to load the quantized model.") from e

# Initialize the FastAPI app
app = FastAPI(title="Quantized Model Serving API")

# Define the request body structure
class InferenceRequest(BaseModel):
    text: str

# Define the response body structure
class InferenceResponse(BaseModel):
    prediction: str
    confidence: float
    latency_ms: float

@app.post("/predict", response_model=InferenceResponse)
async def predict(request: InferenceRequest):
    """Performs inference on the input text using the quantized model."""
    start_time = time.time()

    # Tokenize the input text
    inputs = tokenizer(request.text, return_tensors="pt", truncation=True, max_length=512)

    # Perform inference
    with torch.no_grad():  # Ensure gradients are not calculated
        outputs = model(**inputs)

    # Process the output
    logits = outputs.logits
    probabilities = torch.softmax(logits, dim=1)
    predicted_class_id = torch.argmax(probabilities, dim=1).item()
    confidence = probabilities[0, predicted_class_id].item()
    prediction_label = model.config.id2label[predicted_class_id]

    end_time = time.time()
    latency_ms = (end_time - start_time) * 1000

    return InferenceResponse(
        prediction=prediction_label,
        confidence=round(confidence, 4),
        latency_ms=round(latency_ms, 2),
    )

@app.get("/")
async def root():
    return {"message": "Quantized Model Server is running. Use POST /predict to make predictions."}

# If running this script directly, start the Uvicorn server
if __name__ == "__main__":
    import uvicorn

    # Ensure the host is accessible if running inside Docker or a VM
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

This script defines:

- Loading logic: It loads the tokenizer and the quantized ORTModelForSequenceClassification from the ./onnx_int8 directory when the server starts.
- API endpoint (/predict): Accepts POST requests with JSON containing {"text": "your input text"}.
- Inference logic: Tokenizes the input, runs inference using the loaded quantized model, calculates probabilities, and determines the predicted label.
- Response: Returns a JSON object with the prediction, confidence score, and processing latency.
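One detail worth noting: the /predict handler is declared async, but the ONNX Runtime call itself is synchronous, so it blocks the event loop while it runs. For this exercise that is fine; if you expect concurrent requests, one option is to push the blocking call onto FastAPI's worker thread pool. The sketch below is an optional variant, not part of the original server: it assumes the app, model, tokenizer, and request/response models from serve_quantized.py, and the /predict_threaded route name is just an illustrative choice.

```python
# Optional variant of the prediction endpoint: run the blocking ONNX Runtime call
# in a worker thread so the event loop stays responsive under concurrent requests.
# Assumes the objects defined in serve_quantized.py (app, model, tokenizer, ...).
from fastapi.concurrency import run_in_threadpool


@app.post("/predict_threaded", response_model=InferenceResponse)
async def predict_threaded(request: InferenceRequest):
    start_time = time.time()
    inputs = tokenizer(request.text, return_tensors="pt", truncation=True, max_length=512)

    # The lambda runs in a thread pool; awaiting it frees the event loop.
    outputs = await run_in_threadpool(lambda: model(**inputs))

    probabilities = torch.softmax(outputs.logits, dim=1)
    predicted_class_id = torch.argmax(probabilities, dim=1).item()

    return InferenceResponse(
        prediction=model.config.id2label[predicted_class_id],
        confidence=round(probabilities[0, predicted_class_id].item(), 4),
        latency_ms=round((time.time() - start_time) * 1000, 2),
    )
```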
## Step 6: Run and Test the Server

Now, run the FastAPI server from your terminal in the same directory where you ran the previous Python code and where the onnx_int8 directory exists:

```bash
python serve_quantized.py
```

Uvicorn will start the server, typically listening on http://127.0.0.1:8000.

You can test the endpoint using curl from another terminal:

```bash
curl -X POST "http://127.0.0.1:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"text": "This framework makes deployment so much easier!"}'
```

Or using Python with the requests library:

```python
import json

import requests

url = "http://127.0.0.1:8000/predict"
payload = {"text": "Optimum and ONNX Runtime provide great optimizations."}
headers = {"Content-Type": "application/json"}

response = requests.post(url, data=json.dumps(payload), headers=headers)
if response.status_code == 200:
    print("Request successful!")
    print(response.json())
else:
    print(f"Request failed with status code: {response.status_code}")
    print(response.text)

payload = {"text": "I am not sure about this, it seems quite complicated."}
response = requests.post(url, data=json.dumps(payload), headers=headers)
if response.status_code == 200:
    print("\nRequest successful!")
    print(response.json())
else:
    print(f"\nRequest failed with status code: {response.status_code}")
    print(response.text)
```

You should receive JSON responses containing the sentiment classification (POSITIVE or NEGATIVE for this model), the confidence score, and the time taken for the prediction.
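If you want more than a single data point, you can call the endpoint repeatedly and look at the latency distribution. The following is a minimal, single-threaded sketch rather than a proper load test (dedicated tools are more appropriate for that); it simply reuses the latency_ms field returned by the server.

```python
import statistics

import requests

# Send repeated requests to the running server and summarize the reported latencies.
url = "http://127.0.0.1:8000/predict"
payload = {"text": "Benchmarking the quantized sentiment model."}

latencies = []
for _ in range(50):
    response = requests.post(url, json=payload)
    response.raise_for_status()
    latencies.append(response.json()["latency_ms"])

latencies.sort()
print(f"p50 latency: {statistics.median(latencies):.2f} ms")
print(f"p95 latency (approx.): {latencies[int(0.95 * len(latencies)) - 1]:.2f} ms")
```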
## Summary and Next Steps

In this practice, you successfully applied post-training dynamic quantization to a transformer model using Hugging Face optimum and ONNX Runtime. You observed the reduction in model size and deployed the optimized model using a simple FastAPI server.

This exercise demonstrates the core workflow:

1. Optimize: Apply techniques like quantization (or pruning, distillation) to the trained model.
2. Package: Save the optimized model and its dependencies (like the tokenizer configuration).
3. Serve: Load the optimized model into an inference server (simple like FastAPI, or specialized like Triton/vLLM) accessible via an API.

While we used a relatively small model and dynamic quantization for simplicity, the principles extend to larger models and the other optimization techniques discussed in this chapter. For production LLM deployment, you would typically:

- Use more sophisticated inference servers (NVIDIA Triton, vLLM, TensorRT-LLM) designed for high-throughput, low-latency GPU inference.
- Explore static quantization or Quantization-Aware Training (QAT) for potentially better performance and accuracy, especially on specific hardware accelerators.
- Implement packaging using containers (e.g., Docker).
- Integrate the deployment into a CI/CD pipeline with automated testing and rollout strategies (canary, blue/green).
- Set up comprehensive monitoring for performance, cost, and model drift, as covered in the next chapter.