Alright, let's put theory into practice. In the previous sections, we discussed metrics and strategies for evaluating quantized models. Now, you'll get hands-on experience benchmarking a quantized Large Language Model against its original counterpart. This practical exercise will help solidify your understanding of how to measure the real-world impact of quantization on both performance (speed, memory) and accuracy.
As we discussed, quantization is fundamentally about trade-offs. We aim to reduce the computational resources required (like memory footprint M and inference latency L) while minimizing the impact on the model's predictive quality, often measured by accuracy or perplexity. This exercise involves quantifying these exact trade-offs.
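To make the memory side of that trade-off concrete, here is a rough back-of-the-envelope estimate for a hypothetical 7B-parameter model, counting weights only and ignoring activations and the KV cache:
# Rough weight-only memory estimate for a hypothetical 7B-parameter model
params = 7e9
fp16_gb = params * 2 / 1e9    # FP16: 2 bytes per parameter -> ~14 GB
int4_gb = params * 0.5 / 1e9  # 4-bit: 0.5 bytes per parameter -> ~3.5 GB
print(f"FP16: {fp16_gb:.1f} GB, INT4: {int4_gb:.1f} GB, reduction: {fp16_gb / int4_gb:.1f}x")
In practice the savings are somewhat smaller because of quantization constants, layers kept at higher precision, and activation memory, which is exactly why we measure rather than estimate.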
Before we begin, ensure you have the necessary libraries installed. We'll primarily use Hugging Face libraries for model loading, generation, and evaluation. You'll also need torch for core functionality, and potentially bitsandbytes if you're loading models quantized with it, optimum with auto-gptq for GPTQ models, or ctransformers for GGUF models.
pip install transformers torch accelerate datasets evaluate optimum bitsandbytes sentencepiece
# If using GGUF models via the ctransformers library:
pip install ctransformers
# Or if using GPTQ models with the AutoGPTQ backend through optimum:
pip install auto-gptq
For this exercise, you'll need two versions of an LLM: the original full-precision (FP32 or FP16) model, and a quantized counterpart produced by post-training quantization (for example with bitsandbytes) or even QAT. You can either use a model you quantized in a previous exercise or load pre-quantized models available on the Hugging Face Hub. For consistency and faster execution, consider using a smaller model like gpt2, distilgpt2, or a small variant of Mistral/Llama if available in both original and quantized forms. Ensure you have access to a CUDA-enabled GPU for meaningful performance measurements.
Let's define our model identifiers. Replace these with the actual Hugging Face Hub IDs or local paths for your chosen models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time
import numpy as np
# --- Configuration ---
base_model_id = "gpt2" # Example: Replace with your base model (e.g., "meta-llama/Llama-2-7b-hf")
quantized_model_id = "gpt2" # Example: Replace with your quantized model ID or path
# If loading 4-bit with bitsandbytes, the ID might be the same, but loading differs
load_in_4bit = True # Set to True if using bitsandbytes 4-bit quantization on load
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
# --- Load Tokenizer (usually the same for both) ---
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Set pad token if missing
# --- Load Base Model ---
print(f"Loading base model: {base_model_id}")
# Load in half-precision if using GPU and model supports it
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    device_map=device  # Simple loading to the target device
)
base_model.eval() # Set to evaluation mode
print("Base model loaded.")
# --- Load Quantized Model ---
print(f"Loading quantized model: {quantized_model_id}")
if load_in_4bit and device == "cuda":
    from transformers import BitsAndBytesConfig

    # Example: loading with bitsandbytes 4-bit quantization
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16  # Optional: compute dtype
    )
    quantized_model = AutoModelForCausalLM.from_pretrained(
        quantized_model_id,
        quantization_config=bnb_config,
        device_map=device  # bitsandbytes handles device placement
    )
    print("Quantized model loaded using bitsandbytes 4-bit config.")
# Add alternative loading logic here if needed. For example, GGUF files are typically
# served by a dedicated runtime such as the ctransformers library:
# elif quantized_model_id.endswith(".gguf"):
#     from ctransformers import AutoModelForCausalLM as CTAutoModelForCausalLM
#     quantized_model = CTAutoModelForCausalLM.from_pretrained(quantized_model_id)
#     print("Quantized GGUF model loaded using ctransformers.")
else:
    # Assume a standard Hugging Face checkpoint (e.g., a QAT model, or a GPTQ model,
    # which from_pretrained can usually load directly when optimum and auto-gptq are installed)
    # Adjust dtype and loading if necessary based on the specific quantized format
    quantized_model = AutoModelForCausalLM.from_pretrained(
        quantized_model_id,
        torch_dtype=torch.float16 if device == "cuda" else torch.float32,  # May need adjustment
        device_map=device
    )
    print("Quantized model loaded (adjust loading logic if not using bitsandbytes 4-bit).")
quantized_model.eval() # Set to evaluation mode
Make sure to adapt the loading logic for the quantized_model based on how it was quantized and saved. The example shows loading via bitsandbytes. If you have a GPTQ model, transformers can usually load it directly when optimum and auto-gptq are installed; for GGUF files, you would typically use a dedicated runtime such as ctransformers.
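Before benchmarking, a quick qualitative sanity check (a short greedy generation from both models) helps confirm that the quantized checkpoint loaded correctly and still produces coherent text:
# Quick sanity check: both models should produce a coherent continuation
test_prompt = "Quantization is a technique that"
test_inputs = tokenizer(test_prompt, return_tensors="pt").to(device)
for name, model in [("base", base_model), ("quantized", quantized_model)]:
    with torch.no_grad():
        out = model.generate(**test_inputs, max_new_tokens=30, do_sample=False,
                             pad_token_id=tokenizer.pad_token_id)
    print(f"[{name}] {tokenizer.decode(out[0], skip_special_tokens=True)}")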
Now, let's measure speed and memory usage. We'll look at latency (time per generation) and peak memory consumption.
Latency is the time taken to process a single input. We can measure it by timing the generate function. For accurate GPU timing, it's important to call torch.cuda.synchronize() before reading both the start and end timestamps, so that all queued GPU work has actually finished.
# --- Latency Benchmark ---
prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
# Warm-up run (important for CUDA)
_ = base_model.generate(**inputs, max_new_tokens=50)
_ = quantized_model.generate(**inputs, max_new_tokens=50)
if device == "cuda":
    torch.cuda.synchronize()

# Measure Base Model Latency
start_time = time.perf_counter()
_ = base_model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.pad_token_id)
if device == "cuda":
    torch.cuda.synchronize()
end_time = time.perf_counter()
base_latency = end_time - start_time
print(f"Base Model Latency: {base_latency:.4f} seconds")

# Measure Quantized Model Latency
start_time = time.perf_counter()
_ = quantized_model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.pad_token_id)
if device == "cuda":
    torch.cuda.synchronize()
end_time = time.perf_counter()
quantized_latency = end_time - start_time
print(f"Quantized Model Latency: {quantized_latency:.4f} seconds")

if base_latency > 0:
    speedup = base_latency / quantized_latency
    print(f"Speedup: {speedup:.2f}x")
else:
    print("Base latency was zero, cannot calculate speedup.")
Run this multiple times or average over several prompts for more stable results. You should observe lower latency for the quantized model, especially on hardware with native support for lower-precision operations.
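As a sketch of that averaging, the helper below (a hypothetical measure_latency function, not part of any library used here) times several prompts over multiple runs and reports mean latency plus generated tokens per second:
def measure_latency(model, prompts, n_runs=5, max_new_tokens=50):
    """Average generation latency (s) and throughput (tokens/s) over prompts and runs."""
    latencies, token_counts = [], []
    for prompt in prompts:
        batch = tokenizer(prompt, return_tensors="pt").to(device)
        for _ in range(n_runs):
            if device == "cuda":
                torch.cuda.synchronize()
            start = time.perf_counter()
            out = model.generate(**batch, max_new_tokens=max_new_tokens,
                                 pad_token_id=tokenizer.pad_token_id)
            if device == "cuda":
                torch.cuda.synchronize()
            latencies.append(time.perf_counter() - start)
            token_counts.append(out.shape[1] - batch.input_ids.shape[1])
    return float(np.mean(latencies)), float(np.sum(token_counts) / np.sum(latencies))

# Example usage with a few prompts
prompts = ["The future of AI is", "Quantization reduces", "In machine learning,"]
for name, model in [("base", base_model), ("quantized", quantized_model)]:
    lat, tps = measure_latency(model, prompts)
    print(f"{name}: mean latency {lat:.4f}s, throughput {tps:.1f} tokens/s")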
We can estimate peak GPU memory usage during inference using PyTorch's memory management functions.
# --- Memory Benchmark ---
def get_peak_memory_mb(model, inputs):
    """Measures peak GPU memory usage for a model generation."""
    if device != "cuda":
        print("Memory measurement only available for CUDA devices.")
        return 0
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    try:
        _ = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.pad_token_id)
        torch.cuda.synchronize()
        peak_memory = torch.cuda.max_memory_allocated() / (1024**2)  # Convert bytes to MB
    except Exception as e:
        print(f"Error during memory measurement: {e}")
        peak_memory = 0
    torch.cuda.empty_cache()
    return peak_memory
# Measure Base Model Memory
base_memory_mb = get_peak_memory_mb(base_model, inputs)
print(f"Base Model Peak Memory: {base_memory_mb:.2f} MB")
# Measure Quantized Model Memory
quantized_memory_mb = get_peak_memory_mb(quantized_model, inputs)
print(f"Quantized Model Peak Memory: {quantized_memory_mb:.2f} MB")
if base_memory_mb > 0 and quantized_memory_mb > 0:
    memory_reduction = base_memory_mb / quantized_memory_mb
    print(f"Memory Reduction Factor: {memory_reduction:.2f}x")
    print(f"Memory Savings: {(1 - (quantized_memory_mb / base_memory_mb)) * 100:.2f}%")
else:
    print("Memory usage was zero or could not be measured.")
The quantized model should exhibit significantly lower peak memory usage, directly correlating with the reduced precision of its weights (and potentially activations, depending on the method).
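If you also want to compare the static size of the weights themselves, separately from activation and KV-cache memory, recent versions of transformers expose a get_memory_footprint() method on loaded models (returning bytes). A quick check might look like this:
# Static weight footprint in MB (get_memory_footprint returns bytes)
base_weights_mb = base_model.get_memory_footprint() / (1024**2)
quant_weights_mb = quantized_model.get_memory_footprint() / (1024**2)
print(f"Base model weights: {base_weights_mb:.2f} MB")
print(f"Quantized model weights: {quant_weights_mb:.2f} MB")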
Performance gains are only valuable if the model remains sufficiently accurate for its intended task. We need to evaluate the quantized model's quality using appropriate metrics.
Perplexity is a common intrinsic metric for language models: it measures how well a model predicts a sequence of text, and lower perplexity generally indicates a better model. We'll compute it directly from each model's loss on a standard dataset such as WikiText-2, loaded with the datasets library.
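Formally, for a tokenized sequence $x_1, \dots, x_N$, perplexity is the exponentiated average negative log-likelihood the model assigns to the sequence:

$$\text{PPL}(X) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\left(x_i \mid x_{<i}\right)\right)$$

The sliding-window loop below accumulates exactly this quantity for both models by scaling each window's loss by its number of target tokens.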
from datasets import load_dataset
from tqdm import tqdm

# --- Perplexity Evaluation ---
try:
    # Use a small subset for faster evaluation in this example
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test[:1%]")
    # Or use the full test set: split="test"

    # Preprocess text data (concatenate and chunk) for perplexity calculation
    # This approach processes the entire dataset text as one sequence
    encodings = tokenizer("\n\n".join(dataset["text"]), return_tensors="pt")

    # Model's max context length (n_positions for GPT-2, max_position_embeddings for most others)
    max_length = getattr(base_model.config, "n_positions", None) or base_model.config.max_position_embeddings
    stride = 512  # How far the window advances between chunks
    seq_len = encodings.input_ids.size(1)

    nlls_base = []
    nlls_quantized = []
    prev_end_loc = 0

    print("Calculating perplexity...")
    for begin_loc in tqdm(range(0, seq_len, stride)):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
        input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100  # Ignore loss calculation for overlapping tokens

        with torch.no_grad():
            # Base model
            outputs_base = base_model(input_ids, labels=target_ids)
            neg_log_likelihood_base = outputs_base.loss * trg_len  # Scale loss by target length
            nlls_base.append(neg_log_likelihood_base)

            # Quantized model
            outputs_quantized = quantized_model(input_ids, labels=target_ids)
            neg_log_likelihood_quantized = outputs_quantized.loss * trg_len  # Scale loss by target length
            nlls_quantized.append(neg_log_likelihood_quantized)

        prev_end_loc = end_loc
        if end_loc == seq_len:
            break

    # Calculate Perplexity
    ppl_base = torch.exp(torch.stack(nlls_base).sum() / end_loc)
    ppl_quantized = torch.exp(torch.stack(nlls_quantized).sum() / end_loc)

    print(f"\nBase Model Perplexity: {ppl_base.item():.4f}")
    print(f"Quantized Model Perplexity: {ppl_quantized.item():.4f}")
    print(f"Perplexity Increase: {ppl_quantized.item() - ppl_base.item():.4f}")

except ImportError:
    print("Please install 'datasets' and 'tqdm' to run the perplexity benchmark.")
except Exception as e:
    print(f"Could not run perplexity evaluation: {e}")
This perplexity calculation uses a sliding-window approach common for causal LMs. Note that calculating perplexity over a large dataset can be time-consuming; using a smaller subset (split="test[:1%]") provides a quick estimate.
Alternatively, or additionally, you can evaluate the models on a specific task they might be used for, such as question answering, summarization, or sentiment analysis. This often provides a more direct measure of practical utility. Frameworks like lm-evaluation-harness are designed for this purpose, offering standardized evaluation setups across many tasks. Setting up lm-evaluation-harness is more involved and outside the scope of this practical, but it is the standard for rigorous LLM evaluation.
For a simpler task-based evaluation within this notebook, you could adapt the generation loop to run on a task-specific dataset (e.g., BoolQ for yes/no questions) and compute accuracy.
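As an illustration, here is a minimal zero-shot sketch for BoolQ, assuming the boolq dataset on the Hub with its question, passage, and boolean answer fields. It scores each example by comparing the model's loss on the prompt followed by " yes" versus " no", which is a rough heuristic rather than a rigorous harness:
from datasets import load_dataset

def boolq_accuracy(model, n_examples=50):
    """Rough zero-shot BoolQ accuracy: compare loss of ' yes' vs. ' no' completions."""
    data = load_dataset("boolq", split=f"validation[:{n_examples}]")
    correct = 0
    for ex in data:
        # Trim long passages so the prompt fits comfortably in a small model's context
        prompt = f"{ex['passage'][:1000]}\nQuestion: {ex['question']}?\nAnswer:"
        losses = {}
        for candidate in (" yes", " no"):
            enc = tokenizer(prompt + candidate, return_tensors="pt").to(device)
            with torch.no_grad():
                losses[candidate] = model(**enc, labels=enc.input_ids).loss.item()
        predicted_yes = losses[" yes"] < losses[" no"]  # Lower loss = more likely sequence
        correct += int(predicted_yes == ex["answer"])
    return correct / len(data)

print(f"Base BoolQ accuracy: {boolq_accuracy(base_model):.2%}")
print(f"Quantized BoolQ accuracy: {boolq_accuracy(quantized_model):.2%}")
Expect noisy numbers from only 50 examples and a small model; this is just to show the pattern, not a substitute for lm-evaluation-harness.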
Now, consolidate your findings. A simple table is often effective:
| Metric | Base Model | Quantized Model | Change |
|---|---|---|---|
| Latency (seconds) | base_latency | quant_latency | speedup x faster |
| Peak Memory (MB) | base_memory | quant_memory | reduction x smaller |
| Perplexity | ppl_base | ppl_quantized | increase higher |
| Task Accuracy (%) | (if measured) | (if measured) | (difference) |
(Replace placeholder values with your measured results)
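If you prefer to build the summary programmatically, a small cell like the following (assuming the variables measured in the earlier cells are available) prints a plain-text version of the table:
# Consolidate results measured above (assumes the earlier cells ran successfully)
rows = [
    ("Latency (s)", f"{base_latency:.4f}", f"{quantized_latency:.4f}",
     f"{base_latency / quantized_latency:.2f}x faster"),
    ("Peak memory (MB)", f"{base_memory_mb:.1f}", f"{quantized_memory_mb:.1f}",
     f"{base_memory_mb / max(quantized_memory_mb, 1e-9):.2f}x smaller"),
    ("Perplexity", f"{ppl_base.item():.3f}", f"{ppl_quantized.item():.3f}",
     f"+{ppl_quantized.item() - ppl_base.item():.3f}"),
]
print(f"{'Metric':<18}{'Base':>12}{'Quantized':>12}{'Change':>16}")
for metric, base_val, quant_val, change in rows:
    print(f"{metric:<18}{base_val:>12}{quant_val:>12}{change:>16}")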
Visualize the trade-off. A simple scatter plot comparing speedup/memory reduction against the accuracy drop (e.g., percentage increase in perplexity) can be insightful.
Example trade-off visualization: performance gains (speedup, memory reduction) plotted against the evaluation cost (perplexity increase). The example shows a hypothetical 1.85x speedup with a 5.2% increase in perplexity for a 4-bit quantized model; replace these values with your own measurements.
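To produce a similar plot from your own numbers, a minimal sketch (assuming matplotlib is installed, e.g. via pip install matplotlib, and that the speedup, memory_reduction, ppl_base, and ppl_quantized variables from the earlier cells exist) could look like this:
import matplotlib.pyplot as plt

# Quality cost: percentage increase in perplexity relative to the base model
ppl_increase_pct = (ppl_quantized.item() / ppl_base.item() - 1) * 100

fig, ax = plt.subplots()
ax.scatter([ppl_increase_pct], [speedup], label="Speedup (x)")
ax.scatter([ppl_increase_pct], [memory_reduction], label="Memory reduction (x)")
ax.set_xlabel("Perplexity increase (%)")
ax.set_ylabel("Efficiency gain (x)")
ax.set_title("Quantization trade-off")
ax.legend()
plt.show()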
Interpret the results: do the efficiency gains (speedup, memory savings) justify the quality cost (higher perplexity or lower task accuracy) for your intended deployment?
This hands-on benchmarking process provides concrete data to guide decisions about deploying quantized models, ensuring you balance efficiency gains with the required level of predictive performance.