Now that we've explored the various metrics and methodologies for evaluating quantized LLMs, let's put this knowledge into practice. This section provides a hands-on guide to setting up and executing a benchmark comparing a quantized LLM against its full-precision counterpart. We will measure performance characteristics like latency, throughput, and memory usage, alongside evaluating the impact on model quality using perplexity.
Our goal is to obtain concrete data points that illustrate the trade-offs involved in quantization, enabling informed decisions about model deployment.
Before starting, ensure you have the necessary environment set up. This practical assumes you have access to a machine with a CUDA-enabled GPU and have installed the required Python libraries.
pip install transformers
pip install accelerate
pip install datasets
pip install evaluate
pip install bitsandbytes  # only needed if evaluating a bitsandbytes-quantized model
pip install auto-gptq     # for GPTQ-quantized models
pip install autoawq       # or, for AWQ-quantized models
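As a quick sanity check before downloading any models, you can confirm that PyTorch sees your GPU. This is a minimal sketch; the versions and device name reported will depend on your machine:

```python
import torch

# Verify that a CUDA-capable GPU is visible to PyTorch before benchmarking.
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    total_gb = torch.cuda.get_device_properties(0).total_memory / (1024**3)
    print(f"Total GPU memory: {total_gb:.1f} GB")
```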
For this example, we will compare a baseline FP16 model (e.g., `meta-llama/Llama-2-7b-hf`) with a corresponding INT4 quantized version (e.g., using GPTQ). Replace the model identifiers with the specific models you are evaluating, and ensure the quantized model is compatible with your environment and libraries (e.g., GPTQ checkpoints require `auto-gptq`).
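If you do not have a pre-quantized GPTQ or AWQ checkpoint at hand, one alternative worth noting is to quantize the baseline weights on the fly at load time with bitsandbytes. This uses a different quantization scheme than GPTQ, so its results are not directly comparable, but it lets you run the same benchmark without a separate quantized repository. A minimal sketch, assuming `bitsandbytes` is installed:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# On-the-fly 4-bit (NF4) quantization at load time via bitsandbytes.
# Note: this is not GPTQ; accuracy and speed characteristics will differ.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # same baseline checkpoint as above
    quantization_config=bnb_config,
    device_map="auto",
)
```

The benchmark script below assumes you are loading a pre-quantized checkpoint.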
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time
import evaluate
from datasets import load_dataset
import numpy as np
# Configuration
baseline_model_id = "meta-llama/Llama-2-7b-hf" # Replace with your baseline model
quantized_model_id = "TheBloke/Llama-2-7B-GPTQ" # Replace with your quantized model
device = "cuda" if torch.cuda.is_available() else "cpu"
num_samples = 50 # Number of samples for benchmarking latency/throughput
max_new_tokens = 100 # Number of tokens to generate
perplexity_dataset = "wikitext"
perplexity_dataset_config = "wikitext-2-raw-v1"
perplexity_split = "test"
perplexity_max_samples = 50 # Reduce for faster evaluation
# Sample prompt for generation tasks
prompt = "The field of Large Language Models is "
print(f"Using device: {device}")
if device == "cpu":
    print("Warning: benchmarking on CPU will be much slower, and the CUDA-based timing and memory measurements below require a GPU.")
# --- Helper Functions ---
def load_model_and_tokenizer(model_id, is_quantized=False):
    print(f"Loading model: {model_id}...")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # Set pad token if missing
    model_kwargs = {"device_map": "auto"}
    if is_quantized:
        # Add quantization-specific loading arguments for your model type here.
        # Example for AutoGPTQ checkpoints:
        # model_kwargs["use_safetensors"] = True
        # model_kwargs["trust_remote_code"] = True  # Be cautious with trust_remote_code
        pass
    else:
        # Load the baseline in FP16; the FP32 default would double its memory footprint.
        model_kwargs["torch_dtype"] = torch.float16
    model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
    model.eval()  # Set model to evaluation mode
    print("Model loaded.")
    return model, tokenizer
def measure_latency(model, tokenizer, prompt, max_new_tokens, num_runs=10):
    print("Measuring latency...")
    latencies = []
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # Warm-up run
    with torch.no_grad():
        _ = model.generate(**inputs, max_new_tokens=max_new_tokens, pad_token_id=tokenizer.pad_token_id)
    torch.cuda.synchronize()  # Ensure GPU operation is complete
    for _ in range(num_runs):
        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()
        with torch.no_grad():
            _ = model.generate(**inputs, max_new_tokens=max_new_tokens, pad_token_id=tokenizer.pad_token_id)
        end_event.record()
        torch.cuda.synchronize()  # Wait for the operation to complete
        latency_ms = start_event.elapsed_time(end_event)
        latencies.append(latency_ms)
        # print(f"Run latency: {latency_ms:.2f} ms")  # Optional: print individual run latency
    avg_latency = np.mean(latencies)
    print(f"Average latency ({num_runs} runs): {avg_latency:.2f} ms")
    return avg_latency
def measure_throughput(model, tokenizer, prompt, max_new_tokens, num_samples=50):
    print("Measuring throughput...")
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    total_tokens_generated = 0
    total_time_sec = 0
    # Warm-up
    with torch.no_grad():
        _ = model.generate(**inputs, max_new_tokens=max_new_tokens, pad_token_id=tokenizer.pad_token_id)
    torch.cuda.synchronize()
    start_time = time.time()
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    with torch.no_grad():
        # Note: this simple setup does not parallelize requests. Real-world throughput
        # usually involves batching or concurrent requests; here we measure sequential
        # generation speed on one long sequence. Be aware that num_samples * max_new_tokens
        # can exceed the model's context window (4096 tokens for Llama-2); reduce num_samples if needed.
        outputs = model.generate(**inputs, max_new_tokens=num_samples * max_new_tokens, pad_token_id=tokenizer.pad_token_id, do_sample=False)
        generated_tokens = outputs[0][inputs.input_ids.shape[1]:].size(0)  # Count generated tokens, excluding the prompt
        total_tokens_generated = generated_tokens
    end_event.record()
    torch.cuda.synchronize()
    total_time_sec = start_event.elapsed_time(end_event) / 1000.0  # Time in seconds
    # Alternative: CPU timing (less precise for GPU operations)
    # total_time_sec = time.time() - start_time
    throughput_tokens_per_sec = total_tokens_generated / total_time_sec if total_time_sec > 0 else 0
    print(f"Total tokens generated: {total_tokens_generated}")
    print(f"Total time: {total_time_sec:.2f} sec")
    print(f"Throughput: {throughput_tokens_per_sec:.2f} tokens/sec")
    return throughput_tokens_per_sec
def measure_memory_usage(model_load_fn, *args, **kwargs):
    print("Measuring peak memory usage...")
    torch.cuda.reset_peak_memory_stats(device)
    # Load the model within this function to capture its memory footprint
    model, tokenizer = model_load_fn(*args, **kwargs)
    memory_after_load = torch.cuda.max_memory_allocated(device)
    print(f"Memory after load: {memory_after_load / (1024**3):.2f} GB")
    # Perform a sample inference run to capture runtime memory
    inputs = tokenizer("Sample text for memory measurement.", return_tensors="pt").to(device)
    with torch.no_grad():
        _ = model.generate(**inputs, max_new_tokens=10, pad_token_id=tokenizer.pad_token_id)
    torch.cuda.synchronize()
    peak_memory = torch.cuda.max_memory_allocated(device)
    peak_memory_gb = peak_memory / (1024**3)  # Convert bytes to GB
    print(f"Peak memory usage during inference: {peak_memory_gb:.2f} GB")
    # Clean up memory
    del model
    del tokenizer
    torch.cuda.empty_cache()
    return peak_memory_gb
def calculate_perplexity(model, tokenizer, dataset_name, dataset_config, split, max_samples=50):
    print(f"Calculating perplexity on {dataset_name} ({split} split)...")
    try:
        # Use a slice of the split for faster evaluation
        data = load_dataset(dataset_name, dataset_config, split=f"{split}[:{max_samples}]")
        text = "\n\n".join(t for t in data["text"] if t.strip())
        encodings = tokenizer(text, return_tensors="pt")
        # Compute perplexity directly with a strided sliding window over the
        # concatenated text; adjust max_length/stride to your GPU memory.
        max_length, stride = 1024, 512
        seq_len = encodings.input_ids.size(1)
        nlls, prev_end = [], 0
        for begin in range(0, seq_len, stride):
            end = min(begin + max_length, seq_len)
            trg_len = end - prev_end  # Only score tokens not already scored
            input_ids = encodings.input_ids[:, begin:end].to(device)
            target_ids = input_ids.clone()
            target_ids[:, :-trg_len] = -100  # Mask the overlapping context tokens
            with torch.no_grad():
                loss = model(input_ids, labels=target_ids).loss
            nlls.append(loss * trg_len)
            prev_end = end
            if end == seq_len:
                break
        ppl = torch.exp(torch.stack(nlls).sum() / prev_end).item()
        print(f"Perplexity: {ppl:.4f}")
        return ppl
    except Exception as e:
        print(f"Error calculating perplexity: {e}")
        print("Skipping perplexity calculation.")
        return None
# --- Benchmarking Execution ---
results = {}
# Benchmark Baseline Model
print("\n--- Benchmarking Baseline Model ---")
# Measure memory separately to capture peak usage during load
baseline_memory = measure_memory_usage(load_model_and_tokenizer, baseline_model_id, is_quantized=False)
# Load again for other benchmarks
baseline_model, baseline_tokenizer = load_model_and_tokenizer(baseline_model_id, is_quantized=False)
baseline_latency = measure_latency(baseline_model, baseline_tokenizer, prompt, max_new_tokens)
baseline_throughput = measure_throughput(baseline_model, baseline_tokenizer, prompt, max_new_tokens, num_samples)
baseline_perplexity = calculate_perplexity(baseline_model, baseline_tokenizer, perplexity_dataset, perplexity_dataset_config, perplexity_split, perplexity_max_samples)
results["baseline"] = {
"latency_ms": baseline_latency,
"throughput_tokens_sec": baseline_throughput,
"peak_memory_gb": baseline_memory,
"perplexity": baseline_perplexity
}
# Clean up baseline model memory before loading quantized model
print("Cleaning up baseline model...")
del baseline_model
del baseline_tokenizer
torch.cuda.empty_cache()
print("Baseline model cleanup complete.")
# Benchmark Quantized Model
print("\n--- Benchmarking Quantized Model ---")
# Measure memory separately
quantized_memory = measure_memory_usage(load_model_and_tokenizer, quantized_model_id, is_quantized=True)
# Load again for other benchmarks
quantized_model, quantized_tokenizer = load_model_and_tokenizer(quantized_model_id, is_quantized=True)
quantized_latency = measure_latency(quantized_model, quantized_tokenizer, prompt, max_new_tokens)
quantized_throughput = measure_throughput(quantized_model, quantized_tokenizer, prompt, max_new_tokens, num_samples)
quantized_perplexity = calculate_perplexity(quantized_model, quantized_tokenizer, perplexity_dataset, perplexity_dataset_config, perplexity_split, perplexity_max_samples)
results["quantized"] = {
"latency_ms": quantized_latency,
"throughput_tokens_sec": quantized_throughput,
"peak_memory_gb": quantized_memory,
"perplexity": quantized_perplexity
}
# Clean up quantized model memory
print("Cleaning up quantized model...")
del quantized_model
del quantized_tokenizer
torch.cuda.empty_cache()
print("Quantized model cleanup complete.")
# --- Results Analysis ---
print("\n--- Benchmark Results Summary ---")
print(f"{'Metric':<25} {'Baseline':<15} {'Quantized':<15} {'Change (%)':<15}")
print("-" * 70)
# Latency
base_lat = results["baseline"]["latency_ms"]
quant_lat = results["quantized"]["latency_ms"]
lat_change = ((quant_lat - base_lat) / base_lat) * 100 if base_lat else 0
print(f"{'Avg Latency (ms)':<25} {base_lat:<15.2f} {quant_lat:<15.2f} {lat_change:<15.2f}")
# Throughput
base_thr = results["baseline"]["throughput_tokens_sec"]
quant_thr = results["quantized"]["throughput_tokens_sec"]
thr_change = ((quant_thr - base_thr) / base_thr) * 100 if base_thr else 0
print(f"{'Throughput (tokens/sec)':<25} {base_thr:<15.2f} {quant_thr:<15.2f} {thr_change:<15.2f}")
# Memory
base_mem = results["baseline"]["peak_memory_gb"]
quant_mem = results["quantized"]["peak_memory_gb"]
mem_change = ((quant_mem - base_mem) / base_mem) * 100 if base_mem else 0
print(f"{'Peak Memory (GB)':<25} {base_mem:<15.2f} {quant_mem:<15.2f} {mem_change:<15.2f}")
# Perplexity
base_ppl = results["baseline"]["perplexity"]
quant_ppl = results["quantized"]["perplexity"]
if base_ppl is not None and quant_ppl is not None:
    ppl_change = ((quant_ppl - base_ppl) / base_ppl) * 100
    print(f"{'Perplexity':<25} {base_ppl:<15.4f} {quant_ppl:<15.4f} {ppl_change:<15.2f}")
else:
    print(f"{'Perplexity':<25} {'N/A':<15} {'N/A':<15} {'N/A':<15}")
print("-" * 70)
# Optional: Visualization
# Prepare data for plotting. If a measurement failed (None), fall back to
# illustrative placeholder values so the chart below still renders.
base_lat_val = results["baseline"]["latency_ms"] if results["baseline"]["latency_ms"] else 1000
quant_lat_val = results["quantized"]["latency_ms"] if results["quantized"]["latency_ms"] else 500
base_thr_val = results["baseline"]["throughput_tokens_sec"] if results["baseline"]["throughput_tokens_sec"] else 50
quant_thr_val = results["quantized"]["throughput_tokens_sec"] if results["quantized"]["throughput_tokens_sec"] else 100
base_mem_val = results["baseline"]["peak_memory_gb"] if results["baseline"]["peak_memory_gb"] else 15
quant_mem_val = results["quantized"]["peak_memory_gb"] if results["quantized"]["peak_memory_gb"] else 8
base_ppl_val = results["baseline"]["perplexity"] if results["baseline"]["perplexity"] else 5.0
quant_ppl_val = results["quantized"]["perplexity"] if results["quantized"]["perplexity"] else 5.5
```plotly
{"layout": {"title": "Baseline (FP16) vs. Quantized (INT4) LLM Performance", "barmode": "group", "xaxis": {"title": "Metric"}, "yaxis": {"title": "Value"}, "legend_title_text": "Model Type", "height": 400}, "data": [{"type": "bar", "name": "Baseline (FP16)", "x": ["Latency (ms)", "Throughput (tok/s)", "Memory (GB)", "Perplexity"], "y": [base_lat_val, base_thr_val, base_mem_val, base_ppl_val], "marker": {"color": "#4263eb"}}, {"type": "bar", "name": "Quantized (INT4)", "x": ["Latency (ms)", "Throughput (tok/s)", "Memory (GB)", "Perplexity"], "y": [quant_lat_val, quant_thr_val, quant_mem_val, quant_ppl_val], "marker": {"color": "#12b886"}}]}
```
Comparison of performance and quality metrics between the baseline FP16 model and its INT4 quantized version. Lower is better for Latency, Memory, and Perplexity; higher is better for Throughput. (Note: values are illustrative.)
The output table and chart provide a quantitative comparison. You should typically observe:

- Latency and throughput: the INT4 model usually generates tokens faster than the FP16 baseline, though the gain depends on how well the quantization kernels are optimized for your GPU.
- Memory: peak GPU memory drops substantially, and the quantized checkpoint is also much smaller on disk (you can confirm this with `du -sh model_directory`).
- Perplexity: the quantized model's perplexity is slightly higher, reflecting a small loss in modeling quality.

This practical benchmark provides essential data points. Remember that these results are specific to the hardware used, the chosen model, the quantization technique (e.g., GPTQ, AWQ, bitsandbytes), and the specific benchmarking setup (batch size, sequence length, dataset). For production scenarios, you would extend this by:

- Benchmarking with realistic batch sizes and concurrent requests, ideally through a dedicated serving framework (see the batched-generation sketch below).
- Sweeping over prompt lengths and generation lengths that match your actual workload.
- Evaluating quality on task-specific benchmarks rather than perplexity alone.
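As a starting point for the batching item above, here is a minimal sketch (not a production harness) of timing batched generation with the same `transformers` API. The prompts, batch size, and generation length are illustrative, and it assumes `model` and `tokenizer` are already loaded as in the script above:

```python
import time
import torch

# Illustrative batch: eight copies of the benchmark prompt.
prompts = ["The field of Large Language Models is "] * 8

# Decoder-only models should be left-padded for batched generation.
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    outputs = model.generate(**batch, max_new_tokens=100,
                             pad_token_id=tokenizer.pad_token_id)
torch.cuda.synchronize()
elapsed = time.time() - start

# Upper-bound token count: sequences that hit EOS early are padded to the same length.
new_tokens = (outputs.shape[1] - batch.input_ids.shape[1]) * outputs.shape[0]
print(f"Batched throughput: {new_tokens / elapsed:.2f} tokens/sec")
```

Comparing this number across batch sizes shows how much headroom each model has before it saturates the GPU.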
By systematically benchmarking, you can confidently select and deploy quantized models that meet your performance requirements while maintaining acceptable quality.