Applying compression techniques like quantization, pruning, or knowledge distillation introduces a fundamental trade-off: gains in efficiency (smaller size, faster inference, lower memory usage) often come at the cost of some degradation in model performance. Choosing the right compression strategy and configuration requires careful evaluation to find an acceptable balance for your specific application needs. This section provides guidance on how to systematically measure and compare these trade-offs.
To understand the impact of compression, we need to measure changes along two primary axes: model performance and resource efficiency.
The choice of performance metric depends heavily on how the LLM will be used. It's essential to evaluate the metrics that are most relevant to your deployment scenario.
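For general language-modeling quality, perplexity on a held-out corpus is a common first check, while task-specific accuracy matters more for downstream applications. As a reference, here is a minimal sketch of how an evaluate_perplexity helper (like the one assumed in the baseline script below) could be implemented; the list-of-strings dataset format and max_length are illustrative assumptions:
import math
import torch

def evaluate_perplexity(model, tokenizer, texts, device, max_length=1024):
    # Accumulate negative log-likelihood over the evaluation texts,
    # then exponentiate the average to obtain perplexity.
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        enc = tokenizer(
            text, return_tensors="pt", truncation=True, max_length=max_length
        ).to(device)
        with torch.no_grad():
            # Passing labels=input_ids makes the model return the mean LM loss.
            out = model(**enc, labels=enc["input_ids"])
        n_tokens = enc["input_ids"].numel()
        total_nll += out.loss.item() * n_tokens
        total_tokens += n_tokens
    return math.exp(total_nll / total_tokens)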
The efficiency gains from compression should be measured practically on the target deployment hardware and software stack.
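Model size and peak GPU memory are straightforward to read out of PyTorch. Below is a simplified sketch of the get_model_size_mb helper assumed in the baseline script, plus a peak-memory check; note that quantized weights stored in packed formats may need backend-specific accounting, so treat these as approximations:
import torch

def get_model_size_mb(model):
    # Sum the bytes of all parameters and buffers currently held by the model.
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    buffer_bytes = sum(b.numel() * b.element_size() for b in model.buffers())
    return (param_bytes + buffer_bytes) / (1024 ** 2)

def peak_gpu_memory_mb(fn, *args, **kwargs):
    # Run a callable (e.g., a generation call) and report the peak GPU
    # memory allocated while it executed.
    torch.cuda.reset_peak_memory_stats()
    fn(*args, **kwargs)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / (1024 ** 2)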
Before evaluating compressed models, you must establish a solid baseline. Measure the performance and efficiency metrics of your original, uncompressed model on the target hardware and evaluation datasets. This baseline serves as the reference point against which all compressed versions will be compared.
import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assume evaluate_perplexity and evaluate_downstream_task functions exist
# Assume get_model_size_mb and measure_latency functions exist

# --- Configuration ---
model_id = "your_original_llm_checkpoint"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
eval_dataset = [...]  # Your evaluation dataset
test_prompt = "Once upon a time"
num_tokens_to_generate = 50

# --- Load Original Model ---
original_model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
original_model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

# --- Baseline Evaluation ---
with torch.no_grad():
    baseline_perplexity = evaluate_perplexity(
        original_model, tokenizer, eval_dataset, device
    )
    baseline_downstream_score = evaluate_downstream_task(
        original_model, tokenizer, ...
    )

baseline_size_mb = get_model_size_mb(original_model)

# Measure latency (example for generation)
inputs = tokenizer(test_prompt, return_tensors="pt").to(device)
if device.type == "cuda":
    torch.cuda.synchronize()  # Finish pending GPU work before starting the clock
start_time = time.time()
_ = original_model.generate(**inputs, max_new_tokens=num_tokens_to_generate)
if device.type == "cuda":
    torch.cuda.synchronize()  # Wait for generation to complete before stopping the clock
end_time = time.time()
baseline_latency_ms = (
    (end_time - start_time) * 1000 / num_tokens_to_generate
)  # Approx ms per generated token

print("Baseline Metrics:")
print(f"  Perplexity: {baseline_perplexity:.2f}")
print(f"  Downstream Score: {baseline_downstream_score:.4f}")
print(f"  Size (MB): {baseline_size_mb:.1f}")
print(f"  Latency (ms/token): {baseline_latency_ms:.1f}")

# Store these baseline values for comparison
baseline_metrics = {
    "perplexity": baseline_perplexity,
    "downstream_score": baseline_downstream_score,
    "size_mb": baseline_size_mb,
    "latency_ms_per_token": baseline_latency_ms,
}
Once you have a baseline, apply different compression techniques and configurations, then re-evaluate using the same metrics and procedures.
Quantization: Compare Post-Training Quantization (PTQ) at different bit levels (e.g., INT8, INT4) and Quantization-Aware Training (QAT). PTQ is simpler but might cause a larger performance drop, especially at lower bit widths. QAT requires more effort (retraining) but often preserves performance better. Evaluate both model accuracy and actual speedup on target hardware, as theoretical speedups don't always materialize without optimized kernels.
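For example, loading an 8-bit or 4-bit PTQ variant through the transformers/bitsandbytes integration is only a configuration change (this sketch assumes bitsandbytes and accelerate are installed and that the checkpoint supports the integration; QAT would instead require a separate fine-tuning run):
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit post-training quantization
int8_config = BitsAndBytesConfig(load_in_8bit=True)
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=int8_config, device_map="auto"
)

# 4-bit (NF4) post-training quantization
int4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model_int4 = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=int4_config, device_map="auto"
)

# Re-run the same evaluation pipeline on each quantized model and
# compare the results against baseline_metrics.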
Pruning: Evaluate different sparsity levels (e.g., 20%, 40%, 60% sparsity). Compare unstructured pruning (removing individual weights) versus structured pruning (removing entire neurons or attention heads). While unstructured pruning might offer higher compression ratios for a given accuracy drop, structured pruning often leads to more significant practical speedups on standard hardware due to regularity. Measure the performance drop as sparsity increases.
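As an illustration, PyTorch's built-in pruning utilities can apply unstructured magnitude pruning to every linear layer. Keep in mind that this sketch only zeroes weights via masks; without sparse kernels or structured removal it shrinks effective parameters but does not by itself produce speedups:
import copy
import torch.nn as nn
import torch.nn.utils.prune as prune

def apply_unstructured_pruning(model, amount=0.4):
    # Zero out the smallest-magnitude `amount` fraction of weights
    # in every Linear layer (L1 magnitude criterion).
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # make the zeroed weights permanent
    return model

# Evaluate at several sparsity levels, e.g.:
# for sparsity in (0.2, 0.4, 0.6):
#     pruned = apply_unstructured_pruning(copy.deepcopy(original_model), sparsity)
#     ... re-run the evaluation pipeline and record the metrics ...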
Knowledge Distillation: Train student models of varying sizes or architectures. Evaluate the student model's performance against both the original baseline and the larger teacher model. The trade-off involves the student's training cost versus its final size, speed, and performance relative to the baseline.
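A common formulation of the distillation objective combines a soft-target KL term against the teacher's temperature-scaled logits with the usual hard-label cross-entropy. A minimal sketch follows; the temperature and weighting alpha are illustrative choices, not prescribed values:
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft-target term: KL divergence between temperature-softened
    # teacher and student distributions (scaled by T^2, as is standard).
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard-target term: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * soft + (1.0 - alpha) * hard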
Scatter plots are effective for visualizing the relationship between performance and efficiency. Plot a performance metric (e.g., downstream task accuracy) on one axis and an efficiency metric (e.g., latency or model size) on the other. Each point represents a specific compressed model configuration.
Figure: downstream task accuracy versus average latency per generated token. Each point represents a model version (baseline or compressed); lower latency (left) and higher accuracy (top) are generally preferred.
Such plots help identify the "Pareto front" – the set of models where you cannot improve one metric (e.g., reduce latency) without sacrificing another (e.g., lowering accuracy). Models on this front represent the best achievable trade-offs for the evaluated configurations.
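A short matplotlib sketch of such a plot, which also marks the Pareto front; the results argument is assumed to be a list of (name, latency, accuracy) tuples you collect from the evaluation pipeline above:
import matplotlib.pyplot as plt

def plot_tradeoff(results):
    # results: list of (name, latency_ms_per_token, accuracy) tuples
    # from the baseline and each compressed configuration.
    fig, ax = plt.subplots()
    for name, latency, accuracy in results:
        ax.scatter(latency, accuracy)
        ax.annotate(name, (latency, accuracy))
    ax.set_xlabel("Latency (ms/token)")
    ax.set_ylabel("Downstream accuracy")
    ax.set_title("Accuracy vs. latency trade-off")

    # Highlight the Pareto front: configurations that no other point
    # beats on both latency (lower) and accuracy (higher).
    pareto = [
        (n, l, a) for (n, l, a) in results
        if not any(
            (l2 < l and a2 >= a) or (l2 <= l and a2 > a)
            for (_, l2, a2) in results
        )
    ]
    for name, latency, accuracy in pareto:
        ax.scatter(latency, accuracy, marker="x", s=120)
    plt.show()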
It is essential to perform efficiency evaluations (latency, throughput, memory usage) on the specific hardware and software environment intended for deployment. Observed speedups depend on the runtime, available kernels, and batch sizes actually used in production, so measuring ms/token or tokens/sec requires running the model within the target deployment stack. Simple FLOP counts or parameter counts are insufficient proxies for real-world speed.
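Below is a sketch of a measure_latency helper like the one referenced in the baseline script, with warm-up runs and explicit GPU synchronization so that one-off costs and queued kernels do not distort the timings; the run counts and generation settings are illustrative:
import time
import torch

def measure_latency(model, tokenizer, prompt, device,
                    max_new_tokens=50, warmup_runs=2, timed_runs=5):
    # Returns approximate milliseconds per generated token on this
    # hardware/software stack.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Warm-up: the first calls pay one-off costs (allocation, kernel selection).
    for _ in range(warmup_runs):
        model.generate(**inputs, max_new_tokens=max_new_tokens)

    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(timed_runs):
        model.generate(**inputs, max_new_tokens=max_new_tokens)
    if device.type == "cuda":
        torch.cuda.synchronize()
    elapsed = time.time() - start

    return elapsed * 1000 / (timed_runs * max_new_tokens)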
There is rarely a single "best" compressed model. The optimal choice is dictated by the constraints and requirements of your specific application, such as latency targets, memory and storage budgets, throughput needs, and the minimum acceptable accuracy on your key tasks.
The evaluation process is often iterative. You might try several compression techniques and settings, measure their impact using the methods described above, visualize the trade-offs, and select the configuration that best meets your specific performance targets and resource budgets. Always compare against the uncompressed baseline to understand the relative cost and benefit of each compression approach.
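To keep these comparisons systematic, you can compute relative deltas against the stored baseline_metrics for every candidate configuration, for example:
def compare_to_baseline(candidate_metrics, baseline_metrics):
    # Relative change of each metric versus the uncompressed baseline.
    # Negative is good for perplexity, size, and latency; positive is
    # good for the downstream score.
    return {
        key: (candidate_metrics[key] - baseline_metrics[key]) / baseline_metrics[key]
        for key in baseline_metrics
    }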