Now that we've explored the various metrics and methodologies for evaluating quantized LLMs, let's put this knowledge into practice. This section provides a hands-on guide to setting up and running a benchmark that compares a quantized LLM against its full-precision counterpart. We will measure performance characteristics such as latency, throughput, and memory usage, and evaluate the impact on model quality using perplexity. Our goal is to obtain concrete data points that illustrate the trade-offs involved in quantization, enabling informed decisions about model deployment.

## Prerequisites and Setup

Before starting, ensure you have the necessary environment set up. This practical assumes you have access to a machine with a CUDA-enabled GPU and have installed the required Python libraries:

- **Python:** version 3.8 or higher.
- **PyTorch:** installed with CUDA support.
- **Transformers:** `pip install transformers`
- **Accelerate:** `pip install accelerate`
- **Datasets:** `pip install datasets`
- **Evaluate:** `pip install evaluate`
- **bitsandbytes** (if benchmarking a bitsandbytes-quantized model): `pip install bitsandbytes`
- **AutoGPTQ/AutoAWQ** (if benchmarking GPTQ/AWQ models): `pip install auto-gptq` or `pip install autoawq`

For this example, we will compare a baseline FP16 model (e.g., `meta-llama/Llama-2-7b-hf`) with a corresponding INT4 quantized version (e.g., a GPTQ checkpoint). Replace the model identifiers with the specific models you are evaluating, and make sure the quantized model is compatible with your environment and libraries (e.g., GPTQ checkpoints require `auto-gptq`).
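If you prefer not to download a pre-quantized checkpoint, you can also quantize on the fly at load time with bitsandbytes. The snippet below is a minimal sketch of that alternative (it is not part of the main benchmark script), assuming `bitsandbytes` is installed and reusing the baseline checkpoint named above:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch: load the FP16 baseline checkpoint directly in 4-bit (NF4) via bitsandbytes.
# This is an alternative to benchmarking a pre-quantized GPTQ/AWQ checkpoint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",      # same baseline checkpoint as above
    quantization_config=bnb_config,
    device_map="auto",
)
```

The full benchmarking script below instead loads the pre-quantized GPTQ checkpoint, defines a helper function for each measurement, and then runs the helpers for both models.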
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time
import evaluate
from datasets import load_dataset
import numpy as np

# Configuration
baseline_model_id = "meta-llama/Llama-2-7b-hf"   # Replace with your baseline model
quantized_model_id = "TheBloke/Llama-2-7B-GPTQ"  # Replace with your quantized model
device = "cuda" if torch.cuda.is_available() else "cpu"
num_samples = 50        # Number of samples for benchmarking latency/throughput
max_new_tokens = 100    # Number of tokens to generate
perplexity_dataset = "wikitext"
perplexity_dataset_config = "wikitext-2-raw-v1"
perplexity_split = "test"
perplexity_max_samples = 50  # Reduce for faster evaluation

# Sample prompt for generation tasks
prompt = "The field of Large Language Models is "

print(f"Using device: {device}")
if device == "cpu":
    print("Warning: Benchmarking on CPU is significantly slower and memory usage patterns differ.")

# --- Helper Functions ---

def load_model_and_tokenizer(model_id, is_quantized=False):
    print(f"Loading model: {model_id}...")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # Set pad token if missing

    model_kwargs = {"device_map": "auto"}
    if is_quantized:
        # Add quantization-specific loading arguments for your model type here.
        # Example for AutoGPTQ checkpoints:
        # model_kwargs["use_safetensors"] = True
        # model_kwargs["trust_remote_code"] = True  # Be cautious with trust_remote_code
        pass

    model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
    model.eval()  # Set model to evaluation mode
    print("Model loaded.")
    return model, tokenizer


def measure_latency(model, tokenizer, prompt, max_new_tokens, num_runs=10):
    print("Measuring latency...")
    latencies = []
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Warm-up run
    with torch.no_grad():
        _ = model.generate(**inputs, max_new_tokens=max_new_tokens,
                           pad_token_id=tokenizer.pad_token_id)
    torch.cuda.synchronize()  # Ensure GPU operations are complete

    for _ in range(num_runs):
        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()
        with torch.no_grad():
            _ = model.generate(**inputs, max_new_tokens=max_new_tokens,
                               pad_token_id=tokenizer.pad_token_id)
        end_event.record()
        torch.cuda.synchronize()  # Wait for the operation to complete
        latency_ms = start_event.elapsed_time(end_event)
        latencies.append(latency_ms)
        # print(f"Run latency: {latency_ms:.2f} ms")  # Optional: per-run latency

    avg_latency = np.mean(latencies)
    print(f"Average latency ({num_runs} runs): {avg_latency:.2f} ms")
    return avg_latency


def measure_throughput(model, tokenizer, prompt, max_new_tokens, num_samples=50):
    print("Measuring throughput...")
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Warm-up
    with torch.no_grad():
        _ = model.generate(**inputs, max_new_tokens=max_new_tokens,
                           pad_token_id=tokenizer.pad_token_id)
    torch.cuda.synchronize()

    start_time = time.time()
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    with torch.no_grad():
        # Note: this does not parallelize requests. Server throughput usually
        # involves batching or concurrent requests; for simplicity we measure
        # sequential generation speed over one long sequence here. Keep
        # num_samples * max_new_tokens within the model's context window.
        outputs = model.generate(**inputs, max_new_tokens=num_samples * max_new_tokens,
                                 pad_token_id=tokenizer.pad_token_id, do_sample=False)
    end_event.record()
    torch.cuda.synchronize()

    total_tokens_generated = outputs[0][inputs.input_ids.shape[1]:].size(0)  # Exclude prompt tokens
    total_time_sec = start_event.elapsed_time(end_event) / 1000.0  # Time in seconds
    # Alternative: CPU timing (less precise for GPU operations)
    # total_time_sec = time.time() - start_time

    throughput_tokens_per_sec = total_tokens_generated / total_time_sec if total_time_sec > 0 else 0
    print(f"Total tokens generated: {total_tokens_generated}")
    print(f"Total time: {total_time_sec:.2f} sec")
    print(f"Throughput: {throughput_tokens_per_sec:.2f} tokens/sec")
    return throughput_tokens_per_sec
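
# Optional sketch (not part of the original benchmark): a rough estimate of
# throughput under batching, which is closer to how a server would be driven.
# Assumes the pad token set above and enough GPU memory for the chosen batch size.
def measure_batched_throughput(model, tokenizer, prompt, max_new_tokens, batch_size=8):
    tokenizer.padding_side = "left"  # Left-pad so generation continues from the prompt
    inputs = tokenizer([prompt] * batch_size, return_tensors="pt", padding=True).to(device)
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                 pad_token_id=tokenizer.pad_token_id, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    # Identical prompts with greedy decoding finish together, so this count is exact.
    generated = (outputs.shape[1] - inputs.input_ids.shape[1]) * batch_size
    return generated / elapsed if elapsed > 0 else 0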

def measure_memory_usage(model_load_fn, *args, **kwargs):
    print("Measuring peak memory usage...")
    torch.cuda.reset_peak_memory_stats(device)

    # Load the model inside this function to capture its memory footprint
    model, tokenizer = model_load_fn(*args, **kwargs)
    memory_after_load = torch.cuda.max_memory_allocated(device)
    print(f"Memory after load: {memory_after_load / (1024**3):.2f} GB")

    # Perform a sample inference run to capture runtime memory
    inputs = tokenizer("Sample text for memory measurement.", return_tensors="pt").to(device)
    with torch.no_grad():
        _ = model.generate(**inputs, max_new_tokens=10, pad_token_id=tokenizer.pad_token_id)
    torch.cuda.synchronize()

    peak_memory_gb = torch.cuda.max_memory_allocated(device) / (1024**3)  # Bytes to GB
    print(f"Peak memory usage during inference: {peak_memory_gb:.2f} GB")

    # Clean up memory
    del model
    del tokenizer
    torch.cuda.empty_cache()
    return peak_memory_gb


def calculate_perplexity(model, tokenizer, dataset_name, dataset_config, split, max_samples=50):
    # The `evaluate` perplexity metric reloads a model from a hub ID, so we score
    # the already-loaded model directly with a standard sliding-window
    # negative log-likelihood instead.
    print(f"Calculating perplexity on {dataset_name} ({split} split)...")
    try:
        data = load_dataset(dataset_name, dataset_config, split=f"{split}[:{max_samples}]")  # Slice for faster eval
        text = "\n\n".join(t for t in data["text"] if t.strip())
        encodings = tokenizer(text, return_tensors="pt")

        max_length = getattr(model.config, "max_position_embeddings", 2048)
        stride = 512
        seq_len = encodings.input_ids.size(1)

        nlls = []
        prev_end = 0
        for begin in range(0, seq_len, stride):
            end = min(begin + max_length, seq_len)
            trg_len = end - prev_end  # Tokens actually scored in this window
            input_ids = encodings.input_ids[:, begin:end].to(device)
            target_ids = input_ids.clone()
            target_ids[:, :-trg_len] = -100  # Ignore the overlapping context tokens
            with torch.no_grad():
                loss = model(input_ids, labels=target_ids).loss
            nlls.append(loss * trg_len)
            prev_end = end
            if end == seq_len:
                break

        ppl = torch.exp(torch.stack(nlls).sum() / prev_end).item()
        print(f"Perplexity: {ppl:.4f}")
        return ppl
    except Exception as e:
        print(f"Error calculating perplexity: {e}")
        print("Skipping perplexity calculation.")
        return None


# --- Benchmarking Execution ---
results = {}

# Benchmark Baseline Model
print("\n--- Benchmarking Baseline Model ---")
# Measure memory separately to capture peak usage during load
baseline_memory = measure_memory_usage(load_model_and_tokenizer, baseline_model_id, is_quantized=False)

# Load again for the other benchmarks
baseline_model, baseline_tokenizer = load_model_and_tokenizer(baseline_model_id, is_quantized=False)
baseline_latency = measure_latency(baseline_model, baseline_tokenizer, prompt, max_new_tokens)
baseline_throughput = measure_throughput(baseline_model, baseline_tokenizer, prompt, max_new_tokens, num_samples)
baseline_perplexity = calculate_perplexity(baseline_model, baseline_tokenizer, perplexity_dataset,
                                           perplexity_dataset_config, perplexity_split, perplexity_max_samples)

results["baseline"] = {
    "latency_ms": baseline_latency,
    "throughput_tokens_sec": baseline_throughput,
    "peak_memory_gb": baseline_memory,
    "perplexity": baseline_perplexity,
}

# Clean up baseline model memory before loading the quantized model
print("Cleaning up baseline model...")
del baseline_model
del baseline_tokenizer
torch.cuda.empty_cache()
print("Baseline model cleanup complete.")

# Benchmark Quantized Model
print("\n--- Benchmarking Quantized Model ---")
# Measure memory separately
quantized_memory = measure_memory_usage(load_model_and_tokenizer, quantized_model_id, is_quantized=True)

# Load again for the other benchmarks
quantized_model, quantized_tokenizer = load_model_and_tokenizer(quantized_model_id, is_quantized=True)
quantized_latency = measure_latency(quantized_model, quantized_tokenizer, prompt, max_new_tokens)
quantized_throughput = measure_throughput(quantized_model, quantized_tokenizer, prompt, max_new_tokens, num_samples)
quantized_perplexity = calculate_perplexity(quantized_model, quantized_tokenizer, perplexity_dataset,
                                            perplexity_dataset_config, perplexity_split, perplexity_max_samples)

results["quantized"] = {
    "latency_ms": quantized_latency,
    "throughput_tokens_sec": quantized_throughput,
    "peak_memory_gb": quantized_memory,
    "perplexity": quantized_perplexity,
}

# Clean up quantized model memory
print("Cleaning up quantized model...")
del quantized_model
del quantized_tokenizer
torch.cuda.empty_cache()
print("Quantized model cleanup complete.")

# --- Results Analysis ---
print("\n--- Benchmark Results Summary ---")
print(f"{'Metric':<25} {'Baseline':<15} {'Quantized':<15} {'Change (%)':<15}")
print("-" * 70)

# Latency
base_lat = results["baseline"]["latency_ms"]
quant_lat = results["quantized"]["latency_ms"]
lat_change = ((quant_lat - base_lat) / base_lat) * 100 if base_lat else 0
print(f"{'Avg Latency (ms)':<25} {base_lat:<15.2f} {quant_lat:<15.2f} {lat_change:<15.2f}")

# Throughput
base_thr = results["baseline"]["throughput_tokens_sec"]
quant_thr = results["quantized"]["throughput_tokens_sec"]
thr_change = ((quant_thr - base_thr) / base_thr) * 100 if base_thr else 0
print(f"{'Throughput (tokens/sec)':<25} {base_thr:<15.2f} {quant_thr:<15.2f} {thr_change:<15.2f}")
# Memory
base_mem = results["baseline"]["peak_memory_gb"]
quant_mem = results["quantized"]["peak_memory_gb"]
mem_change = ((quant_mem - base_mem) / base_mem) * 100 if base_mem else 0
print(f"{'Peak Memory (GB)':<25} {base_mem:<15.2f} {quant_mem:<15.2f} {mem_change:<15.2f}")

# Perplexity
base_ppl = results["baseline"]["perplexity"]
quant_ppl = results["quantized"]["perplexity"]
if base_ppl is not None and quant_ppl is not None:
    ppl_change = ((quant_ppl - base_ppl) / base_ppl) * 100
    print(f"{'Perplexity':<25} {base_ppl:<15.4f} {quant_ppl:<15.4f} {ppl_change:<15.2f}")
else:
    print(f"{'Perplexity':<25} {'N/A':<15} {'N/A':<15} {'N/A':<15}")
print("-" * 70)

# Optional: Visualization
# Prepare data for plotting (replace with actual numbers from your run).
# The fallbacks are illustrative placeholder values.
base_lat_val = results["baseline"]["latency_ms"] or 1000
quant_lat_val = results["quantized"]["latency_ms"] or 500
base_thr_val = results["baseline"]["throughput_tokens_sec"] or 50
quant_thr_val = results["quantized"]["throughput_tokens_sec"] or 100
base_mem_val = results["baseline"]["peak_memory_gb"] or 15
quant_mem_val = results["quantized"]["peak_memory_gb"] or 8
base_ppl_val = results["baseline"]["perplexity"] or 5.0
quant_ppl_val = results["quantized"]["perplexity"] or 5.5
```

```plotly
{"layout": {"title": "Baseline (FP16) vs. Quantized (INT4) LLM Performance", "barmode": "group", "xaxis": {"title": "Metric"}, "yaxis": {"title": "Value"}, "legend_title_text": "Model Type", "height": 400}, "data": [{"type": "bar", "name": "Baseline (FP16)", "x": ["Latency (ms)", "Throughput (tok/s)", "Memory (GB)", "Perplexity"], "y": [1000, 50, 15, 5.0], "marker": {"color": "#4263eb"}}, {"type": "bar", "name": "Quantized (INT4)", "x": ["Latency (ms)", "Throughput (tok/s)", "Memory (GB)", "Perplexity"], "y": [500, 100, 8, 5.5], "marker": {"color": "#12b886"}}]}
```

Comparison of performance and quality metrics between the baseline FP16 model and its INT4 quantized version. Lower is better for latency, memory, and perplexity; higher is better for throughput. (Note: values are illustrative.)

## Interpreting the Results

The output table and chart provide a quantitative comparison. You should observe:

- **Latency:** Quantized models typically exhibit lower latency (faster single inference) due to reduced computation and memory-access costs.
- **Throughput:** Higher throughput is expected for the quantized model, meaning it can process more tokens or requests per second. The measurement here is basic (tokens per second over a single long generation); more sophisticated setups use batching or concurrent requests for a more realistic server-side throughput figure.
- **Memory usage:** Peak GPU memory should be significantly lower for the quantized model. This includes both the memory needed to load the weights and the runtime memory for activations. Also compare the model sizes on disk (`du -sh model_directory`).
- **Perplexity:** Perplexity typically increases slightly with quantization, indicating a minor degradation in the model's ability to predict the next token on the evaluation text. The acceptable level of increase depends heavily on the specific application and the quantization method used; a small increase (e.g., less than 1 point) is often acceptable in exchange for significant performance gains.

This practical benchmark provides essential data points. Remember that the results are specific to the hardware used, the chosen model, the quantization technique (e.g., GPTQ, AWQ, bitsandbytes), and the benchmarking setup (batch size, sequence length, dataset). For production scenarios, you would extend this by:

- Benchmarking on the target deployment hardware.
- Evaluating on downstream tasks specific to your application (e.g., summarization ROUGE scores or question-answering accuracy), as sketched after this list.
- Using more sophisticated throughput measurement tools or simulating production load.
- Comparing different quantization methods and bit precisions.

By systematically benchmarking, you can confidently select and deploy quantized models that meet your performance requirements while maintaining acceptable quality.
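As a brief illustration of the downstream-task point above, the sketch below scores generated summaries against references with the `evaluate` library's ROUGE metric (it requires the `rouge_score` package). The `candidate_summaries` and `reference_summaries` lists are hypothetical placeholders; in practice you would generate the candidates with each model variant and compare their scores.

```python
import evaluate

# Hypothetical example data: summaries produced by the model under test and
# the corresponding human-written references.
candidate_summaries = [
    "The report says quantization cuts memory use with little quality loss.",
    "The team released a 4-bit version of the model for cheaper deployment.",
]
reference_summaries = [
    "According to the report, quantization reduces memory usage with minimal quality degradation.",
    "A 4-bit variant of the model was released to lower deployment costs.",
]

rouge = evaluate.load("rouge")  # Requires: pip install rouge_score
scores = rouge.compute(predictions=candidate_summaries, references=reference_summaries)
print(scores)  # e.g., rouge1, rouge2, rougeL, rougeLsum
```

Running the same prompts through both the FP16 and INT4 models and comparing their ROUGE scores gives a task-level view of quality loss that perplexity alone may not capture.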