Let's put the theory from this chapter into practice. We will take a standard diffusion model and apply several optimization techniques, measuring the impact on inference speed. We'll focus on common and effective methods like using lower precision (FP16), leveraging optimized runtimes like ONNX Runtime, and adjusting sampler settings.
For this exercise, we'll use the popular `diffusers` library from Hugging Face and a pre-trained Stable Diffusion model. You'll need a Python environment with PyTorch and GPU support (CUDA). Ensure you have the following libraries installed:
```bash
pip install --upgrade diffusers transformers accelerate torch torchvision onnx onnxruntime-gpu "optimum[onnxruntime-gpu]"
```
We assume you are running this in an environment with access to an NVIDIA GPU.
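As a quick sanity check, you can confirm that PyTorch actually sees the GPU before running anything heavy (a short snippet, assuming a recent PyTorch build):

```python
import torch

# Quick environment check: confirms CUDA is available and reports the device name.
if torch.cuda.is_available():
    print(f"CUDA available: {torch.cuda.get_device_name(0)}")
    print(f"PyTorch version: {torch.__version__}")
else:
    print("No CUDA device detected; the examples below will be very slow on CPU.")
```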
First, let's load a standard Stable Diffusion model (e.g., v1.5) and establish a baseline performance measurement. We'll define a simple function to generate an image and time the inference process.
```python
import torch
from diffusers import StableDiffusionPipeline
import time
import numpy as np

# --- Configuration ---
model_id = "runwayml/stable-diffusion-v1-5"
device = "cuda" if torch.cuda.is_available() else "cpu"
prompt = "a photograph of an astronaut riding a horse on the moon"
num_inference_steps = 50
num_runs = 5  # Number of runs for averaging latency

# --- Load Baseline Model (FP32) ---
print("Loading baseline FP32 model...")
pipe_fp32 = StableDiffusionPipeline.from_pretrained(model_id)
pipe_fp32.to(device)
print("Model loaded.")

# --- Define Inference Function ---
def measure_latency(pipe, prompt, num_steps, num_runs):
    latencies = []
    # Warmup run (excluded from timing)
    pipe(prompt, num_inference_steps=num_steps)
    if device == "cuda":
        torch.cuda.synchronize()  # Ensure CUDA operations finish
    for _ in range(num_runs):
        start_time = time.time()
        _ = pipe(prompt, num_inference_steps=num_steps)
        if device == "cuda":
            torch.cuda.synchronize()  # Wait for completion before stopping the timer
        end_time = time.time()
        latencies.append(end_time - start_time)
    avg_latency = np.mean(latencies)
    std_latency = np.std(latencies)
    print(f"Average latency ({num_runs} runs): {avg_latency:.4f} +/- {std_latency:.4f} seconds")
    return avg_latency, std_latency

# --- Measure Baseline Performance ---
print("\n--- Measuring Baseline (FP32) Performance ---")
baseline_latency, _ = measure_latency(pipe_fp32, prompt, num_inference_steps, num_runs)

# Optional: Clear memory if needed
# del pipe_fp32
# torch.cuda.empty_cache()
```
Run this code. It will download the model (if you don't have it cached), load it onto the GPU, perform a warmup run, and then measure the average inference time over several runs. Note down this baseline latency. You can also monitor GPU memory usage with a tool like `nvidia-smi` in a separate terminal.
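If you prefer to capture peak memory from within the script, PyTorch exposes CUDA memory counters. The sketch below (assuming the `pipe_fp32` pipeline from above is still loaded on a CUDA device) records the peak allocated memory around a single generation:

```python
# Measure peak GPU memory for one generation (sketch; assumes pipe_fp32 is on CUDA).
torch.cuda.reset_peak_memory_stats()
_ = pipe_fp32(prompt, num_inference_steps=num_inference_steps)
peak_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
print(f"Peak GPU memory allocated: {peak_mb:.0f} MiB")
```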
One of the most straightforward optimizations is switching to half-precision floating-point numbers (FP16). This reduces the model's memory footprint and often speeds up computation on modern GPUs, usually with minimal impact on generation quality. The `diffusers` library makes this easy.
```python
# --- Load FP16 Model ---
print("\n--- Loading FP16 Model ---")
# Note: if you hit numerical issues, try torch_dtype=torch.bfloat16 on GPUs that support it
pipe_fp16 = StableDiffusionPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    revision="fp16"  # Use the fp16 weights if the repo provides them
                     # (newer diffusers versions use variant="fp16" instead)
)
pipe_fp16.to(device)
# If the model doesn't ship dedicated fp16 weights,
# loading with torch_dtype=torch.float16 still converts it.
print("Model loaded.")

# --- Measure FP16 Performance ---
print("\n--- Measuring FP16 Performance ---")
fp16_latency, _ = measure_latency(pipe_fp16, prompt, num_inference_steps, num_runs)

# Optional: Clear memory
# del pipe_fp16
# torch.cuda.empty_cache()
```
Run this section. Compare the average latency with the FP32 baseline. You should observe a significant speedup and reduced memory usage (check `nvidia-smi` again). The generated images should look very similar to the FP32 output.
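To put a number on the improvement, you can compute the speedup factor directly from the two measurements; a minimal sketch using the `baseline_latency` and `fp16_latency` values recorded above:

```python
# Relative speedup of FP16 over the FP32 baseline.
speedup = baseline_latency / fp16_latency
print(f"FP16 speedup over FP32: {speedup:.2f}x "
      f"({baseline_latency:.2f}s -> {fp16_latency:.2f}s)")
```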
ONNX (Open Neural Network Exchange) provides an interoperable format for models, and ONNX Runtime is a high-performance inference engine that can execute these models, often applying further graph optimizations. The Hugging Face `Optimum` library simplifies exporting `diffusers` models to ONNX and running them.
```python
from optimum.onnxruntime import ORTStableDiffusionPipeline
import os

# --- Export to ONNX and Load with ORT ---
onnx_dir = "./stable_diffusion_onnx"

# Check if the ONNX export already exists to save time
# (save_pretrained writes a model_index.json plus per-component ONNX files)
if not os.path.exists(os.path.join(onnx_dir, "model_index.json")):
    print("\n--- Exporting Model to ONNX ---")
    # Export the FP16 PyTorch model for better ONNX performance
    # You might need to log in with huggingface-cli if you haven't before
    ort_pipe = ORTStableDiffusionPipeline.from_pretrained(
        model_id,
        export=True,
        revision="fp16",
        torch_dtype=torch.float16  # Specify dtype during export
    )
    ort_pipe.save_pretrained(onnx_dir)
    print(f"ONNX model saved to {onnx_dir}")
    # Optional: Clear the PyTorch model used for export
    del ort_pipe
    torch.cuda.empty_cache()
else:
    print("\n--- Found existing ONNX model ---")

# Load the exported ONNX model with ONNX Runtime
print("Loading ONNX model with ONNX Runtime...")
# Ensure the provider is 'CUDAExecutionProvider' for GPU acceleration
ort_pipe_loaded = ORTStableDiffusionPipeline.from_pretrained(
    onnx_dir,
    provider="CUDAExecutionProvider"  # Use 'CPUExecutionProvider' for CPU
)
# No need for .to(device) with ORT pipelines; the provider handles placement
print("ONNX model loaded.")

# --- Measure ONNX Runtime Performance ---
# Note: ORT pipelines mirror the diffusers API, so we reuse measure_latency
print("\n--- Measuring ONNX Runtime Performance ---")
onnx_latency, _ = measure_latency(ort_pipe_loaded, prompt, num_inference_steps, num_runs)

# Optional: Clear memory
# del ort_pipe_loaded
# torch.cuda.empty_cache()
```
This step exports the model components (UNet, VAE, text encoder) to the ONNX format, which can take some time on the first run; `Optimum` handles the process. The exported model is then loaded with `ORTStableDiffusionPipeline`, specifying `CUDAExecutionProvider` for GPU acceleration. Measure the latency again. ONNX Runtime often provides speedups over native PyTorch, especially when combined with hardware-specific execution providers.

Note: ONNX export can sometimes be tricky depending on the model architecture and opset version. `Optimum` greatly simplifies this for many standard models.
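If GPU acceleration doesn't seem to take effect, it helps to check which execution providers your ONNX Runtime build actually exposes; `CUDAExecutionProvider` only appears when the `onnxruntime-gpu` package is installed alongside a compatible CUDA/cuDNN stack:

```python
import onnxruntime as ort

# List the execution providers available in this ONNX Runtime build.
# 'CUDAExecutionProvider' should appear when onnxruntime-gpu is installed correctly.
print(ort.get_available_providers())
```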
Diffusion models rely on iterative sampling. Using more efficient samplers or reducing the number of steps can dramatically decrease latency, although often at the cost of some generation quality or detail. Let's try switching to a faster sampler, `DPMSolverMultistepScheduler`, and reducing the steps. We will use the FP16 pipeline for this comparison, as it's generally faster.
```python
from diffusers import DPMSolverMultistepScheduler

# --- Configure Faster Sampler ---
# Ensure the FP16 pipe is loaded, or reload it
if 'pipe_fp16' not in locals():
    print("Reloading FP16 model for sampler test...")
    pipe_fp16 = StableDiffusionPipeline.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        revision="fp16"
    ).to(device)

print("\n--- Configuring Faster Sampler (DPM-Solver++) ---")
pipe_fp16.scheduler = DPMSolverMultistepScheduler.from_config(pipe_fp16.scheduler.config)
print("Sampler switched to DPMSolverMultistepScheduler.")

# --- Measure Performance with Fewer Steps ---
faster_sampler_steps = 25  # Reduce steps significantly
print(f"\n--- Measuring FP16 + DPM-Solver++ ({faster_sampler_steps} steps) ---")
sampler_latency, _ = measure_latency(pipe_fp16, prompt, faster_sampler_steps, num_runs)

# Compare image quality visually if desired
# image = pipe_fp16(prompt, num_inference_steps=faster_sampler_steps).images[0]
# image.save("optimized_sampler_output.png")
```
Run this section. By switching to `DPMSolverMultistepScheduler` and halving the inference steps (from 50 to 25), you should see a substantial reduction in latency compared to the FP16 run with 50 steps. Examine the output image (`optimized_sampler_output.png` if you uncomment the save lines) and compare its quality to an image generated with the default settings. Solvers like DPM-Solver++ often maintain good quality even with significantly fewer steps.
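For a fair visual comparison, fix the random seed so both runs start from the same initial noise. A minimal sketch (the seed value and output filenames are arbitrary choices for illustration):

```python
# Generate two images from the same seed: 50 steps vs. 25 steps with DPM-Solver++.
# Fixing the generator makes any quality difference attributable to the sampler settings.
generator = torch.Generator(device=device).manual_seed(42)
image_50 = pipe_fp16(prompt, num_inference_steps=50, generator=generator).images[0]

generator = torch.Generator(device=device).manual_seed(42)  # reset to the same seed
image_25 = pipe_fp16(prompt, num_inference_steps=25, generator=generator).images[0]

image_50.save("dpm_50_steps.png")
image_25.save("dpm_25_steps.png")
```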
Let's gather the latency results and visualize them.
````python
# --- Summarize Results ---
results = {
    "Baseline (FP32, 50 steps)": baseline_latency,
    "FP16 (50 steps)": fp16_latency,
    "ONNX Runtime (FP16, 50 steps)": onnx_latency,
    "FP16 + DPM++ (25 steps)": sampler_latency
}

print("\n--- Performance Summary ---")
for name, latency in results.items():
    print(f"{name}: {latency:.4f} seconds")

# Create a Plotly chart
import plotly.graph_objects as go

names = list(results.keys())
latencies = list(results.values())

# Define colors from the palette
colors = ['#4263eb', '#1c7ed6', '#15aabf', '#20c997']  # Blue, Cyan, Teal, Green

fig = go.Figure(data=[go.Bar(
    x=names,
    y=latencies,
    marker_color=colors[:len(names)],  # Use defined colors
    text=[f'{lat:.3f}s' for lat in latencies],
    textposition='auto',
)])

fig.update_layout(
    title_text='Diffusion Model Inference Latency Comparison',
    xaxis_title_text='Optimization Technique',
    yaxis_title_text='Average Latency (seconds)',
    yaxis_range=[0, max(latencies) * 1.1],  # Leave headroom above the tallest bar
    font=dict(family="Arial, sans-serif", size=12),
    plot_bgcolor='#e9ecef',  # Light gray background
    paper_bgcolor='white',
    bargap=0.2
)

# Display the chart (or save to HTML)
# fig.show()

# Generate JSON for web display
chart_json = fig.to_json()
# Ensure it's on a single line for the markdown block
chart_json_single_line = chart_json.replace('\n', '').replace('\r', '')
print("\n--- Chart Data (Plotly JSON) ---")
print(f"```plotly\n{chart_json_single_line}\n```")
````
Average inference latency across different optimization techniques applied to a Stable Diffusion v1.5 model. Lower values indicate better performance. Results are illustrative and depend heavily on hardware and specific model versions.
The chart clearly shows the progressive decrease in latency achieved by applying FP16 precision, using ONNX Runtime, and finally optimizing the sampling process. Remember that these results are specific to the hardware used (GPU type, driver versions) and the model.
This practical exercise demonstrated several effective techniques for optimizing diffusion model inference:

- Reduced precision (FP16): loading the pipeline in half precision to cut memory use and speed up computation.
- Optimized runtimes: exporting the model to ONNX and running it with ONNX Runtime via `Optimum`.
- Sampler adjustments: switching to a faster scheduler (DPM-Solver++) and reducing the number of inference steps.

We didn't cover advanced quantization (like INT8) or deep compiler optimization (like TensorRT) in detail here, as they often require more complex calibration steps or specialized tooling. However, the methods explored provide substantial improvements and form the foundation for optimizing diffusion models for deployment. The best combination of techniques will depend on your specific latency, throughput, cost, and quality requirements. Always benchmark on your target hardware to validate the real-world impact of any optimization.