Now that we have explored the theoretical underpinnings of model pruning and quantization, let's put these techniques into practice. This hands-on session will guide you through applying magnitude-based weight pruning and post-training quantization to a pre-trained Convolutional Neural Network (CNN) using common deep learning framework tools. Our goal is to significantly reduce the model's size while carefully monitoring the impact on its predictive performance.
We assume you have a working environment with Python, TensorFlow (tensorflow), and the TensorFlow Model Optimization Toolkit (tensorflow_model_optimization) installed. You will also need a pre-trained Keras model: a functional tf.keras.Model object, perhaps a MobileNetV2 or a custom CNN trained on a dataset like CIFAR-10, loaded into a variable named original_model. Finally, you will need an evaluation dataset, referred to as eval_dataset, to measure accuracy before and after optimization.
# Example setup (conceptual)
import tensorflow as tf
import tensorflow_model_optimization as tfmot
import numpy as np
# Assume original_model is a pre-trained tf.keras.Model
# Assume eval_dataset is a tf.data.Dataset or similar structure for evaluation
# Example: loading a model (replace with your actual model loading)
# original_model = tf.keras.models.load_model('path/to/your/model.h5')
# Example: Preparing a dummy evaluation dataset (replace with your actual data)
def representative_dataset_gen():
    for _ in range(100):
        # Yield a single-element list containing one sample input, the format
        # the TFLite converter expects. Adjust shape and dtype to your model.
        yield [np.random.rand(1, 96, 96, 3).astype(np.float32)]

# Build the evaluation dataset from plain tensors (unwrap the single-element
# list yielded above so each element matches the declared TensorSpec)
eval_dataset = tf.data.Dataset.from_generator(
    lambda: (sample[0] for sample in representative_dataset_gen()),
    output_signature=tf.TensorSpec(shape=(1, 96, 96, 3), dtype=tf.float32)
)
# Function to evaluate model accuracy (replace with your specific evaluation logic)
def evaluate_model(interpreter, dataset):
    # Placeholder for accuracy evaluation logic.
    # You would typically iterate through the dataset, run inference,
    # compare predictions to ground truth, and calculate accuracy.
    print("Evaluating model... (Replace with actual evaluation)")
    # Example: Running inference on a few samples
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    num_samples = 0
    for sample in dataset.take(5):  # Evaluate on a few samples for demo
        # The interpreter expects NumPy arrays, so convert the eager tensor
        interpreter.set_tensor(input_details[0]['index'], sample.numpy())
        interpreter.invoke()
        output_data = interpreter.get_tensor(output_details[0]['index'])
        # Add your accuracy calculation logic here
        num_samples += 1
    print(f"Evaluated on {num_samples} samples.")
    return np.random.rand()  # Return dummy accuracy
# Function to get model size
import os
def get_gzipped_model_size(file):
    # Returns size of compressed model in MB, approximating deployable size
    import zipfile
    zipped_file = file + '.zip'
    with zipfile.ZipFile(zipped_file, 'w', compression=zipfile.ZIP_DEFLATED) as f:
        f.write(file)
    return os.path.getsize(zipped_file) / float(1024 * 1024)
# Save the original model to measure its size
original_model_file = './original_model.h5'
original_model.save(original_model_file, include_optimizer=False)
print(f"Original model size: {os.path.getsize(original_model_file) / float(1024*1024):.3f} MB")
We'll use magnitude-based pruning, which removes weights with the lowest absolute values. The TensorFlow Model Optimization Toolkit provides wrappers to make models trainable with pruning.
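To build intuition before using the toolkit, here is a rough NumPy sketch of what magnitude-based masking at 50% sparsity looks like. This is a conceptual illustration only, not the toolkit's actual implementation:
# Conceptual illustration: zero out the half of the weights with the smallest magnitude
weights = np.random.randn(4, 4).astype(np.float32)       # stand-in for a layer's weight matrix
threshold = np.percentile(np.abs(weights), 50)            # magnitude cutoff for 50% sparsity
mask = (np.abs(weights) > threshold).astype(np.float32)   # keep only the larger-magnitude weights
pruned_weights = weights * mask
print(f"Sparsity achieved: {1.0 - mask.mean():.2f}")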
Define Pruning Parameters: Specify the target sparsity level (e.g., 50% sparsity means half the weights will be pruned) and the pruning schedule. Polynomial decay is a common schedule.
# Define pruning parameters
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0,
        final_sparsity=0.50,
        begin_step=0,    # Start pruning immediately
        end_step=1000)   # End pruning after 1000 steps
}
Note: end_step should typically correspond to the total number of optimizer steps in your fine-tuning phase (steps per epoch × epochs), as sketched below.
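For example, you might derive end_step from your own training setup; the values below (num_train_samples, batch_size, epochs) are placeholders to substitute with your configuration:
# Hypothetical training configuration; replace with your own values
num_train_samples = 50000                             # e.g., CIFAR-10 training set size
batch_size = 128
epochs = 5
steps_per_epoch = int(np.ceil(num_train_samples / batch_size))
end_step = steps_per_epoch * epochs                   # use this as end_step in PolynomialDecay
print(f"Pruning end_step: {end_step}")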
Apply Pruning Wrapper: Wrap the original model with the pruning configuration.
# Apply pruning wrapper
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(original_model, **pruning_params)
# Pruning requires a step callback during training/fine-tuning
callbacks = [
    tfmot.sparsity.keras.UpdatePruningStep()
]

# Compile the pruned model (use the same optimizer settings as original training/fine-tuning)
pruned_model.compile(optimizer='adam',
                     loss='categorical_crossentropy',  # Adjust loss as needed
                     metrics=['accuracy'])             # Adjust metrics as needed

# Summary shows pruning wrappers around the original layers
# pruned_model.summary()
Fine-tune the Pruned Model: Train the model for a few epochs. During this phase, the pruning schedule actively removes weights, and the remaining weights are adjusted to compensate for the removal, ideally recovering accuracy.
# Fine-tune the model (requires training data)
# Assume train_dataset and validation_dataset are available
# print("Fine-tuning the pruned model...")
# pruned_model.fit(train_dataset,
#                  epochs=5,  # Adjust number of epochs
#                  validation_data=validation_dataset,
#                  callbacks=callbacks)
print("Fine-tuning step simulation (skipping actual training for brevity).")
Note: Effective fine-tuning requires your actual training dataset and appropriate hyperparameters.
Strip Pruning Wrappers: After fine-tuning, remove the pruning wrappers to get back a standard Keras model in which the pruned weights are exactly zero; these zeros are what make the model compress so well.
# Remove pruning wrappers for a standard, smaller model
model_for_export = tfmot.sparsity.keras.strip_pruning(pruned_model)
# Save the pruned model
pruned_model_file = './pruned_model.h5'
model_for_export.save(pruned_model_file, include_optimizer=False)
print(f"Pruned model size (H5): {os.path.getsize(pruned_model_file) / float(1024*1024):.3f} MB")
The raw .h5 file may shrink only modestly, because the zeroed weights are still stored as 32-bit floats. Compressing the model (e.g., with zip or gzip) reveals the real size reduction, since long runs of zero weights compress extremely well.
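Using the get_gzipped_model_size helper defined in the setup, you can make that comparison explicit:
# Compare compressed sizes of the original and pruned models
print(f"Original model (compressed): {get_gzipped_model_size(original_model_file):.3f} MB")
print(f"Pruned model (compressed):   {get_gzipped_model_size(pruned_model_file):.3f} MB")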
Quantization reduces the precision of weights and potentially activations. Post-training quantization is simpler to apply as it doesn't require retraining, though it might lead to a larger accuracy drop compared to quantization-aware training. We'll use TensorFlow Lite's converter.
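For contrast, a quantization-aware training workflow would look roughly like the sketch below. It uses the toolkit's quantize_model helper (which supports Sequential and Functional models) and assumes the same train_dataset and validation_dataset as the pruning fine-tuning step; we do not use it in the rest of this exercise:
# Sketch of the quantization-aware training alternative (not used further here)
q_aware_model = tfmot.quantization.keras.quantize_model(original_model)
q_aware_model.compile(optimizer='adam',
                      loss='categorical_crossentropy',  # Adjust loss as needed
                      metrics=['accuracy'])
# q_aware_model.fit(train_dataset, epochs=2, validation_data=validation_dataset)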
Convert to TensorFlow Lite (FP32): First, convert the pruned Keras model (or the original model if skipping pruning) to the standard TensorFlow Lite format (floating-point).
# Convert the pruned Keras model to TFLite FP32
converter = tf.lite.TFLiteConverter.from_keras_model(model_for_export)
tflite_fp32_model = converter.convert()
# Save the FP32 TFLite model
tflite_fp32_file = './pruned_model_fp32.tflite'
with open(tflite_fp32_file, 'wb') as f:
    f.write(tflite_fp32_model)
print(f"Pruned FP32 TFLite size: {os.path.getsize(tflite_fp32_file) / float(1024*1024):.3f} MB")
print(f"Pruned FP32 TFLite (gzipped): {get_gzipped_model_size(tflite_fp32_file):.3f} MB")
Apply Post-Training Integer Quantization (INT8): Use the TFLite converter again, but this time enable optimizations for INT8 quantization. This requires a representative dataset to calibrate the range of activations.
# Convert using INT8 quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model_for_export)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # Enable quantization optimizations
converter.representative_dataset = representative_dataset_gen  # Calibration data enables full INT8 quantization
# Ensure integer only quantization for compatible hardware
# converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# converter.inference_input_type = tf.int8 # or tf.uint8
# converter.inference_output_type = tf.int8 # or tf.uint8
tflite_int8_model = converter.convert()
# Save the INT8 TFLite model
tflite_int8_file = './pruned_quantized_int8.tflite'
with open(tflite_int8_file, 'wb') as f:
    f.write(tflite_int8_model)
print(f"Pruned INT8 TFLite size: {os.path.getsize(tflite_int8_file) / float(1024*1024):.3f} MB")
print(f"Pruned INT8 TFLite (gzipped): {get_gzipped_model_size(tflite_int8_file):.3f} MB")
You should see a significant size reduction (often around 4x compared to the FP32 TFLite model) because weights are now stored using 8-bit integers instead of 32-bit floats.
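You can check the reduction directly from the files written above:
# Compare on-disk sizes of the FP32 and INT8 TFLite models
fp32_bytes = os.path.getsize(tflite_fp32_file)
int8_bytes = os.path.getsize(tflite_int8_file)
print(f"INT8 model is {fp32_bytes / int8_bytes:.1f}x smaller than the FP32 model")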
It is essential to evaluate the accuracy of the optimized models to understand the trade-offs. You'll need the TensorFlow Lite Interpreter.
# Load the TFLite models and allocate tensors
interpreter_fp32 = tf.lite.Interpreter(model_path=tflite_fp32_file)
interpreter_fp32.allocate_tensors()
interpreter_int8 = tf.lite.Interpreter(model_path=tflite_int8_file)
interpreter_int8.allocate_tensors()
# Evaluate accuracy (using your evaluation dataset and logic)
# Note: Input/output types might change for INT8 models if specified during conversion
# You might need to quantize input data and dequantize output data manually.
print("\nEvaluating FP32 TFLite model:")
accuracy_fp32 = evaluate_model(interpreter_fp32, eval_dataset)
print(f"FP32 TFLite Accuracy: {accuracy_fp32:.4f}")
print("\nEvaluating INT8 TFLite model:")
accuracy_int8 = evaluate_model(interpreter_int8, eval_dataset) # Adjust evaluation for INT8 if needed
print(f"INT8 TFLite Accuracy: {accuracy_int8:.4f}")
# You would also evaluate the original model's accuracy for comparison
# original_accuracy = original_model.evaluate(eval_dataset)[1] # Example Keras evaluation
original_accuracy = np.random.rand() + 0.1 # Dummy original accuracy
print(f"\nOriginal Model Accuracy (Simulated): {original_accuracy:.4f}")
Compare the results: model size (original, pruned H5, FP32 TFLite, INT8 TFLite) and accuracy (original, FP32 TFLite, INT8 TFLite). Typically, you'll observe a large drop in file size at each stage with only a small loss in accuracy, as the charts below summarize.
Chart: Comparison of model file sizes before and after applying pruning and quantization techniques. Note the significant reduction achieved by INT8 quantization.
Chart: Relationship between model accuracy and size for different optimization stages. INT8 quantization provides the smallest size but might incur a higher accuracy penalty.
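One simple way to collect these numbers in one place, using the variables gathered in the snippets above:
# Compact summary of sizes (MB) and accuracies measured above
results = {
    "Original (H5)":        (os.path.getsize(original_model_file) / (1024 * 1024), original_accuracy),
    "Pruned (H5)":          (os.path.getsize(pruned_model_file) / (1024 * 1024), None),
    "Pruned FP32 (TFLite)": (os.path.getsize(tflite_fp32_file) / (1024 * 1024), accuracy_fp32),
    "Pruned INT8 (TFLite)": (os.path.getsize(tflite_int8_file) / (1024 * 1024), accuracy_int8),
}
for name, (size_mb, acc) in results.items():
    acc_str = f"{acc:.4f}" if acc is not None else "n/a"
    print(f"{name:<22} {size_mb:8.3f} MB   accuracy: {acc_str}")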
This practical exercise demonstrates how pruning and quantization can make complex CNN models significantly more efficient. Remember that the best strategy often depends on the specific model, task, and deployment constraints. Experimenting with different sparsity levels, quantization methods (like quantization-aware training), and fine-tuning strategies is often necessary to achieve the optimal balance between efficiency and performance.