Now that you understand why saving a trained model is necessary and have been introduced to serialization with Python's pickle and joblib libraries, let's walk through a practical example. We'll train a very simple machine learning model, save it to a file, and then load it back to make predictions, demonstrating the core workflow for model persistence.
First, ensure you have the necessary libraries installed. We'll primarily use scikit-learn for the model and data generation, and joblib, which is often recommended for scikit-learn objects. While pickle is built-in, joblib is often more efficient for objects containing large NumPy arrays.
If you don't have them installed, you can add them using pip:
pip install scikit-learn joblib
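As a quick sanity check after installing, a minimal sketch like the following prints the installed versions. Recording them is a reasonable habit, since a serialized scikit-learn model is generally expected to be loaded with the same library version that saved it.

```python
import joblib
import sklearn

# Print the installed versions so they can be recorded next to any saved model
print("scikit-learn version:", sklearn.__version__)
print("joblib version:", joblib.__version__)
```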
Let's start by creating and training a basic classification model. We'll use scikit-learn to generate some synthetic data and train a LogisticRegression model.
# Import necessary libraries
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np # We'll need numpy for sample data later
# Generate synthetic classification data
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=5, random_state=42)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and train the Logistic Regression model
model = LogisticRegression(solver='liblinear', random_state=42)
print("Training the model...")
model.fit(X_train, y_train)
print("Model training complete.")
# Optional: Evaluate the trained model (before saving)
y_pred_train = model.predict(X_train)
train_accuracy = accuracy_score(y_train, y_pred_train)
print(f"Accuracy on training data: {train_accuracy:.4f}")
y_pred_test = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred_test)
print(f"Accuracy on test data: {test_accuracy:.4f}")
At this point, the model object exists only in your computer's memory. If the script ends, the trained parameters are lost.
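To make "trained parameters" concrete, the sketch below refits the same kind of model and inspects the attributes that fit() sets. It assumes the same synthetic data settings as the walkthrough but, for brevity, trains on the full dataset rather than a train/test split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Same synthetic data settings as in the walkthrough (no train/test split here)
X, y = make_classification(n_samples=100, n_features=10, n_informative=5,
                           n_redundant=5, random_state=42)

model = LogisticRegression(solver='liblinear', random_state=42)
model.fit(X, y)

# The learned state lives in attributes created by fit();
# this is what disappears when the process exits without saving
print("Coefficient array shape:", model.coef_.shape)
print("Intercept array shape:", model.intercept_.shape)
print("Class labels:", model.classes_)
```

These attributes are exactly what serialization needs to capture so the model can be reused later.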
pickle is Python's standard library for object serialization. It can save almost any Python object.
To save the model, we open a file in binary write mode ('wb') and use pickle.dump().
import pickle
# Define the filename for the saved model
pickle_filename = 'logistic_regression_model.pkl'
print(f"Saving model to {pickle_filename} using pickle...")
# Open the file in binary write mode ('wb')
with open(pickle_filename, 'wb') as file:
    # Use pickle.dump to serialize the model object and write it to the file
    pickle.dump(model, file)
print("Model saved successfully with pickle.")
You should now see a file named logistic_regression_model.pkl in your directory. This file contains the serialized representation of your trained LogisticRegression model.
Now, let's imagine we are in a different script or have restarted our environment. We can load the saved model back into memory using pickle.load(). We need to open the file in binary read mode ('rb'). Only load pickle files from sources you trust, because unpickling can execute arbitrary code.
import pickle
import numpy as np # Ensure numpy is imported if needed for sample data
from sklearn.metrics import accuracy_score # To verify predictions
# Define the filename where the model was saved
pickle_filename = 'logistic_regression_model.pkl'
print(f"Loading model from {pickle_filename} using pickle...")
# Open the file in binary read mode ('rb')
with open(pickle_filename, 'rb') as file:
    # Use pickle.load to deserialize the object from the file
    loaded_model_pickle = pickle.load(file)
print("Model loaded successfully with pickle.")
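# Let's verify the loaded model works
# Note: X_test and y_test come from the training code above; in a truly
# separate script, load or recreate the test data first
# Create some sample data (e.g., the first test sample)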
sample_data = X_test[0].reshape(1, -1) # Reshape for single prediction
expected_prediction = y_test[0]
# Make a prediction using the loaded model
prediction = loaded_model_pickle.predict(sample_data)
print(f"Prediction for sample data using loaded pickle model: {prediction[0]}")
print(f"Expected prediction: {expected_prediction}")
# Optional: Verify accuracy on the test set again
y_pred_loaded = loaded_model_pickle.predict(X_test)
loaded_accuracy = accuracy_score(y_test, y_pred_loaded)
print(f"Accuracy on test data using loaded pickle model: {loaded_accuracy:.4f}")
As you can see, the loaded model behaves exactly like the original one, producing the same predictions and accuracy.
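One way to check this claim directly is to compare the original and restored objects. The sketch below is an illustration rather than part of the walkthrough: it uses pickle.dumps() and pickle.loads() to round-trip the model through bytes in memory instead of a file, and trains on the full synthetic dataset for brevity.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Same synthetic data settings as in the walkthrough (no train/test split here)
X, y = make_classification(n_samples=100, n_features=10, n_informative=5,
                           n_redundant=5, random_state=42)
model = LogisticRegression(solver='liblinear', random_state=42).fit(X, y)

# Serialize to bytes and back, without touching the disk
restored = pickle.loads(pickle.dumps(model))

# The restored object is a separate copy with the same learned state
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
same_weights = np.array_equal(model.coef_, restored.coef_)
print("Identical predictions:", same_predictions)
print("Identical coefficients:", same_weights)
```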
joblib is particularly useful for saving scikit-learn models because it's generally more efficient with objects that contain large NumPy arrays, which are common in machine learning. The syntax is very similar to pickle.
import joblib
# Define the filename for the saved model
joblib_filename = 'logistic_regression_model.joblib'
print(f"Saving model to {joblib_filename} using joblib...")
# Use joblib.dump to serialize and save the model
# joblib handles opening/closing the file efficiently
joblib.dump(model, joblib_filename)
print("Model saved successfully with joblib.")
This creates a file named logistic_regression_model.joblib.
Loading a model saved with joblib is equally straightforward using joblib.load().
import joblib
import numpy as np # Ensure numpy is imported if needed for sample data
from sklearn.metrics import accuracy_score # To verify predictions
# Define the filename where the model was saved
joblib_filename = 'logistic_regression_model.joblib'
print(f"Loading model from {joblib_filename} using joblib...")
# Use joblib.load to deserialize the model
loaded_model_joblib = joblib.load(joblib_filename)
print("Model loaded successfully with joblib.")
# Verify the loaded model
sample_data = X_test[0].reshape(1, -1)
expected_prediction = y_test[0]
prediction = loaded_model_joblib.predict(sample_data)
print(f"Prediction for sample data using loaded joblib model: {prediction[0]}")
print(f"Expected prediction: {expected_prediction}")
# Optional: Verify accuracy on the test set
y_pred_loaded_joblib = loaded_model_joblib.predict(X_test)
loaded_accuracy_joblib = accuracy_score(y_test, y_pred_loaded_joblib)
print(f"Accuracy on test data using loaded joblib model: {loaded_accuracy_joblib:.4f}")
Again, the model loaded using joblib performs identically to the original.
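joblib.dump() also accepts a compress argument, which can shrink files that hold large arrays. The sketch below is an illustration only: the array is hypothetical stand-in data (all zeros, so it compresses unusually well), and the files are written to a temporary directory rather than next to your script.

```python
import os
import tempfile

import joblib
import numpy as np

# Hypothetical stand-in for large model parameters; zeros compress very well
big_array = np.zeros((1000, 1000))

with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = os.path.join(tmp_dir, 'array_raw.joblib')
    compressed_path = os.path.join(tmp_dir, 'array_compressed.joblib')

    joblib.dump(big_array, raw_path)                      # default: no compression
    joblib.dump(big_array, compressed_path, compress=3)   # compression level 3

    raw_size = os.path.getsize(raw_path)
    compressed_size = os.path.getsize(compressed_path)

    # Loading works the same way regardless of compression
    restored = joblib.load(compressed_path)

print(f"Uncompressed file size: {raw_size} bytes")
print(f"Compressed file size:   {compressed_size} bytes")
```

Real model parameters will usually compress far less than this example, and compression adds some time to saving and loading, so whether it is worthwhile depends on your model.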
Often, your model relies on specific data preprocessing steps, like scaling or encoding, that were applied during training. These steps must also be applied to any new data before making predictions. Therefore, you need to save the fitted preprocessor object alongside your model.
Let's extend our example using StandardScaler from scikit-learn.
# Import necessary libraries
from sklearn.preprocessing import StandardScaler
import joblib
# Assume X_train is your original training data
print("Fitting a StandardScaler...")
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit AND transform training data
print("Scaler fitted.")
# Now, train the model on the SCALED data
model_scaled = LogisticRegression(solver='liblinear', random_state=42)
print("Training the model on scaled data...")
model_scaled.fit(X_train_scaled, y_train)
print("Model training complete.")
# --- SAVE BOTH the scaler and the model ---
scaler_filename = 'standard_scaler.joblib'
model_scaled_filename = 'scaled_logistic_regression_model.joblib'
print(f"Saving scaler to {scaler_filename}...")
joblib.dump(scaler, scaler_filename)
print("Scaler saved.")
print(f"Saving model trained on scaled data to {model_scaled_filename}...")
joblib.dump(model_scaled, model_scaled_filename)
print("Scaled model saved.")
# --- LOAD BOTH for prediction ---
print("\nLoading scaler and scaled model for prediction...")
loaded_scaler = joblib.load(scaler_filename)
loaded_scaled_model = joblib.load(model_scaled_filename)
print("Scaler and scaled model loaded.")
# --- Use them together ---
# Take raw test data (e.g., X_test)
# IMPORTANT: Use transform(), NOT fit_transform() on new data
X_test_scaled = loaded_scaler.transform(X_test)
# Make predictions using the loaded model on the scaled test data
y_pred_loaded_scaled = loaded_scaled_model.predict(X_test_scaled)
loaded_scaled_accuracy = accuracy_score(y_test, y_pred_loaded_scaled)
print(f"\nAccuracy on test data using loaded scaler and model: {loaded_scaled_accuracy:.4f}")
In this example, we saved the scaler and the model_scaled as separate files using joblib. When making predictions later, we first load both objects, then use the loaded scaler to transform the new data before feeding it to the loaded model_scaled. Failing to apply the exact same scaling would lead to incorrect predictions.
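To see why the fitted scaler matters, the following sketch compares transform() on a fitted scaler with a fresh fit_transform() on new data. The arrays are randomly generated, hypothetical data chosen so that the new batch has a different mean from the training data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the new batch is drawn with a shifted mean
rng = np.random.default_rng(0)
X_train = rng.normal(loc=10.0, scale=2.0, size=(100, 3))
X_new = rng.normal(loc=12.0, scale=2.0, size=(5, 3))

scaler = StandardScaler().fit(X_train)

# Correct: reuse the mean and standard deviation learned from the training data
correct = scaler.transform(X_new)

# Incorrect: recompute the statistics from the new batch itself
wrong = StandardScaler().fit_transform(X_new)

print("Column means after transform():    ", correct.mean(axis=0))
print("Column means after fit_transform():", wrong.mean(axis=0))
```

Refitting forces the new batch to have zero mean, which erases the shift relative to the training data that the model would otherwise see.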
You could also save related objects (like a model and its scaler) together in a dictionary or list and serialize that single object, although managing separate files is often clearer.
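A minimal sketch of the single-file approach is shown below. The dictionary keys 'scaler' and 'model' and the filename model_bundle.joblib are hypothetical names chosen for illustration, and the file is written to a temporary directory.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Same synthetic data settings as in the walkthrough (no train/test split here)
X, y = make_classification(n_samples=100, n_features=10, n_informative=5,
                           n_redundant=5, random_state=42)

scaler = StandardScaler().fit(X)
model = LogisticRegression(solver='liblinear', random_state=42)
model.fit(scaler.transform(X), y)

# Bundle both fitted objects in one dictionary (key names are illustrative)
bundle = {'scaler': scaler, 'model': model}

with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, 'model_bundle.joblib')
    joblib.dump(bundle, bundle_path)
    loaded = joblib.load(bundle_path)

# Use the loaded pieces together: scale first, then predict
predictions = loaded['model'].predict(loaded['scaler'].transform(X))
print("First five predictions:", predictions[:5])
```

Keeping both objects in one file makes it harder to pair a model with the wrong scaler by accident, at the cost of not being able to update either one independently.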
In this practical exercise, you learned how to train a simple scikit-learn model and how to save and load it using pickle and joblib.
Saving and loading models are fundamental steps in preparing your machine learning work for deployment or sharing. By serializing your models and their necessary components, you ensure they can be reliably used outside the original training environment.
© 2025 ApX Machine Learning