When you train a machine learning model, you typically perform several data preparation steps first. This might involve handling missing values, scaling numerical features, or encoding categorical features. Your model learns patterns based on the data after these transformations have been applied.
Consider what happens when you want to use this trained model to make predictions on new, unseen data. This new data will arrive in its original, raw format. If you feed this raw data directly into your model, the results will likely be nonsensical or highly inaccurate. Why? Because the model expects data that looks exactly like the transformed data it was trained on.
Therefore, it's not enough to just save the trained model object itself. You also need to save the state of the preprocessing steps you applied during training. This ensures that you can apply the exact same transformations, using the exact same parameters (like the mean and standard deviation learned by a scaler, or the categories learned by an encoder), to any new data before feeding it to the model for prediction.
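To make "the state of the preprocessing steps" concrete, here is a minimal sketch (the training values are illustrative) showing that a scikit-learn StandardScaler only acquires its parameters once it has been fitted, and that those learned values are exactly what must be preserved for prediction time:
import numpy as np
from sklearn.preprocessing import StandardScaler
# Illustrative training feature: house sizes in square feet
X_train = np.array([[1200.0], [1800.0], [2400.0], [3000.0]])
scaler = StandardScaler()
scaler.fit(X_train)
# After fitting, the scaler holds the learned parameters.
print(scaler.mean_)   # mean learned from the training data
print(scaler.scale_)  # standard deviation learned from the training data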
Imagine you trained a model to predict house prices. One of your features is the size of the house in square feet. During training, you used a StandardScaler from scikit-learn to scale this feature, transforming it to have a mean of 0 and a standard deviation of 1.
Now, a new prediction request comes in for a house with a size of 1500 square feet.
Without the saved scaler: you might pass 1500 directly to the model. The model was trained on scaled values (e.g., maybe the scaled equivalent of 1500 sq ft was 0.5) and will interpret 1500 as an extremely large, outlier value, leading to a wildly incorrect price prediction.
With the saved scaler: you load the StandardScaler object you used during training, call its transform method on the new value (1500) to get the correctly scaled value (e.g., 0.5), and feed this scaled value to the model.
The same principle applies to other preprocessing steps like one-hot encoding. If you used OneHotEncoder during training, you must use the same fitted encoder for prediction. This ensures that new data is converted into the same set of binary columns, handling known categories correctly and managing any previously unseen categories according to how the encoder was configured during training.
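As a brief illustration of that last point, here is a minimal sketch (the category values are made up) showing a fitted OneHotEncoder reusing its learned categories on new data, with handle_unknown='ignore' controlling what happens to a category it never saw during training:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
# Illustrative categorical training data
X_train_cat = np.array([['red'], ['green'], ['blue'], ['green']])
# handle_unknown='ignore' tells the encoder to output all zeros
# for categories it did not see during fitting
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(X_train_cat)
# New data containing a known category and an unseen one
X_new_cat = np.array([['red'], ['purple']])
# 'red' maps to its learned column; 'purple' becomes a row of zeros
print(encoder.transform(X_new_cat).toarray())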
Fortunately, the tools we discussed for saving models, pickle and joblib, work just as well for saving these preprocessing objects. Objects like scikit-learn's StandardScaler, MinMaxScaler, OneHotEncoder, or LabelEncoder are Python objects whose state (the learned parameters) can be serialized after fitting them to your training data.
Here's a conceptual example using scikit-learn and joblib:
# Assume 'X_train' is your training data
from sklearn.preprocessing import StandardScaler
import joblib
# 1. Initialize the scaler
scaler = StandardScaler()
# 2. Fit the scaler to the training data (it learns mean and std dev)
scaler.fit(X_train)
# 3. Save the fitted scaler object
joblib.dump(scaler, 'scaler.joblib')
# --- Later, for prediction ---
# Load the saved scaler
loaded_scaler = joblib.load('scaler.joblib')
# Assume 'X_new' is new, raw data
# Apply the *exact same* transformation
X_new_scaled = loaded_scaler.transform(X_new)
# Now X_new_scaled is ready to be fed to the model
# (Assuming the model was also saved and loaded separately)
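If you prefer pickle, mentioned above, the same idea works with the standard library. A sketch, assuming the same fitted scaler object and the same new data:
import pickle
# Save the fitted scaler with pickle instead of joblib
with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)
# --- Later, for prediction ---
with open('scaler.pkl', 'rb') as f:
    loaded_scaler = pickle.load(f)
# Apply the exact same transformation to the new data
X_new_scaled = loaded_scaler.transform(X_new)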
You would follow a similar process for saving your trained model object. You save the fitted scaler and the fitted model, then load both when you need to make predictions.
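Here is a sketch of that two-file workflow, assuming a scikit-learn model such as LogisticRegression was trained on the scaled data (the model choice and file names are illustrative):
from sklearn.linear_model import LogisticRegression
import joblib
# Train the model on the *scaled* training data
model = LogisticRegression()
model.fit(scaler.transform(X_train), y_train)
# Save both fitted objects as separate files
joblib.dump(scaler, 'scaler.joblib')
joblib.dump(model, 'model.joblib')
# --- Later, for prediction ---
loaded_scaler = joblib.load('scaler.joblib')
loaded_model = joblib.load('model.joblib')
# Apply the same transformation, then predict
X_new_scaled = loaded_scaler.transform(X_new)
predictions = loaded_model.predict(X_new_scaled)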
Managing multiple separate files (one for the scaler, one for the encoder, one for the model, etc.) can become cumbersome and error-prone. Did you save the right version of the scaler with the right version of the model?
A more elegant and robust approach is to use scikit-learn's Pipeline object. A pipeline chains together multiple steps (like preprocessing and modeling) into a single object. When you fit the pipeline, it fits all the steps sequentially on the training data. When you predict using the pipeline, it automatically applies all the fitted transformations in the correct order to the input data before passing it to the final estimator (your model) for prediction.
# Assume 'X_train', 'y_train' are your training data and labels
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import joblib
# 1. Define the steps in the pipeline
# ('scaler', StandardScaler()) is a tuple: ('name', object)
# ('classifier', LogisticRegression()) is the final estimator
steps = [
('scaler', StandardScaler()),
('classifier', LogisticRegression())
]
# 2. Create the pipeline
pipeline = Pipeline(steps)
# 3. Fit the entire pipeline
# This fits the scaler on X_train, then transforms X_train,
# and finally fits the LogisticRegression model on the transformed X_train.
pipeline.fit(X_train, y_train)
# 4. Save the *entire* fitted pipeline object
joblib.dump(pipeline, 'full_pipeline.joblib')
# --- Later, for prediction ---
# Load the single pipeline object
loaded_pipeline = joblib.load('full_pipeline.joblib')
# Assume 'X_new' is new, raw data
# Make predictions directly using the pipeline
# It automatically scales X_new using the fitted scaler
# and then predicts using the fitted model.
predictions = loaded_pipeline.predict(X_new)
Here's a simple visualization of how data flows through this pipeline during prediction:
Figure: Data flow during prediction using a loaded scikit-learn Pipeline containing a scaler and a model.
Using pipelines significantly simplifies the deployment process: you only have one object to save, version, and load; the fitted preprocessing steps and the model can never get out of sync; and the transformations are always applied in the correct order before prediction.
When preparing your model for deployment, always consider how you will handle the associated preprocessing steps. Saving individual fitted preprocessors is feasible, but using pipelines is often a cleaner and more reliable strategy, especially as your modeling process becomes more complex.
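For instance, when your data mixes numerical and categorical columns, the same pattern extends naturally. The following sketch (the column names and estimator choices are illustrative assumptions) combines a ColumnTransformer with a model inside one pipeline, so the entire preprocessing-plus-model bundle is still saved and loaded as a single object:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
import joblib
# Illustrative column names; adjust to your dataset
numeric_features = ['size_sqft', 'num_rooms']
categorical_features = ['neighborhood']
# Apply different preprocessing to different column groups
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression()),
])
# Assume 'X_train' is a pandas DataFrame containing the columns above
# and 'y_train' holds the labels
pipeline.fit(X_train, y_train)
# One file captures the scaler, the encoder, and the model together
joblib.dump(pipeline, 'full_pipeline.joblib')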