After splitting your data into training and testing sets, the next significant step is applying preprocessing transformations like feature scaling or encoding categorical variables. A common pitfall here is treating the training and test sets independently during transformation, which can lead to misleading results and poor model generalization. It's essential that any information used to transform the data is learned only from the training set.
Imagine you're preparing for an exam (your test set). If you get to see the exam questions and answers (information from the test set) while studying (training your model and preparing data), your performance on that specific exam might look great, but it won't reflect how well you'd do on a different, unseen exam.
Similarly, if you fit a data transformer, like a StandardScaler, separately on your test set, the scaler learns the mean and standard deviation of the test set. This incorporates information from the test data into your preprocessing pipeline before the model evaluation phase. This is a form of data leakage. The model's performance on this "contaminated" test set can be artificially inflated because the preprocessing steps had access to information they wouldn't have in a real-world scenario where new, unseen data arrives.
Consider StandardScaler, which centers data by subtracting the mean and scales it by dividing by the standard deviation:

x_scaled = (x - μ) / σ

where x is the original feature value, μ is the mean, and σ is the standard deviation. If you calculate μ_train and σ_train from the training set and μ_test and σ_test from the test set, and then scale each set independently:

- X_train_scaled uses μ_train and σ_train.
- X_test_scaled uses μ_test and σ_test.

The test data is now scaled based on its own properties, not based on the properties learned during training. This breaks the assumption that the test set represents unseen data processed in exactly the same way as the training data.
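To see the inconsistency in numbers, here is a small NumPy sketch using made-up feature values (not the dataset from the later example):

import numpy as np

# Made-up feature values for illustration
train_values = np.array([10.0, 15.0, 12.0, 18.0, 20.0, 5.0])
test_values = np.array([25.0, 22.0])

mu_train, sigma_train = train_values.mean(), train_values.std()
mu_test, sigma_test = test_values.mean(), test_values.std()

# Correct: scale the test values with statistics learned from the training values
print((test_values - mu_train) / sigma_train)  # roughly [2.3, 1.7] - clearly above the training mean

# Incorrect: scale the test values with their own statistics
print((test_values - mu_test) / sigma_test)    # [1.0, -1.0] - hides the fact that both values are large

The second result looks "normal" only because the test set was allowed to describe itself; the model never sees numbers on the scale it was trained on.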
The correct methodology ensures that the test set remains truly "unseen" during the fitting process. Scikit-learn transformers are designed with this principle in mind, offering separate methods:

- fit(X_train): This method learns the necessary parameters from the training data only. For StandardScaler, it calculates the mean (μ) and standard deviation (σ). For OneHotEncoder, it determines the unique categories in each feature.
- transform(X): This method applies the transformation using the parameters learned during the fit step. It does not recalculate anything.

The proper workflow is:
1. Split the data into X_train, X_test, y_train, y_test.
2. Instantiate the transformer (e.g., scaler = StandardScaler()).
3. Fit it on the training data only: scaler.fit(X_train).
4. Transform the training data: X_train_transformed = scaler.transform(X_train).
5. Transform the test data with the same fitted transformer: X_test_transformed = scaler.transform(X_test).

Notice that fit is called only once, on X_train. Both X_train and X_test are transformed using the parameters derived solely from X_train.
Scikit-learn also provides a convenience method:

- fit_transform(X_train): This combines steps 3 and 4 (fit and transform) into a single call, but should only be used on the training data.

Let's see this in action:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Sample data
data = {'feature1': [10, 15, 12, 18, 20, 5, 25, 22],
        'feature2': [100, 110, 105, 120, 125, 90, 130, 128]}
df = pd.DataFrame(data)
y = np.array([0, 1, 0, 1, 1, 0, 1, 1]) # Example target variable
# 1. Split data
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.25, random_state=42)
print("Original Training Data:")
print(X_train)
print("\nOriginal Test Data:")
print(X_test)
# 2. Instantiate scaler
scaler = StandardScaler()
# 3. Fit scaler ONLY on training data
scaler.fit(X_train)
# Print learned parameters (mean and scale/std_dev)
print(f"\nScaler Mean (learned from train): {scaler.mean_}")
print(f"Scaler Scale (std dev learned from train): {scaler.scale_}")
# 4. Transform training data
X_train_scaled = scaler.transform(X_train)
print("\nScaled Training Data:")
print(X_train_scaled)
# 5. Transform test data USING THE SAME FIT
X_test_scaled = scaler.transform(X_test)
print("\nScaled Test Data:")
print(X_test_scaled)
# --- Incorrect way: Fitting separately on test data ---
# scaler_test = StandardScaler()
# scaler_test.fit(X_test) # This is WRONG - uses test set info
# X_test_scaled_wrong = scaler_test.transform(X_test)
# print("\nIncorrectly Scaled Test Data (fit on test):")
# print(X_test_scaled_wrong)
# Note how the means and scales would differ if fit separately.
The output clearly shows the scaler learning parameters (μ and σ) from X_train and then applying these exact parameters to standardize both X_train and X_test.
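The fit_transform shortcut mentioned earlier behaves the same way: for StandardScaler, fitting and transforming in one call on the training data yields exactly the result of fit followed by transform. A quick check, reusing the variables from the example above:

# fit_transform combines fit() and transform() on the SAME (training) data
scaler_combined = StandardScaler()
X_train_scaled_combined = scaler_combined.fit_transform(X_train)

# Both approaches learn identical parameters from X_train, so this prints True
print(np.allclose(X_train_scaled, X_train_scaled_combined))

# The test set still gets transform() only - never fit_transform()
X_test_scaled_check = scaler_combined.transform(X_test)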
The following diagram illustrates the correct data flow for applying transformations:
Data is split first. The transformer is fitted only on the training data to learn parameters (like mean μ and standard deviation σ). These learned parameters are then used to transform both the training and the test datasets consistently before model training and evaluation.
Manually managing the fit and transform steps for multiple transformers can become cumbersome and error-prone. This is where Scikit-learn Pipeline objects become invaluable.
As introduced previously, a Pipeline chains multiple steps (transformers and a final estimator). When you call pipeline.fit(X_train, y_train):

- It calls fit_transform on the first step (transformer) using X_train.
- The transformed X_train is passed on to the next step, which is fitted and applied in the same way, and so on through all transformer steps.
- The fully transformed training data is used to fit the estimator (the last step).

Crucially, when you later call pipeline.predict(X_test) or pipeline.score(X_test, y_test):

- It calls transform on the first step using X_test. It uses the parameters learned during the fit phase (on X_train).
- The transformed X_test data is passed to the transform method of the next step, again using parameters learned during fitting.
- The estimator receives the fully transformed X_test data to make predictions or evaluate performance.

The pipeline automatically ensures that transformations are applied consistently, preventing data leakage by strictly separating the fitting process (on training data) from the transformation process applied to both training and test data.
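For a simple two-step pipeline, this is conceptually equivalent to the manual calls sketched below; the sketch assumes a StandardScaler followed by a LogisticRegression and reuses X_train, X_test, and y_train from the earlier split (in practice you would still prefer the pipeline, which keeps the steps bundled and leak-proof):

from sklearn.linear_model import LogisticRegression

# What pipeline.fit(X_train, y_train) does, conceptually:
manual_scaler = StandardScaler()
manual_model = LogisticRegression()
X_train_prepared = manual_scaler.fit_transform(X_train)  # fit + transform on training data only
manual_model.fit(X_train_prepared, y_train)

# What pipeline.predict(X_test) does, conceptually:
X_test_prepared = manual_scaler.transform(X_test)        # reuse parameters learned from X_train
manual_predictions = manual_model.predict(X_test_prepared)

The example below builds a full pipeline that also handles a categorical column with a ColumnTransformer: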
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression # Example estimator
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Sample data with categorical feature
data_cat = {'feature1': [10, 15, 12, 18, 20, 5, 25, 22],
            'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B']}
df_cat = pd.DataFrame(data_cat)
y_cat = np.array([0, 1, 0, 1, 1, 0, 1, 1])
# Split data
X_train_cat, X_test_cat, y_train_cat, y_test_cat = train_test_split(
    df_cat, y_cat, test_size=0.25, random_state=42
)
# Create a preprocessor using ColumnTransformer
# Scale numerical features, one-hot encode categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['feature1']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['category'])  # handle_unknown is important for test set categories not seen in train
    ],
    remainder='passthrough'  # Keep other columns if any (none here)
)
# Create the full pipeline
pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('classifier', LogisticRegression())
])
# Fit the pipeline - Transformers are fit ONLY on X_train_cat here
pipeline.fit(X_train_cat, y_train_cat)
# Predict on test data - Transformers ONLY transform X_test_cat using learned params
predictions = pipeline.predict(X_test_cat)
score = pipeline.score(X_test_cat, y_test_cat)
print("\nPipeline trained successfully.")
print(f"Test set predictions: {predictions}")
print(f"Test set accuracy: {score:.4f}")
# You can inspect the fitted parameters within the pipeline steps
print("\nFitted StandardScaler Mean (from pipeline):")
print(pipeline.named_steps['preprocess'].transformers_[0][1].mean_)
print("\nFitted OneHotEncoder Categories (from pipeline):")
print(pipeline.named_steps['preprocess'].transformers_[1][1].categories_)
This example demonstrates how the Pipeline combined with ColumnTransformer handles different transformations for different column types, while rigorously maintaining the separation between fitting on training data and transforming test data.
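One detail worth highlighting: because the OneHotEncoder above was created with handle_unknown='ignore', a category that appears only in the test set does not cause an error at transform time; it is simply encoded as all zeros. A small standalone sketch of that behavior (the category values here are made up):

# Unseen categories are encoded as all zeros when handle_unknown='ignore'
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(pd.DataFrame({'category': ['A', 'B', 'C']}))  # categories learned from "training" data

print(enc.transform(pd.DataFrame({'category': ['A', 'D']})).toarray())
# [[1. 0. 0.]
#  [0. 0. 0.]]  <- 'D' was never seen during fit, so it becomes an all-zero row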
Applying transformations consistently is not just a best practice; it's fundamental for building machine learning models that you can trust to perform reliably on new, unseen data. Using Scikit-learn's transformer API (fit/transform) correctly, especially encapsulated within Pipelines, is the standard way to achieve this consistency.