This hands-on exercise applies techniques discussed earlier in the chapter to detect potential skew between datasets representing feature values used during model training (offline) and those observed during online serving. Detecting such discrepancies, where $P_{\text{train}}(X) \neq P_{\text{serve}}(X)$, is important for maintaining model performance in production. We will use Python libraries like Pandas, NumPy, and SciPy to perform statistical comparisons, and Plotly for visualization.
Imagine we have logged feature data from two sources: `training_features_df` and `serving_features_df`. Our goal is to compare the distributions of specific features between these two datasets to identify potential skew.
First, let's simulate these two datasets using Pandas and NumPy. We'll create two features: `user_age` (numerical) and `product_category` (categorical). We'll introduce a slight difference in the distributions of `user_age` and `product_category` between the two dataframes.
```python
import pandas as pd
import numpy as np
from scipy import stats

# Simulate Training Data Features
np.random.seed(42)
training_data = {
    'user_age': np.random.normal(loc=35, scale=10, size=1000).astype(int),
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Home Goods', 'Books'],
                                         size=1000, p=[0.4, 0.3, 0.2, 0.1])
}
training_features_df = pd.DataFrame(training_data)
training_features_df['user_age'] = training_features_df['user_age'].clip(lower=18)  # Ensure age is realistic

# Simulate Serving Log Features (with skew)
serving_data = {
    'user_age': np.random.normal(loc=40, scale=12, size=500).astype(int),  # Different mean and std dev
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Home Goods', 'Books'],
                                         size=500, p=[0.35, 0.25, 0.25, 0.15])  # Different category distribution
}
serving_features_df = pd.DataFrame(serving_data)
serving_features_df['user_age'] = serving_features_df['user_age'].clip(lower=18)

print("Training Features Sample:")
print(training_features_df.head())
print("\nServing Features Sample:")
print(serving_features_df.head())

print("\nTraining Data Description:")
print(training_features_df.describe(include='all'))
print("\nServing Data Description:")
print(serving_features_df.describe(include='all'))
```
Running this code generates two distinct dataframes. The descriptive statistics already hint at differences, particularly in the mean age and the frequency of top product categories.
For numerical features like `user_age`, a common way to compare distributions is the two-sample Kolmogorov-Smirnov (KS) test. The KS test is non-parametric and checks if two samples are drawn from the same underlying continuous distribution. It quantifies the maximum distance between the empirical cumulative distribution functions (ECDFs) of the two samples.
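Concretely, if $F_{\text{train}}$ and $F_{\text{serve}}$ denote the ECDFs of the two samples (notation introduced here for illustration), the KS statistic is the largest vertical gap between the two curves:

$$D = \sup_x \left| F_{\text{train}}(x) - F_{\text{serve}}(x) \right|$$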
The null hypothesis (H0) for the KS test is that the two samples are drawn from the same distribution. A small p-value (typically < 0.05) suggests rejecting H0, indicating a statistically significant difference between the distributions.
```python
# Extract the numerical feature columns
train_age = training_features_df['user_age']
serve_age = serving_features_df['user_age']

# Perform the two-sample KS test
ks_statistic, p_value = stats.ks_2samp(train_age, serve_age)

print(f"KS Test for user_age:")
print(f"  KS Statistic: {ks_statistic:.4f}")
print(f"  P-value: {p_value:.4f}")

if p_value < 0.05:
    print("  Result: Significant difference detected (Reject H0). Potential skew exists.")
else:
    print("  Result: No significant difference detected (Fail to reject H0).")
```
The output likely shows a very small p-value, confirming a statistically significant difference in the `user_age` distributions between our simulated training and serving datasets.
Let's visualize this difference using histograms.
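A minimal sketch with Plotly (assuming the `plotly` package is installed; `train_age` and `serve_age` are the series extracted earlier): overlaying normalized histograms makes the shift between the two samples easy to see.

```python
import plotly.graph_objects as go

# Overlay the two age distributions; normalizing to probabilities
# keeps the differently sized samples (1000 vs. 500 rows) comparable.
fig = go.Figure()
fig.add_trace(go.Histogram(x=train_age, name='Training', histnorm='probability', opacity=0.6))
fig.add_trace(go.Histogram(x=serve_age, name='Serving', histnorm='probability', opacity=0.6))
fig.update_layout(barmode='overlay', title='user_age: Training vs. Serving',
                  xaxis_title='user_age', yaxis_title='Proportion')
fig.show()
```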
*Comparison of user age distributions showing a shift towards older ages in the serving data compared to the training data.*
For categorical features like `product_category`, we can use the Chi-Squared (χ²) test of independence. This test helps determine if there's a significant association between the dataset source (Training vs. Serving) and the distribution of categories.
First, we need to create a contingency table (cross-tabulation) showing the counts of each category in both datasets.
```python
# Create a combined dataframe with a source identifier
training_features_df['source'] = 'Training'
serving_features_df['source'] = 'Serving'
combined_df = pd.concat([training_features_df, serving_features_df], ignore_index=True)

# Create the contingency table
contingency_table = pd.crosstab(combined_df['product_category'], combined_df['source'])
print("Contingency Table for product_category:")
print(contingency_table)

# Perform the Chi-Squared test
chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table)

print(f"\nChi-Squared Test for product_category:")
print(f"  Chi2 Statistic: {chi2_stat:.4f}")
print(f"  P-value: {p_value:.4f}")
print(f"  Degrees of Freedom: {dof}")

if p_value < 0.05:
    print("  Result: Significant difference detected (Reject H0). Potential skew exists.")
else:
    print("  Result: No significant difference detected (Fail to reject H0).")
```
Again, the small p-value suggests that the distribution of `product_category` is significantly different between the training and serving datasets.
Let's visualize the proportions.
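Again as a sketch with Plotly (reusing `contingency_table` from the chi-squared step), normalizing each column turns raw counts into per-source proportions, which are directly comparable despite the different sample sizes:

```python
import plotly.graph_objects as go

# Column-wise normalization: each source's counts become proportions
proportions = contingency_table / contingency_table.sum(axis=0)

fig = go.Figure()
for source in proportions.columns:
    fig.add_trace(go.Bar(x=proportions.index, y=proportions[source], name=source))
fig.update_layout(barmode='group', title='product_category: Training vs. Serving',
                  xaxis_title='product_category', yaxis_title='Proportion')
fig.show()
```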
*Comparison of product category proportions, highlighting differences in relative frequencies between training and serving data.*
In this practice, we simulated data with known skew and used statistical tests (KS test for numerical, Chi-Squared for categorical) to detect these differences. The visualizations helped confirm the nature of the skew.
In a real-world MLOps pipeline, these steps would be automated: serving features are logged continuously, their statistics are compared against a training-time baseline on a schedule, and alerts are raised whenever a test signals a significant difference. A sketch of such a check follows.
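As an illustrative sketch only (the `detect_skew` function name, the `alpha` threshold, and the feature lists are hypothetical, not part of the original exercise), the per-feature checks above could be wrapped into a single routine that a scheduler runs against each batch of serving logs:

```python
import pandas as pd
from scipy import stats

def detect_skew(train_df, serve_df, numerical_features, categorical_features, alpha=0.05):
    """Flag features whose training and serving distributions differ (illustrative sketch)."""
    alerts = {}
    for feature in numerical_features:
        # Two-sample KS test for numerical features
        _, p_value = stats.ks_2samp(train_df[feature], serve_df[feature])
        if p_value < alpha:
            alerts[feature] = f"KS test p-value {p_value:.4f} < {alpha}"
    for feature in categorical_features:
        # Chi-squared test on a source-by-category contingency table
        table = pd.crosstab(
            pd.concat([train_df[feature], serve_df[feature]], ignore_index=True),
            ['Training'] * len(train_df) + ['Serving'] * len(serve_df)
        )
        _, p_value, _, _ = stats.chi2_contingency(table)
        if p_value < alpha:
            alerts[feature] = f"Chi-squared p-value {p_value:.4f} < {alpha}"
    return alerts

# Example run against the simulated dataframes from above
print(detect_skew(training_features_df, serving_features_df,
                  numerical_features=['user_age'],
                  categorical_features=['product_category']))
```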
This practice provides a basic framework. More sophisticated approaches involve tracking multiple distribution metrics over time, using drift detection algorithms (like the Drift Detection Method, DDM, or Page-Hinkley), and employing specialized data validation libraries (`great_expectations`, `pandera`, `evidently.ai`, `deepchecks`) that offer more structured ways to define expectations and detect violations, including skew. These tools often provide richer visualizations and integrations for MLOps workflows.
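As one example of a distribution metric that is simple to track over time, here is a sketch of the Population Stability Index (PSI); the function and the binning strategy are illustrative choices, not something prescribed by the libraries above.

```python
import numpy as np

def population_stability_index(baseline, current, bins=10):
    """PSI between a baseline sample and a newer sample (illustrative sketch)."""
    # Derive bin edges from the baseline so both samples are bucketed identically;
    # values in `current` outside the baseline range fall out of the histogram.
    edges = np.histogram_bin_edges(baseline, bins=bins)
    baseline_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    current_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Floor the proportions to avoid log(0) and division by zero
    baseline_pct = np.clip(baseline_pct, 1e-6, None)
    current_pct = np.clip(current_pct, 1e-6, None)
    return float(np.sum((current_pct - baseline_pct) * np.log(current_pct / baseline_pct)))

# A common rule of thumb reads PSI above roughly 0.25 as a significant shift
print(f"PSI for user_age: {population_stability_index(train_age, serve_age):.4f}")
```

Computed per feature on each batch of serving logs, a value like this gives a single, threshold-friendly number to alert on.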