This hands-on exercise applies the techniques discussed earlier in the chapter to detect potential skew between the feature values used during model training (offline) and those observed during online serving. Detecting such discrepancies, where $P_{train}(X) \ne P_{serve}(X)$, is important for maintaining model performance in production. We will use Python libraries such as Pandas, NumPy, and SciPy for the statistical comparisons, and Plotly for visualization.

## Setting Up the Scenario

Imagine we have logged feature data from two sources:

- **Training Dataset Features:** Features generated via a batch process and used to train a model. Let's call this `training_features_df`.
- **Serving Log Features:** Features logged from the online system right before making predictions over a specific period. Let's call this `serving_features_df`.

Our goal is to compare the distributions of specific features between these two datasets to identify potential skew.

## Generating Example Data

First, let's simulate these two datasets using Pandas and NumPy. We'll create two features, `user_age` (numerical) and `product_category` (categorical), and introduce a slight difference in their distributions between the two dataframes.

```python
import pandas as pd
import numpy as np
from scipy import stats

# Simulate Training Data Features
np.random.seed(42)
training_data = {
    'user_age': np.random.normal(loc=35, scale=10, size=1000).astype(int),
    'product_category': np.random.choice(
        ['Electronics', 'Clothing', 'Home Goods', 'Books'],
        size=1000, p=[0.4, 0.3, 0.2, 0.1])
}
training_features_df = pd.DataFrame(training_data)
training_features_df['user_age'] = training_features_df['user_age'].clip(lower=18)  # Ensure age is realistic

# Simulate Serving Log Features (with skew)
serving_data = {
    'user_age': np.random.normal(loc=40, scale=12, size=500).astype(int),  # Different mean and std dev
    'product_category': np.random.choice(
        ['Electronics', 'Clothing', 'Home Goods', 'Books'],
        size=500, p=[0.35, 0.25, 0.25, 0.15])  # Different category distribution
}
serving_features_df = pd.DataFrame(serving_data)
serving_features_df['user_age'] = serving_features_df['user_age'].clip(lower=18)

print("Training Features Sample:")
print(training_features_df.head())
print("\nServing Features Sample:")
print(serving_features_df.head())

print("\nTraining Data Description:")
print(training_features_df.describe(include='all'))
print("\nServing Data Description:")
print(serving_features_df.describe(include='all'))
```

Running this code generates two distinct dataframes. The descriptive statistics already hint at differences, particularly in the mean age and the frequency of the top product categories.
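As a quick first check, we can also put the key summary statistics from both dataframes side by side before running any formal tests. The sketch below is one way to do this; the helper name `summarize_skew` is just an illustrative choice.

```python
import pandas as pd

# A minimal sketch: compare key summary statistics for one numerical
# and one categorical feature across the training and serving dataframes.
def summarize_skew(train_df, serve_df, numerical_col, categorical_col):
    print(f"{numerical_col} mean:  train={train_df[numerical_col].mean():.1f}  "
          f"serve={serve_df[numerical_col].mean():.1f}")
    print(f"{numerical_col} std:   train={train_df[numerical_col].std():.1f}  "
          f"serve={serve_df[numerical_col].std():.1f}")

    # Category proportions in each dataset, aligned on the category labels
    proportions = pd.DataFrame({
        'train': train_df[categorical_col].value_counts(normalize=True),
        'serve': serve_df[categorical_col].value_counts(normalize=True),
    })
    print(proportions.round(3))

summarize_skew(training_features_df, serving_features_df,
               'user_age', 'product_category')
```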
## Comparing Numerical Feature Distributions

For numerical features like `user_age`, a common way to compare distributions is the two-sample Kolmogorov-Smirnov (KS) test. The KS test is non-parametric and checks whether two samples are drawn from the same underlying continuous distribution. It quantifies the maximum distance between the empirical cumulative distribution functions (ECDFs) of the two samples.

The null hypothesis ($H_0$) for the KS test is that the two samples are drawn from the same distribution. A small p-value (typically < 0.05) suggests rejecting $H_0$, indicating a statistically significant difference between the distributions.

```python
# Extract the numerical feature columns
train_age = training_features_df['user_age']
serve_age = serving_features_df['user_age']

# Perform the two-sample KS test
ks_statistic, p_value = stats.ks_2samp(train_age, serve_age)

print("KS Test for user_age:")
print(f"  KS Statistic: {ks_statistic:.4f}")
print(f"  P-value: {p_value:.4f}")

if p_value < 0.05:
    print("  Result: Significant difference detected (Reject H0). Potential skew exists.")
else:
    print("  Result: No significant difference detected (Fail to reject H0).")
```

The output likely shows a very small p-value, confirming a statistically significant difference in the `user_age` distributions between our simulated training and serving datasets.

Let's visualize this difference using histograms.

*Comparison of user age distributions showing a shift towards older ages in the serving data compared to the training data.*

## Comparing Categorical Feature Distributions

For categorical features like `product_category`, we can use the Chi-Squared ($\chi^2$) test of independence. This test helps determine whether there is a significant association between the dataset source (Training vs. Serving) and the distribution of categories. The null hypothesis here is that the category distribution is independent of the data source, so a small p-value again indicates a significant difference.

First, we need to create a contingency table (cross-tabulation) showing the counts of each category in both datasets.

```python
# Create a combined dataframe with a source identifier
training_features_df['source'] = 'Training'
serving_features_df['source'] = 'Serving'
combined_df = pd.concat([training_features_df, serving_features_df], ignore_index=True)

# Create the contingency table
contingency_table = pd.crosstab(combined_df['product_category'], combined_df['source'])
print("Contingency Table for product_category:")
print(contingency_table)

# Perform the Chi-Squared test
chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table)

print("\nChi-Squared Test for product_category:")
print(f"  Chi2 Statistic: {chi2_stat:.4f}")
print(f"  P-value: {p_value:.4f}")
print(f"  Degrees of Freedom: {dof}")

if p_value < 0.05:
    print("  Result: Significant difference detected (Reject H0). Potential skew exists.")
else:
    print("  Result: No significant difference detected (Fail to reject H0).")
```

Again, the small p-value suggests that the distribution of `product_category` differs significantly between the training and serving datasets.
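It can also be useful to inspect the per-source category proportions directly; these are the proportions shown in the figure below. A minimal sketch, reusing `combined_df` from above:

```python
# Normalize the contingency table within each source so that
# each column (Training, Serving) sums to 1.
category_proportions = pd.crosstab(
    combined_df['product_category'],
    combined_df['source'],
    normalize='columns'
)
print(category_proportions.round(3))
```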
Let's visualize the proportions.

*Comparison of product category proportions, highlighting differences in relative frequencies between training and serving data.*

## Interpretation and Action

In this practice, we simulated data with known skew and used statistical tests (the KS test for the numerical feature, the Chi-Squared test for the categorical feature) to detect these differences. The visualizations helped confirm the nature of the skew.

In an MLOps pipeline, these steps would be automated:

1. **Data Collection:** Sample recent serving logs and retrieve the corresponding training data feature statistics (or the full dataset if feasible).
2. **Comparison:** Run statistical tests for the relevant features.
3. **Thresholding:** Compare test statistics or p-values against predefined thresholds. These thresholds are critical and domain-specific: a statistically significant difference (small p-value) might not always be practically significant. You might set thresholds on the KS statistic itself or require a p-value below a stricter alpha (e.g., 0.01).
4. **Alerting/Action:** If thresholds are breached, trigger alerts for investigation. Actions could include retraining the model on newer data, investigating upstream data pipeline issues, or adjusting feature engineering logic. A minimal sketch of such an automated check appears at the end of this section.

This practice provides a basic framework. More sophisticated approaches involve tracking multiple distribution metrics over time, using drift detection algorithms (such as the Drift Detection Method (DDM) or Page-Hinkley), and employing specialized data validation libraries (great_expectations, pandera, evidently.ai, deepchecks) that offer more structured ways to define expectations and detect violations, including skew. These tools often provide richer visualizations and integrations for MLOps workflows.
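As a rough illustration of how the comparison, thresholding, and alerting steps might fit together, here is a minimal sketch; the function name `check_feature_skew` and the threshold values are illustrative choices, not a standard API.

```python
import pandas as pd
from scipy import stats

def check_feature_skew(train_df, serve_df, numerical_cols, categorical_cols,
                       alpha=0.01, ks_threshold=0.1):
    """Run simple skew checks and return a list of alert messages (illustrative sketch)."""
    alerts = []

    # Numerical features: two-sample KS test, plus a threshold on the statistic itself
    for col in numerical_cols:
        ks_stat, p_value = stats.ks_2samp(train_df[col], serve_df[col])
        if p_value < alpha and ks_stat > ks_threshold:
            alerts.append(f"{col}: KS statistic {ks_stat:.3f} (p={p_value:.4f}) exceeds threshold")

    # Categorical features: Chi-Squared test on a train/serve contingency table
    for col in categorical_cols:
        table = pd.crosstab(
            pd.concat([train_df[col], serve_df[col]], ignore_index=True),
            ['Training'] * len(train_df) + ['Serving'] * len(serve_df)
        )
        chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)
        if p_value < alpha:
            alerts.append(f"{col}: Chi2 statistic {chi2_stat:.3f} (p={p_value:.4f}) indicates skew")

    return alerts

alerts = check_feature_skew(training_features_df, serving_features_df,
                            numerical_cols=['user_age'],
                            categorical_cols=['product_category'])
for alert in alerts:
    print("ALERT:", alert)
```

In a scheduled monitoring job, the returned alerts would feed whatever notification or retraining trigger your pipeline already uses.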