Theory is essential, but applying principles through practice solidifies understanding. This section provides hands-on exercises where you will take existing Python code snippets commonly found in machine learning workflows and improve them based on the concepts discussed in this chapter: readability, efficiency, and maintainability.
We'll focus on identifying areas for improvement, applying refactoring techniques, optimizing performance-sensitive parts, and structuring the code more logically.
Consider the following function designed to calculate the mean squared error (MSE) between predicted and actual values. While functional, it can be improved for clarity and adherence to standard practices.
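As a reminder, for $n$ samples with true values $y_i$ and predictions $\hat{y}_i$, the MSE is defined as

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$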
Original Code:
# Function to calculate mse
def calculate_error(d1, d2):
    if len(d1) != len(d2):
        print("Error: Input arrays must have the same length.")
        return None
    err = 0
    for i in range(len(d1)):
        err += (d1[i] - d2[i])**2
    mse = err / len(d1)
    return mse
# Example Usage
predictions = [2.1, 3.4, 1.9, 5.0, 4.3]
actuals = [2.5, 3.0, 1.5, 5.5, 4.0]
result = calculate_error(predictions, actuals)
if result is not None:
print(f"Calculated MSE: {result}")
Tasks:
1. Improve Naming and Documentation: Rename the function and its variables (d1, d2, err) to be descriptive. Add type hints and a docstring explaining the function's purpose, arguments, and return value.
2. Use Exceptions for Errors: Instead of printing an error message and returning None, raise a specific exception (like ValueError) for invalid input. This is more idiomatic in Python and allows calling code to handle the error programmatically.
3. Vectorize the Computation: Replace the explicit Python loop with NumPy operations so the calculation runs on whole arrays at once.

Refactored Code:
import numpy as np
from typing import List, Union

def mean_squared_error(y_true: Union[List[float], np.ndarray],
                       y_pred: Union[List[float], np.ndarray]) -> float:
    """
    Calculates the Mean Squared Error (MSE) between true and predicted values.

    Args:
        y_true: Array-like structure of true target values.
        y_pred: Array-like structure of predicted values.

    Returns:
        The calculated Mean Squared Error.

    Raises:
        ValueError: If the input arrays have different shapes.
    """
    # Convert inputs to NumPy arrays for efficient computation and error checking
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)

    if y_true.shape != y_pred.shape:
        raise ValueError("Input arrays must have the same shape.")
    if y_true.ndim == 0:  # Handle scalar inputs gracefully
        raise ValueError("Inputs must be array-like, not scalars.")

    # Calculate MSE using vectorized operations
    mse = np.mean((y_true - y_pred) ** 2)
    return mse
# Example Usage with refactored function
predictions = np.array([2.1, 3.4, 1.9, 5.0, 4.3])
actuals = np.array([2.5, 3.0, 1.5, 5.5, 4.0])

try:
    result = mean_squared_error(actuals, predictions)
    print(f"Calculated MSE (Refactored): {result:.4f}")

    # Example of error handling
    different_length_preds = np.array([1.0, 2.0])
    mean_squared_error(actuals, different_length_preds)
except ValueError as e:
    print(f"Error encountered: {e}")
Discussion:
The refactored version is more robust and efficient:

- Readability: The names (y_true, y_pred, mse) are descriptive. The docstring clearly explains the function, and type hints improve understanding and allow for static analysis.
- Efficiency: Using np.array() and np.mean() leverages NumPy's optimized C implementations, which significantly outperform Python loops for numerical tasks, especially on large arrays.
- Error Handling: Raising ValueError provides a standard way to signal invalid input, making the function easier to integrate into larger systems where errors need to be caught and handled. Converting inputs to NumPy arrays at the start simplifies the core logic.

Imagine you have a Pandas DataFrame and need to apply a conditional transformation: calculate a 'discounted_price' based on the 'category' and 'original_price'. A common, but often inefficient, approach is to iterate over rows.
Original (Inefficient) Code:
import pandas as pd
import time

# Sample DataFrame
data = {
    'product_id': range(10000),
    'category': ['Electronics', 'Clothing', 'Groceries', 'Books'] * 2500,
    'original_price': [100 * (i % 10 + 1) for i in range(10000)]
}
df = pd.DataFrame(data)

# Inefficient approach using iterrows
start_time = time.time()
discounted_prices = []
for index, row in df.iterrows():
    price = row['original_price']
    category = row['category']
    if category == 'Electronics':
        discount = 0.10  # 10% discount
    elif category == 'Clothing':
        discount = 0.15  # 15% discount
    else:
        discount = 0.05  # 5% discount
    discounted_prices.append(price * (1 - discount))
df['discounted_price'] = discounted_prices
end_time = time.time()

print(f"iterrows() duration: {end_time - start_time:.4f} seconds")
print(df.head())

# Cleanup the added column for the next example
df = df.drop(columns=['discounted_price'])
Tasks:
1. Identify the Bottleneck: Recognize that iterrows() is generally slow for operations that can be vectorized or applied more directly.
2. Vectorize the Logic: Rewrite the transformation using the apply() method (used judiciously) or, even better, np.select for conditional logic across columns.

Optimized Code using np.select:
import pandas as pd
import numpy as np
import time

# Sample DataFrame (same as before)
data = {
    'product_id': range(10000),
    'category': ['Electronics', 'Clothing', 'Groceries', 'Books'] * 2500,
    'original_price': [100 * (i % 10 + 1) for i in range(10000)]
}
df = pd.DataFrame(data)

# Optimized approach using np.select for conditional logic
start_time = time.time()

conditions = [
    df['category'] == 'Electronics',
    df['category'] == 'Clothing'
]
# Corresponding discount rates (as multipliers, 1 - discount)
choices = [
    0.90,  # 1 - 0.10
    0.85   # 1 - 0.15
]
# Default discount rate multiplier
default_choice = 0.95  # 1 - 0.05

discount_multiplier = np.select(conditions, choices, default=default_choice)
df['discounted_price'] = df['original_price'] * discount_multiplier
end_time = time.time()

print(f"np.select duration: {end_time - start_time:.4f} seconds")
print(df.head())
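For a simple per-category lookup like this one, mapping a dictionary of multipliers onto the category column is another common vectorized pattern (mentioned in the discussion below). A minimal sketch using the same df; the values mirror the discounts above:

# Map each category to its price multiplier; unmapped categories become NaN,
# so fill them with the default 5% discount multiplier.
multipliers = {'Electronics': 0.90, 'Clothing': 0.85}
df['discounted_price'] = df['original_price'] * df['category'].map(multipliers).fillna(0.95)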
Discussion:
- Performance: The np.select approach performs the conditional logic and multiplication using highly optimized NumPy operations under the hood, applied across the entire Series at once. This avoids the row-by-row overhead of iterrows(), resulting in a significant speedup, especially for larger DataFrames. You'll typically observe that the vectorized approach is orders of magnitude faster.
- Readability: The np.select method clearly expresses the conditions and corresponding outcomes. It separates the logic (conditions, choices) from the application.
- Idiomatic Pandas/NumPy: Vectorized tools like np.select, boolean masking, and map (for simple replacements) are preferred. While apply() can be used, it's often slower than fully vectorized methods like np.select when applicable.

Consider a script that performs several data preparation steps sequentially within the main script body.
Original Code Snippet:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load data
df = pd.read_csv('some_data.csv')  # Assume this file exists with numeric features and a target

# --- Step 1: Handle Missing Values ---
# Simple imputation with mean for numerical columns
numeric_cols = df.select_dtypes(include=np.number).columns.tolist()
if 'target' in numeric_cols: numeric_cols.remove('target')  # Exclude target
for col in numeric_cols:
    if df[col].isnull().any():
        mean_val = df[col].mean()
        df[col].fillna(mean_val, inplace=True)
print("Missing values handled.")

# --- Step 2: Feature Scaling ---
scaler = StandardScaler()
# Avoid scaling the target variable if it's present
features_to_scale = [col for col in numeric_cols if col != 'target']
if features_to_scale:  # Check if there are features to scale
    df[features_to_scale] = scaler.fit_transform(df[features_to_scale])
print("Features scaled.")

# --- Step 3: Split Data ---
X = df[features_to_scale]
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Data split completed.")

# ... subsequent code uses X_train, X_test, y_train, y_test ...
print(f"Training set shape: {X_train.shape}")
Tasks:

1. Modularize: Encapsulate each preparation step (loading, imputation, scaling, splitting) in its own function with clear inputs, outputs, type hints, and docstrings.
2. Structure the Script: Place the overall workflow under an if __name__ == "__main__": block so the script reads as a high-level sequence of steps.
3. Avoid Hidden State: Return objects such as the fitted scaler so they can be reused later, for example on test or new data.
Refactored Code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from typing import Tuple, List

def load_data(filepath: str) -> pd.DataFrame:
    """Loads data from a CSV file."""
    try:
        df = pd.read_csv(filepath)
        print(f"Data loaded successfully from {filepath}")
        return df
    except FileNotFoundError:
        print(f"Error: File not found at {filepath}")
        # Depending on context, might raise error or return empty DataFrame
        raise

def impute_missing_numeric(df: pd.DataFrame, exclude_cols: List[str] = None) -> pd.DataFrame:
    """Imputes missing values in numeric columns using the mean."""
    df_copy = df.copy()  # Work on a copy to avoid modifying original DataFrame unexpectedly
    if exclude_cols is None:
        exclude_cols = []
    numeric_cols = df_copy.select_dtypes(include=np.number).columns.tolist()
    cols_to_impute = [col for col in numeric_cols if col not in exclude_cols]
    for col in cols_to_impute:
        if df_copy[col].isnull().any():
            mean_val = df_copy[col].mean()
            df_copy[col].fillna(mean_val, inplace=True)
            print(f"Imputed missing values in column '{col}' with mean {mean_val:.2f}")
    return df_copy

def scale_features(df: pd.DataFrame, cols_to_scale: List[str]) -> Tuple[pd.DataFrame, StandardScaler]:
    """Scales specified numerical features using StandardScaler."""
    df_copy = df.copy()
    scaler = StandardScaler()
    if cols_to_scale:  # Only scale if columns are provided
        df_copy[cols_to_scale] = scaler.fit_transform(df_copy[cols_to_scale])
        print(f"Scaled columns: {', '.join(cols_to_scale)}")
    else:
        print("No columns specified for scaling.")
    # Return the scaler so it can be used on test data later
    return df_copy, scaler

def split_data(df: pd.DataFrame, feature_cols: List[str], target_col: str, test_size: float = 0.2, random_state: int = 42) -> Tuple:
    """Splits data into training and testing sets."""
    X = df[feature_cols]
    y = df[target_col]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    print(f"Data split into training ({X_train.shape[0]} samples) and testing ({X_test.shape[0]} samples)")
    return X_train, X_test, y_train, y_test

# --- Main Script Workflow ---
if __name__ == "__main__":  # Good practice to put script logic here
    try:
        FILE_PATH = 'some_data.csv'  # Define constants
        TARGET_COLUMN = 'target'

        # Create dummy data if file doesn't exist for demonstration
        try:
            pd.read_csv(FILE_PATH)
        except FileNotFoundError:
            print(f"'{FILE_PATH}' not found. Creating dummy data.")
            dummy_data = pd.DataFrame({
                'feature1': np.random.rand(100) * 10,
                'feature2': np.random.rand(100) * 5 + np.random.choice([np.nan, 1], size=100, p=[0.1, 0.9]),  # Add some NaNs
                'feature3': np.random.randint(0, 5, 100),
                TARGET_COLUMN: np.random.randint(0, 2, 100)
            })
            dummy_data.to_csv(FILE_PATH, index=False)

        raw_df = load_data(FILE_PATH)

        # Identify feature columns dynamically (excluding target)
        all_numeric_cols = raw_df.select_dtypes(include=np.number).columns.tolist()
        feature_columns = [col for col in all_numeric_cols if col != TARGET_COLUMN]

        imputed_df = impute_missing_numeric(raw_df, exclude_cols=[TARGET_COLUMN])
        scaled_df, fitted_scaler = scale_features(imputed_df, feature_columns)
        # Note: For a real scenario, you'd save 'fitted_scaler' and use scaler.transform() on new/test data later.

        X_train, X_test, y_train, y_test = split_data(scaled_df, feature_columns, TARGET_COLUMN)

        print(f"\nTraining Features Shape: {X_train.shape}")
        print(f"Testing Features Shape: {X_test.shape}")
        # ... proceed with model training using these sets ...

    except FileNotFoundError:
        print("Execution halted: Input data file is required.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
Discussion:
- Modularity and Reusability: Functions like impute_missing_numeric or scale_features could potentially be reused in other scripts or parts of the project.
- Readability: The main workflow under if __name__ == "__main__": is much easier to follow. It reads like a high-level description of the data preparation process.
- Testability: Individual functions like impute_missing_numeric can be unit tested with specific inputs to verify their correctness, independent of the rest of the script.
- Maintainability: If the imputation strategy needs to change, you only modify the impute_missing_numeric function. Changes are localized and less likely to break other parts of the code. Returning the fitted_scaler is also important for correctly processing test data later without data leakage.

These exercises demonstrate how applying the principles from this chapter, focusing on readability, leveraging efficient library functions, and structuring code logically, leads to significantly better Python code for machine learning projects. Regularly practice identifying opportunities for refactoring and optimization in your own work.
© 2025 ApX Machine Learning