Theory is essential, but applying principles through practice solidifies understanding. This section provides hands-on exercises where you will take existing Python code snippets commonly found in machine learning workflows and improve them based on the concepts discussed in this chapter: readability, efficiency, and maintainability.
We'll focus on identifying areas for improvement, applying refactoring techniques, optimizing performance-sensitive parts, and structuring the code more logically.
Consider the following function designed to calculate the mean squared error (MSE) between predicted and actual values. While functional, it can be improved for clarity and adherence to standard practices.
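As a reminder, for $n$ samples with true values $y_i$ and predictions $\hat{y}_i$, the MSE is defined as

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$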
Original Code:
# Function to calculate mse
def calculate_error(d1, d2):
    if len(d1) != len(d2):
        print("Error: Input arrays must have the same length.")
        return None
    err = 0
    for i in range(len(d1)):
        err += (d1[i] - d2[i])**2
    mse = err / len(d1)
    return mse
# Example Usage
predictions = [2.1, 3.4, 1.9, 5.0, 4.3]
actuals = [2.5, 3.0, 1.5, 5.5, 4.0]
result = calculate_error(predictions, actuals)
if result is not None:
print(f"Calculated MSE: {result}")
Tasks:
1. Improve Naming and Documentation: Rename the function and its variables (d1, d2, err) to be descriptive. Add type hints and a docstring explaining the function's purpose, arguments, and return value.
2. Use Exceptions for Errors: Instead of printing an error message and returning None, raise a specific exception (like ValueError) for invalid input. This is more idiomatic in Python and allows calling code to handle the error programmatically.
3. Vectorize the Computation: Replace the explicit Python loop with NumPy operations so the calculation runs on whole arrays at once.

Refactored Code:
import numpy as np
from typing import List, Union

def mean_squared_error(y_true: Union[List[float], np.ndarray],
                       y_pred: Union[List[float], np.ndarray]) -> float:
    """
    Calculates the Mean Squared Error (MSE) between true and predicted values.

    Args:
        y_true: Array-like structure of true target values.
        y_pred: Array-like structure of predicted values.

    Returns:
        The calculated Mean Squared Error.

    Raises:
        ValueError: If the input arrays have different shapes.
    """
    # Convert inputs to NumPy arrays for efficient computation and error checking
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)

    if y_true.shape != y_pred.shape:
        raise ValueError("Input arrays must have the same shape.")
    if y_true.ndim == 0:  # Handle scalar inputs gracefully
        raise ValueError("Inputs must be array-like, not scalars.")

    # Calculate MSE using vectorized operations
    mse = np.mean((y_true - y_pred) ** 2)
    return mse
# Example Usage with refactored function
predictions = np.array([2.1, 3.4, 1.9, 5.0, 4.3])
actuals = np.array([2.5, 3.0, 1.5, 5.5, 4.0])

try:
    result = mean_squared_error(actuals, predictions)
    print(f"Calculated MSE (Refactored): {result:.4f}")

    # Example of error handling
    different_length_preds = np.array([1.0, 2.0])
    mean_squared_error(actuals, different_length_preds)
except ValueError as e:
    print(f"Error encountered: {e}")
Discussion:
The refactored version is more robust and efficient:

- Readability: The names (y_true, y_pred, mse) are descriptive. The docstring clearly explains the function, and type hints improve understanding and allow for static analysis.
- Efficiency: Using np.array() and np.mean() leverages NumPy's optimized C implementations, which significantly outperform Python loops for numerical tasks, especially on large arrays.
- Error Handling: Raising ValueError provides a standard way to signal invalid input, making the function easier to integrate into larger systems where errors need to be caught and handled. Converting inputs to NumPy arrays at the start simplifies the core logic.

Imagine you have a Pandas DataFrame and need to apply a conditional transformation: calculate a 'discounted_price' based on the 'category' and 'original_price'. A common, but often inefficient, approach is to iterate over rows.
Original (Inefficient) Code:
import pandas as pd
import time

# Sample DataFrame
data = {
    'product_id': range(10000),
    'category': ['Electronics', 'Clothing', 'Groceries', 'Books'] * 2500,
    'original_price': [100 * (i % 10 + 1) for i in range(10000)]
}
df = pd.DataFrame(data)

# Inefficient approach using iterrows
start_time = time.time()
discounted_prices = []
for index, row in df.iterrows():
    price = row['original_price']
    category = row['category']
    if category == 'Electronics':
        discount = 0.10  # 10% discount
    elif category == 'Clothing':
        discount = 0.15  # 15% discount
    else:
        discount = 0.05  # 5% discount
    discounted_prices.append(price * (1 - discount))
df['discounted_price'] = discounted_prices
end_time = time.time()

print(f"iterrows() duration: {end_time - start_time:.4f} seconds")
print(df.head())

# Cleanup the added column for the next example
df = df.drop(columns=['discounted_price'])
Tasks:
1. Identify the Bottleneck: Recognize that iterrows() is generally slow for operations that can be vectorized or applied more directly.
2. Vectorize the Logic: Rewrite the transformation using the apply() method (used judiciously) or, even better, np.select for conditional logic across columns.

Optimized Code using np.select:
import pandas as pd
import numpy as np
import time

# Sample DataFrame (same as before)
data = {
    'product_id': range(10000),
    'category': ['Electronics', 'Clothing', 'Groceries', 'Books'] * 2500,
    'original_price': [100 * (i % 10 + 1) for i in range(10000)]
}
df = pd.DataFrame(data)

# Optimized approach using np.select for conditional logic
start_time = time.time()

conditions = [
    df['category'] == 'Electronics',
    df['category'] == 'Clothing'
]
# Corresponding discount rates (as multipliers, 1 - discount)
choices = [
    0.90,  # 1 - 0.10
    0.85   # 1 - 0.15
]
# Default discount rate multiplier
default_choice = 0.95  # 1 - 0.05

discount_multiplier = np.select(conditions, choices, default=default_choice)
df['discounted_price'] = df['original_price'] * discount_multiplier
end_time = time.time()

print(f"np.select duration: {end_time - start_time:.4f} seconds")
print(df.head())
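For a simple per-category lookup like this one, mapping a dictionary of multipliers onto the category column is another common vectorized pattern (mentioned in the discussion below). A minimal sketch using the same df; the values mirror the discounts above:

# Map each category to its price multiplier; unmapped categories become NaN,
# so fill them with the default 5% discount multiplier.
multipliers = {'Electronics': 0.90, 'Clothing': 0.85}
df['discounted_price'] = df['original_price'] * df['category'].map(multipliers).fillna(0.95)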
Discussion:
- Performance: The np.select approach performs the conditional logic and multiplication using highly optimized NumPy operations under the hood, applied across the entire Series at once. This avoids the row-by-row overhead of iterrows(), resulting in a significant speedup, especially for larger DataFrames. You'll typically observe that the vectorized approach is orders of magnitude faster.
- Readability: The np.select method clearly expresses the conditions and corresponding outcomes. It separates the logic (conditions, choices) from the application.
- Idiomatic Pandas/NumPy: Vectorized tools like np.select, boolean masking, and map (for simple replacements) are preferred. While apply() can be used, it's often slower than fully vectorized methods like np.select when applicable.

Consider a script that performs several data preparation steps sequentially within the main script body.
Original Code Snippet:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load data
df = pd.read_csv('some_data.csv')  # Assume this file exists with numeric features and a target

# --- Step 1: Handle Missing Values ---
# Simple imputation with mean for numerical columns
numeric_cols = df.select_dtypes(include=np.number).columns.tolist()
if 'target' in numeric_cols: numeric_cols.remove('target')  # Exclude target
for col in numeric_cols:
    if df[col].isnull().any():
        mean_val = df[col].mean()
        df[col].fillna(mean_val, inplace=True)
print("Missing values handled.")

# --- Step 2: Feature Scaling ---
scaler = StandardScaler()
# Avoid scaling the target variable if it's present
features_to_scale = [col for col in numeric_cols if col != 'target']
if features_to_scale:  # Check if there are features to scale
    df[features_to_scale] = scaler.fit_transform(df[features_to_scale])
print("Features scaled.")

# --- Step 3: Split Data ---
X = df[features_to_scale]
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Data split completed.")

# ... subsequent code uses X_train, X_test, y_train, y_test ...
print(f"Training set shape: {X_train.shape}")
Tasks:

1. Modularize: Encapsulate each preparation step (loading, imputation, scaling, splitting) in its own function with clear inputs, outputs, type hints, and docstrings.
2. Structure the Script: Place the overall workflow under an if __name__ == "__main__": block so the script reads as a high-level sequence of steps.
3. Avoid Hidden State: Return objects such as the fitted scaler so they can be reused later, for example on test or new data.
Refactored Code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from typing import Tuple, List

def load_data(filepath: str) -> pd.DataFrame:
    """Loads data from a CSV file."""
    try:
        df = pd.read_csv(filepath)
        print(f"Data loaded successfully from {filepath}")
        return df
    except FileNotFoundError:
        print(f"Error: File not found at {filepath}")
        # Depending on context, might raise error or return empty DataFrame
        raise

def impute_missing_numeric(df: pd.DataFrame, exclude_cols: List[str] = None) -> pd.DataFrame:
    """Imputes missing values in numeric columns using the mean."""
    df_copy = df.copy()  # Work on a copy to avoid modifying original DataFrame unexpectedly
    if exclude_cols is None:
        exclude_cols = []
    numeric_cols = df_copy.select_dtypes(include=np.number).columns.tolist()
    cols_to_impute = [col for col in numeric_cols if col not in exclude_cols]
    for col in cols_to_impute:
        if df_copy[col].isnull().any():
            mean_val = df_copy[col].mean()
            df_copy[col].fillna(mean_val, inplace=True)
            print(f"Imputed missing values in column '{col}' with mean {mean_val:.2f}")
    return df_copy

def scale_features(df: pd.DataFrame, cols_to_scale: List[str]) -> Tuple[pd.DataFrame, StandardScaler]:
    """Scales specified numerical features using StandardScaler."""
    df_copy = df.copy()
    scaler = StandardScaler()
    if cols_to_scale:  # Only scale if columns are provided
        df_copy[cols_to_scale] = scaler.fit_transform(df_copy[cols_to_scale])
        print(f"Scaled columns: {', '.join(cols_to_scale)}")
    else:
        print("No columns specified for scaling.")
    # Return the scaler so it can be used on test data later
    return df_copy, scaler

def split_data(df: pd.DataFrame, feature_cols: List[str], target_col: str, test_size: float = 0.2, random_state: int = 42) -> Tuple:
    """Splits data into training and testing sets."""
    X = df[feature_cols]
    y = df[target_col]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    print(f"Data split into training ({X_train.shape[0]} samples) and testing ({X_test.shape[0]} samples)")
    return X_train, X_test, y_train, y_test

# --- Main Script Workflow ---
if __name__ == "__main__":  # Good practice to put script logic here
    try:
        FILE_PATH = 'some_data.csv'  # Define constants
        TARGET_COLUMN = 'target'

        # Create dummy data if file doesn't exist for demonstration
        try:
            pd.read_csv(FILE_PATH)
        except FileNotFoundError:
            print(f"'{FILE_PATH}' not found. Creating dummy data.")
            dummy_data = pd.DataFrame({
                'feature1': np.random.rand(100) * 10,
                'feature2': np.random.rand(100) * 5 + np.random.choice([np.nan, 1], size=100, p=[0.1, 0.9]),  # Add some NaNs
                'feature3': np.random.randint(0, 5, 100),
                TARGET_COLUMN: np.random.randint(0, 2, 100)
            })
            dummy_data.to_csv(FILE_PATH, index=False)

        raw_df = load_data(FILE_PATH)

        # Identify feature columns dynamically (excluding target)
        all_numeric_cols = raw_df.select_dtypes(include=np.number).columns.tolist()
        feature_columns = [col for col in all_numeric_cols if col != TARGET_COLUMN]

        imputed_df = impute_missing_numeric(raw_df, exclude_cols=[TARGET_COLUMN])
        scaled_df, fitted_scaler = scale_features(imputed_df, feature_columns)
        # Note: For a real scenario, you'd save 'fitted_scaler' and use scaler.transform() on new/test data later.

        X_train, X_test, y_train, y_test = split_data(scaled_df, feature_columns, TARGET_COLUMN)

        print(f"\nTraining Features Shape: {X_train.shape}")
        print(f"Testing Features Shape: {X_test.shape}")
        # ... proceed with model training using these sets ...

    except FileNotFoundError:
        print("Execution halted: Input data file is required.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
Discussion:
- Modularity and Reusability: Functions like impute_missing_numeric or scale_features could potentially be reused in other scripts or parts of the project.
- Readability: The main workflow under if __name__ == "__main__": is much easier to follow. It reads like a high-level description of the data preparation process.
- Testability: Individual functions like impute_missing_numeric can be unit tested with specific inputs to verify their correctness, independent of the rest of the script.
- Maintainability: If the imputation strategy needs to change, you only modify the impute_missing_numeric function. Changes are localized and less likely to break other parts of the code. Returning the fitted_scaler is also important for correctly processing test data later without data leakage.

These exercises demonstrate how applying the principles from this chapter, focusing on readability, leveraging efficient library functions, and structuring code logically, leads to significantly better Python code for machine learning projects. Regularly practice identifying opportunities for refactoring and optimization in your own work.
© 2025 ApX Machine Learning