Let's put the optimization techniques discussed in this chapter into practice. Feature engineering often involves applying custom transformations to data, and these steps can become significant performance bottlenecks, especially with large datasets. In this hands-on exercise, we will take a typical feature engineering function, profile it to find inefficiencies, and apply optimization techniques like vectorization and Numba to speed it up.
Imagine we need to create new features based on the pairwise interactions between existing numerical features in a dataset. A common interaction feature is the product of two features. For a dataset with features $f_1, f_2, \dots, f_n$, we want to compute $f_i \cdot f_j$ for all pairs $i < j$.
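To make the pair enumeration concrete, here is a minimal sketch using only the standard library. The feature names are hypothetical placeholders; the point is that the pairs with i < j are exactly what `itertools.combinations` yields, and their count is n(n-1)/2.

```python
from itertools import combinations
from math import comb

# Hypothetical feature names, used only for illustration
features = ["f1", "f2", "f3", "f4"]

# All unordered pairs (i < j), in lexicographic order
pairs = list(combinations(features, 2))
print(pairs)

# The number of pairs equals "n choose 2" = n * (n - 1) / 2
print(len(pairs), comb(len(features), 2))
```

With four features this gives six pairs; with the ten features used in this exercise it gives 45.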
Let's start with a reasonably sized dataset simulated using Pandas and NumPy.
import pandas as pd
import numpy as np
import time
import numba
# Generate sample data
num_rows = 100000
num_features = 10
data = pd.DataFrame(np.random.rand(num_rows, num_features),
                    columns=[f'feature_{i}' for i in range(num_features)])
print("Sample Data Head:")
print(data.head())
print(f"\nData Shape: {data.shape}")
A direct way to implement the pairwise interaction calculation is using nested loops over the feature columns. This approach is easy to understand but often performs poorly in Python, especially with libraries like Pandas that are optimized for vectorized operations.
def calculate_interactions_loops(df):
    """Calculates pairwise feature interactions using nested loops."""
    feature_names = df.columns
    num_features = len(feature_names)
    interaction_data = {}
    for i in range(num_features):
        for j in range(i + 1, num_features):
            col1_name = feature_names[i]
            col2_name = feature_names[j]
            interaction_col_name = f'{col1_name}_x_{col2_name}'
            # Perform element-wise multiplication for the pair
            interaction_data[interaction_col_name] = df[col1_name] * df[col2_name]
    return pd.DataFrame(interaction_data)
# --- Benchmarking the Loop Implementation ---
start_time = time.time()
interactions_loops = calculate_interactions_loops(data.copy()) # Use copy to avoid modifying original
end_time = time.time()
loop_time = end_time - start_time
print(f"\n--- Baseline: Nested Loops ---")
print(f"Shape of interaction features: {interactions_loops.shape}")
print(f"Execution time: {loop_time:.4f} seconds")
print("Sample Interaction Features:")
print(interactions_loops.head())
# Expected number of interaction features: nC2 = n * (n-1) / 2
# For 10 features: 10 * 9 / 2 = 45
assert interactions_loops.shape[1] == (num_features * (num_features - 1)) // 2
Running this gives us a baseline execution time for 100,000 rows and 10 features. The Python loops themselves run only 45 times, so most of the cost lies in the repeated Pandas Series multiplications, the column lookups, and the construction of the final DataFrame.
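A single `time.time()` measurement can be noisy. As a hedged alternative, the sketch below times a callable several times with `time.perf_counter()` and keeps the fastest run. The helper name `best_of` is hypothetical and not part of the original exercise, and the toy workload merely stands in for `calculate_interactions_loops(data)`.

```python
import time


def best_of(func, *args, repeats=5):
    """Return the fastest wall-clock time in seconds over several runs (hypothetical helper)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Toy workload standing in for calculate_interactions_loops(data)
elapsed = best_of(sum, range(100_000))
print(f"Best of 5 runs: {elapsed:.6f} seconds")
```

Taking the minimum reduces the influence of background activity on the machine; the same helper could be applied to each implementation in this exercise.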
Before optimizing, let's confirm where the time is spent. We can use %prun in IPython/Jupyter or the cProfile module.
import cProfile
import pstats
# Profile the loop-based function
profiler = cProfile.Profile()
profiler.enable()
_ = calculate_interactions_loops(data.copy()) # Run the function under profiler
profiler.disable()
# Print the stats, sorted by cumulative time
stats = pstats.Stats(profiler).sort_stats('cumulative')
print("\n--- Profiling Results (Top 10 cumulative time) ---")
stats.print_stats(10)
The profiling output will likely highlight that a significant portion of the time is spent within the Pandas __mul__ (multiplication) method called repeatedly inside the loops, along with DataFrame/Series indexing operations. This confirms that the repeated element-wise operations within Python loops are the main target for optimization.
Pandas and NumPy excel at vectorized operations, which perform computations on entire arrays at once at the C level, avoiding Python loop overhead. We can reframe the interaction calculation to leverage this. One way is to use itertools.combinations to get the pairs of column names and then perform the multiplications.
from itertools import combinations
def calculate_interactions_vectorized(df):
    """Calculates pairwise feature interactions using vectorized operations."""
    feature_names = df.columns
    interaction_data = {}
    # Get all combinations of 2 feature names
    for col1_name, col2_name in combinations(feature_names, 2):
        interaction_col_name = f'{col1_name}_x_{col2_name}'
        # Vectorized multiplication of entire columns
        interaction_data[interaction_col_name] = df[col1_name] * df[col2_name]
    return pd.DataFrame(interaction_data)
# --- Benchmarking the Vectorized Implementation ---
start_time = time.time()
interactions_vectorized = calculate_interactions_vectorized(data.copy())
end_time = time.time()
vectorized_time = end_time - start_time
print(f"\n--- Optimization 1: Vectorized Pandas ---")
print(f"Shape of interaction features: {interactions_vectorized.shape}")
print(f"Execution time: {vectorized_time:.4f} seconds")
print(f"Speedup vs Loops: {loop_time / vectorized_time:.2f}x")
# Sanity check results (optional, compare a few values)
# pd.testing.assert_frame_equal(interactions_loops, interactions_vectorized) # Check if results are identical
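Compare this timing with the baseline. Both versions loop over column pairs in Python and multiply entire columns in one Pandas/NumPy operation per pair, so the measured difference may be small; itertools.combinations mainly makes the pairing logic more concise. Vectorization pays off most when it replaces Python loops that run once per row or per element.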
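The per-pair multiplications can also be collapsed into a single NumPy expression with no Python loop over pairs. The following is a minimal sketch, assuming every column is numeric; the function name `calculate_interactions_triu` is hypothetical and not part of the original exercise. It uses `np.triu_indices` to obtain the index pairs with i < j, in the same order as `itertools.combinations`.

```python
import numpy as np
import pandas as pd


def calculate_interactions_triu(df):
    """Pairwise products via integer-array indexing (hypothetical helper)."""
    arr = df.to_numpy()
    # Index pairs (i, j) with i < j, ordered like itertools.combinations
    i_idx, j_idx = np.triu_indices(arr.shape[1], k=1)
    # One element-wise multiply produces every interaction column at once
    products = arr[:, i_idx] * arr[:, j_idx]
    names = [f"{df.columns[i]}_x_{df.columns[j]}" for i, j in zip(i_idx, j_idx)]
    return pd.DataFrame(products, columns=names, index=df.index)


# Tiny demo frame so the result is easy to check by hand
demo = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]})
result = calculate_interactions_triu(demo)
print(result)
```

The trade-off is memory: the two indexed arrays are temporary copies, each as large as the output, so this variant may not suit very wide or very long datasets.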
Sometimes, the logic is complex and cannot be easily vectorized using standard NumPy/Pandas functions. In such cases, Numba can accelerate Python code, especially code involving loops and numerical operations, by compiling it to optimized machine code just-in-time (JIT).
Numba works best on functions that primarily operate on NumPy arrays and use standard Python loops and numerical types. Let's adapt our interaction calculation to work directly with the underlying NumPy array representation of the DataFrame.
@numba.njit
def calculate_interactions_numba_core(data_array):
    """Core Numba-optimized function for pairwise interactions."""
    num_rows, num_features = data_array.shape
    num_interactions = (num_features * (num_features - 1)) // 2
    # Pre-allocate output array for efficiency
    interactions_out = np.empty((num_rows, num_interactions), dtype=data_array.dtype)
    interaction_idx = 0
    # Nested loops are efficient inside Numba
    for i in range(num_features):
        for j in range(i + 1, num_features):
            # Access columns directly from the NumPy array
            col_i = data_array[:, i]
            col_j = data_array[:, j]
            # Perform element-wise multiplication (NumPy operation within Numba)
            interactions_out[:, interaction_idx] = col_i * col_j
            interaction_idx += 1
    return interactions_out
def calculate_interactions_numba(df):
    """Wrapper function to call the Numba-optimized core."""
    feature_names = df.columns
    num_features = len(feature_names)
    # Get NumPy array representation
    data_array = df.to_numpy()
    # Call the JIT-compiled function
    interactions_array = calculate_interactions_numba_core(data_array)
    # Generate column names for the output DataFrame
    interaction_col_names = []
    for i in range(num_features):
        for j in range(i + 1, num_features):
            interaction_col_names.append(f'{feature_names[i]}_x_{feature_names[j]}')
    return pd.DataFrame(interactions_array, columns=interaction_col_names, index=df.index)
# --- Benchmarking the Numba Implementation ---
# First run might include compilation time
_ = calculate_interactions_numba(data.copy())
# Time the second run (post-compilation)
start_time = time.time()
interactions_numba = calculate_interactions_numba(data.copy())
end_time = time.time()
numba_time = end_time - start_time
print(f"\n--- Optimization 2: Numba JIT ---")
print(f"Shape of interaction features: {interactions_numba.shape}")
print(f"Execution time: {numba_time:.4f} seconds")
print(f"Speedup vs Loops: {loop_time / numba_time:.2f}x")
print(f"Speedup vs Vectorized: {vectorized_time / numba_time:.2f}x")
# Sanity check results (optional, may have minor floating point differences)
# np.allclose(interactions_vectorized.to_numpy(), interactions_numba.to_numpy())
Numba often provides significant speedups for loop-heavy numerical code that is hard to vectorize purely with NumPy/Pandas. Notice that we applied the @njit decorator (which implies nopython=True, forcing compilation without falling back to slower object mode) to a core function that works directly on NumPy arrays. The wrapper function handles the conversion from/to Pandas DataFrames. The performance gain here compared to the vectorized Pandas approach might vary depending on the complexity of the operation and the size of the data, but it often surpasses pure Pandas for such explicit looping patterns.
While optimizing for speed, don't forget memory usage. In the Numba version, by pre-allocating the interactions_out NumPy array (np.empty), we control the memory allocation more directly. Working with NumPy arrays can sometimes be more memory-efficient than repeatedly creating Pandas Series, especially if data types are carefully managed (e.g., using np.float32 if precision allows).
Tools like memory_profiler can be used to analyze memory consumption line-by-line if memory becomes a constraint:
# Example using memory_profiler (requires installation: pip install memory_profiler)
# from memory_profiler import profile
#
# @profile  # Add this decorator to the function you want to profile
# def calculate_interactions_vectorized_mem(df):
#     # ... (same implementation as calculate_interactions_vectorized) ...
#     pass
#
# # Run the function to get memory profile output
# if __name__ == '__main__':  # Required for memory_profiler on some systems
#     # interactions_vectorized_mem = calculate_interactions_vectorized_mem(data.copy())
#     pass  # Placeholder for actual run
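The effect of the data type on memory can be checked directly without any extra tooling. This is a small sketch, assuming an output of the same shape as in this exercise (100,000 rows by 45 interaction columns) filled with random values in [0, 1); whether float32 precision is acceptable depends on your downstream model.

```python
import numpy as np

rng = np.random.default_rng(0)
arr64 = rng.random((100_000, 45))   # float64 by default
arr32 = arr64.astype(np.float32)    # half the bytes per element

print(f"float64: {arr64.nbytes / 1e6:.1f} MB")
print(f"float32: {arr32.nbytes / 1e6:.1f} MB")

# Largest absolute rounding error introduced by the down-cast
max_err = np.max(np.abs(arr64 - arr32))
print(f"max abs error: {max_err:.2e}")
```

The `nbytes` attribute reports only the array buffer; a DataFrame built from the array adds index and bookkeeping overhead on top.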
Let's visualize the typical performance differences you might observe.
[Figure: bar chart titled "Feature Engineering Function Execution Time". X-axis: Implementation (Baseline (Loops), Vectorized (Pandas), Numba (JIT)). Y-axis: Time (seconds), log scale.]
Comparison of execution times for different implementations (log scale). Actual times depend on hardware and data size, but the relative differences are illustrative.
This practical exercise demonstrates a common workflow for optimizing Python ML code: establish a baseline, use profiling tools (cProfile, line_profiler, memory_profiler) to identify the actual bottlenecks rather than guessing, apply vectorization or Numba where the profile points, and benchmark again.
Optimizing feature engineering is often critical because it's applied repeatedly during development and potentially on large production datasets. By mastering these Python performance techniques, you can build faster, more scalable machine learning pipelines.
© 2025 ApX Machine Learning