While interaction and polynomial features help capture complex relationships between existing features, sometimes transforming a single numerical feature into a categorical one can be beneficial. This process is known as binning or discretization. Binning involves grouping ranges of a continuous numerical variable into distinct intervals, effectively converting it into a categorical feature.
Why would we want to replace a precise numerical value with a broader category? There are several good reasons:

- Capturing non-linearity: a linear model can learn a separate weight for each bin, letting it approximate non-linear effects of the original feature.
- Robustness: extreme values simply fall into the lowest or highest bin rather than pulling on the model disproportionately, and small fluctuations in the raw value no longer matter.
- Interpretability: categories such as age groups or income brackets are often easier to reason about and report than raw numbers.
- Domain alignment: some fields have established thresholds (tax brackets, clinical cutoffs) that carry meaning beyond the data itself.
Let's look at common strategies for binning numerical features.
Fixed-width (equal-width) binning is perhaps the most straightforward approach. We divide the range of the numerical feature (from its minimum to its maximum value) into a predetermined number of bins, each having the same width.
For example, if we have an 'age' feature ranging from 0 to 100 and we want 5 bins, each bin would cover a range of (100−0)/5=20 years. The bins would be [0, 20], (20, 40], (40, 60], (60, 80], and (80, 100].
In Pandas, you can easily achieve this using the pd.cut function.
import pandas as pd
import numpy as np
# Sample data (no random seed set, so your exact values will differ from the output below)
data = {'age': np.random.randint(0, 85, size=100)}
df = pd.DataFrame(data)
# Define bin edges explicitly
bin_edges = [0, 18, 35, 60, 85]
bin_labels = ['0-18', '19-35', '36-60', '61+']
# Apply fixed-width binning using specified edges
df['age_bin_fixed'] = pd.cut(df['age'], bins=bin_edges, labels=bin_labels, right=True, include_lowest=True)
# Alternatively, specify only the number of bins (Pandas calculates equal widths)
df['age_bin_fixed_auto'] = pd.cut(df['age'], bins=4, labels=False) # labels=False returns integer codes for bins
print(df[['age', 'age_bin_fixed', 'age_bin_fixed_auto']].head())
# age age_bin_fixed age_bin_fixed_auto
# 0 78 61+ 3
# 1 12 0-18 0
# 2 45 36-60 2
# 3 68 61+ 3
# 4 29 19-35 1
{"layout": {"title": "Distribution Before and After Fixed-Width Binning", "barmode": "overlay", "xaxis": {"title":"Age"}, "yaxis": {"title":"Count"}}, "data": [{"type": "histogram", "x": df['age'].tolist(), "name": "Original Age", "marker": {"color": "#74c0fc"}, "opacity": 0.7}, {"type": "histogram", "x": df['age_bin_fixed'].astype(str).sort_values().tolist(), "name": "Binned Age", "marker": {"color": "#f76707"}, "opacity": 0.7}]}
Distribution of original 'age' compared to the counts within fixed-width bins. Note how the continuous distribution is grouped into discrete categories.
Pros: Simple to understand and implement.
Cons: Sensitive to outliers. If the data contains extreme values, most observations can cluster into just a few bins while the rest stay sparse or empty, and the method ignores the underlying distribution of the data.
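To see this sensitivity concretely, here is a minimal sketch (with made-up data containing one extreme value) of how a single outlier stretches equal-width bins until most points collapse into the lowest one:

import pandas as pd
import numpy as np

# 99 ordinary values plus one extreme outlier
np.random.seed(0)
values = pd.Series(np.append(np.random.uniform(0, 100, size=99), 10_000))

# Five equal-width bins now span roughly 0 to 10,000, each about 2,000 wide
binned = pd.cut(values, bins=5)
print(binned.value_counts().sort_index())
# All 99 ordinary points land in the first bin; the middle bins are empty,
# and only the outlier occupies the last bin.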
Instead of equal widths, quantile-based binning aims to create bins with approximately the same number of observations. It uses percentiles (quantiles) to define the bin edges. For example, if we want 4 bins (quartiles), the edges would be set at the minimum, the 25th percentile, the 50th percentile (median), the 75th percentile, and the maximum value.
This method is often preferred when the data is skewed, as it ensures that each bin has a reasonable amount of data.
Pandas provides the qcut function for this purpose.
import pandas as pd
import numpy as np
# Sample skewed data (e.g., income)
np.random.seed(42)
income_data = np.random.exponential(scale=20000, size=200) + 15000 # Positively skewed
df_income = pd.DataFrame({'income': income_data})
# Apply quantile-based binning (e.g., into 4 quartiles)
df_income['income_bin_quantile'] = pd.qcut(df_income['income'], q=4, labels=False) # 4 bins
# Can also specify labels
df_income['income_bin_quantile_labeled'] = pd.qcut(df_income['income'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])
print(df_income[['income', 'income_bin_quantile', 'income_bin_quantile_labeled']].head())
# income income_bin_quantile income_bin_quantile_labeled
# 0 32338.414699 1 Medium
# 1 22500.455803 0 Low
# 2 58575.039677 2 High
# 3 84972.969536 3 Very High
# 4 18131.939134 0 Low
# Check value counts to see if they are roughly equal
print("\nCounts per quantile bin:")
print(df_income['income_bin_quantile_labeled'].value_counts())
# Counts per quantile bin:
# Low 50
# Medium 50
# High 50
# Very High 50
# Name: income_bin_quantile_labeled, dtype: int64
{"layout": {"title": "Distribution Before and After Quantile Binning (Skewed Data)", "barmode": "overlay", "xaxis": {"title":"Income"}, "yaxis": {"title":"Count"}}, "data": [{"type": "histogram", "x": df_income['income'].tolist(), "name": "Original Income", "marker": {"color": "#74c0fc"}, "nbinsx": 30, "opacity": 0.7}, {"type": "histogram", "x": df_income['income_bin_quantile_labeled'].astype(str).sort_values().tolist(), "name": "Quantile Bins", "marker": {"color": "#20c997"}, "opacity": 0.7}]}
Original skewed 'income' distribution versus the counts within quantile bins. Notice how quantile binning results in roughly equal counts per category, despite the skewness.
Pros: Handles skewed data well and ensures every bin is populated; often reveals patterns tied to rank or relative position.
Cons: Bin widths can differ greatly, so quite distinct numerical values may end up merged into a single bin. The interpretation of the bins rests on rank rather than absolute value ranges (unless labels reflecting the ranges are manually created), and information about absolute magnitude differences is lost.
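To see how uneven the widths become on this skewed data, you can ask pd.qcut to return the computed edges with retbins=True; a short sketch continuing with the df_income frame from above:

# retbins=True returns the quantile edges alongside the binned series
binned, edges = pd.qcut(df_income['income'], q=4, retbins=True)

# Counts per bin are equal by construction, but the widths are not
print("Bin edges: ", np.round(edges))
print("Bin widths:", np.round(np.diff(edges)))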
Sometimes, the best way to bin a feature is based on external information or domain expertise. Standard age groups (0-17, 18-64, 65+), established income tax brackets, or specific thresholds known to be significant in a field (e.g., clinical measurements) are examples. This often leads to the most interpretable and potentially most predictive bins, but requires knowledge beyond the data itself.
# Example: Manual bins based on common age categories
manual_bins = [0, 17, 64, np.inf] # Using np.inf for the upper bound
manual_labels = ['Child/Teen', 'Adult', 'Senior']
df['age_bin_manual'] = pd.cut(df['age'], bins=manual_bins, labels=manual_labels, right=True, include_lowest=True)  # include_lowest keeps age 0 from becoming NaN
print(df[['age', 'age_bin_manual']].head())
# age age_bin_manual
# 0 78 Senior
# 1 12 Child/Teen
# 2 45 Adult
# 3 68 Senior
# 4 29 Adult
Scikit-learn provides the KBinsDiscretizer transformer, which fits neatly into ML pipelines. It supports different strategies for determining the bins:
- strategy='uniform': equivalent to fixed-width binning (what pd.cut does when you specify only a number of bins).
- strategy='quantile': equivalent to quantile-based binning (pd.qcut).
- strategy='kmeans': uses 1D K-means clustering to find bin edges based on data density.

from sklearn.preprocessing import KBinsDiscretizer
# Reshape data for Scikit-learn (expects 2D array)
age_data = df[['age']].values
# Initialize discretizer (e.g., 5 quantile bins, outputting ordinal integers)
kbd = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile', subsample=None) # subsample=None to avoid warning on newer versions
# Fit and transform
df['age_bin_sklearn'] = kbd.fit_transform(age_data).ravel()  # flatten the (n, 1) output for column assignment
print(df[['age', 'age_bin_sklearn']].head())
# age age_bin_sklearn
# 0 78 4.0
# 1 12 0.0
# 2 45 2.0
# 3 68 4.0
# 4 29 1.0
# You can check the calculated bin edges
print("\nSklearn Bin Edges:", kbd.bin_edges_[0])
# Sklearn Bin Edges: [ 0. 17.4 34.8 51.2 68.6 84. ]
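The strategy='kmeans' option was not used above; as a rough sketch reusing age_data and the import from the previous block, it places edges where 1D K-means finds natural groupings, which can differ noticeably from quantile edges on unevenly distributed data:

# K-means strategy: bin edges follow the density of the data
kbd_kmeans = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='kmeans', subsample=None)
df['age_bin_kmeans'] = kbd_kmeans.fit_transform(age_data).ravel()
print("K-means bin edges:", kbd_kmeans.bin_edges_[0])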
With encode='ordinal', as used above, KBinsDiscretizer outputs integer codes for the bins (0, 1, 2, ...). If your downstream model requires categorical features, you can set encode='onehot' or encode='onehot-dense' to one-hot encode the bins directly, or apply a separate encoder after discretization.
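As a minimal sketch of the one-hot option inside a pipeline, reusing df and age_data from above (the logistic-regression model and the age-above-50 target are purely illustrative):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Discretize, one-hot encode the bins, then fit a linear model on the result
pipe = Pipeline([
    ('bin', KBinsDiscretizer(n_bins=5, encode='onehot-dense', strategy='quantile', subsample=None)),
    ('clf', LogisticRegression()),
])

y = (df['age'] > 50).astype(int)  # hypothetical target, purely for illustration
pipe.fit(age_data, y)
print("One-hot binned shape:", pipe.named_steps['bin'].transform(age_data).shape)
# e.g. One-hot binned shape: (100, 5)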
Binning provides a powerful way to transform numerical features, potentially revealing non-linear patterns or making your model more robust. Like other feature engineering techniques, the best approach (fixed-width, quantile, manual) and the optimal number of bins often depend on the data distribution and the goals of your analysis.