While interaction and polynomial features help capture complex relationships between existing features, sometimes transforming a single numerical feature into a categorical one can be beneficial. This process is known as binning or discretization. Binning involves grouping ranges of a continuous numerical variable into distinct intervals, effectively converting it into a categorical feature.
Why would we want to replace a precise numerical value with a broader category? There are several good reasons:

- Capturing non-linearity: a linear model can learn a separate weight for each bin, letting it approximate non-linear effects of the original feature.
- Robustness: extreme values simply fall into the lowest or highest bin rather than pulling on the model disproportionately, and small fluctuations in the raw value no longer matter.
- Interpretability: categories such as age groups or income brackets are often easier to reason about and report than raw numbers.
- Domain alignment: some fields have established thresholds (tax brackets, clinical cutoffs) that carry meaning beyond the data itself.
Let's look at common strategies for binning numerical features.
Fixed-width (equal-width) binning is perhaps the most straightforward approach. We divide the range of the numerical feature (from its minimum to its maximum value) into a predetermined number of bins, each having the same width.
For example, if we have an 'age' feature ranging from 0 to 100 and we want 5 bins, each bin would cover a range of (100−0)/5=20 years. The bins would be [0, 20], (20, 40], (40, 60], (60, 80], and (80, 100].
In Pandas, you can easily achieve this using the pd.cut function.
import pandas as pd
import numpy as np
# Sample data (no random seed set, so your exact values will differ from the output below)
data = {'age': np.random.randint(0, 85, size=100)}
df = pd.DataFrame(data)
# Define bin edges explicitly
bin_edges = [0, 18, 35, 60, 85]
bin_labels = ['0-18', '19-35', '36-60', '61+']
# Apply fixed-width binning using specified edges
df['age_bin_fixed'] = pd.cut(df['age'], bins=bin_edges, labels=bin_labels, right=True, include_lowest=True)
# Alternatively, specify only the number of bins (Pandas calculates equal widths)
df['age_bin_fixed_auto'] = pd.cut(df['age'], bins=4, labels=False) # labels=False returns integer codes for bins
print(df[['age', 'age_bin_fixed', 'age_bin_fixed_auto']].head())
# age age_bin_fixed age_bin_fixed_auto
# 0 78 61+ 3
# 1 12 0-18 0
# 2 45 36-60 2
# 3 68 61+ 3
# 4 29 19-35 1
{"layout": {"title": "Distribution Before and After Fixed-Width Binning", "barmode": "overlay", "xaxis": {"title":"Age"}, "yaxis": {"title":"Count"}}, "data": [{"type": "histogram", "x": df['age'].tolist(), "name": "Original Age", "marker": {"color": "#74c0fc"}, "opacity": 0.7}, {"type": "histogram", "x": df['age_bin_fixed'].astype(str).sort_values().tolist(), "name": "Binned Age", "marker": {"color": "#f76707"}, "opacity": 0.7}]}
Distribution of original 'age' compared to the counts within fixed-width bins. Note how the continuous distribution is grouped into discrete categories.
Pros: Simple to understand and implement.
Cons: Sensitive to outliers. If the data contains extreme values, most observations can cluster into just a few bins while the rest stay sparse or empty, and the method ignores the underlying distribution of the data.
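To see this sensitivity concretely, here is a minimal sketch (with made-up data containing one extreme value) of how a single outlier stretches equal-width bins until most points collapse into the lowest one:

import pandas as pd
import numpy as np

# 99 ordinary values plus one extreme outlier
np.random.seed(0)
values = pd.Series(np.append(np.random.uniform(0, 100, size=99), 10_000))

# Five equal-width bins now span roughly 0 to 10,000, each about 2,000 wide
binned = pd.cut(values, bins=5)
print(binned.value_counts().sort_index())
# All 99 ordinary points land in the first bin; the middle bins are empty,
# and only the outlier occupies the last bin.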
Instead of equal widths, quantile-based binning aims to create bins with approximately the same number of observations. It uses percentiles (quantiles) to define the bin edges. For example, if we want 4 bins (quartiles), the edges would be set at the minimum, the 25th percentile, the 50th percentile (median), the 75th percentile, and the maximum value.
This method is often preferred when the data is skewed, as it ensures that each bin has a reasonable amount of data.
Pandas provides the qcut function for this purpose.
import pandas as pd
import numpy as np
# Sample skewed data (e.g., income)
np.random.seed(42)
income_data = np.random.exponential(scale=20000, size=200) + 15000 # Positively skewed
df_income = pd.DataFrame({'income': income_data})
# Apply quantile-based binning (e.g., into 4 quartiles)
df_income['income_bin_quantile'] = pd.qcut(df_income['income'], q=4, labels=False) # 4 bins
# Can also specify labels
df_income['income_bin_quantile_labeled'] = pd.qcut(df_income['income'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])
print(df_income[['income', 'income_bin_quantile', 'income_bin_quantile_labeled']].head())
# income income_bin_quantile income_bin_quantile_labeled
# 0 32338.414699 1 Medium
# 1 22500.455803 0 Low
# 2 58575.039677 2 High
# 3 84972.969536 3 Very High
# 4 18131.939134 0 Low
# Check value counts to see if they are roughly equal
print("\nCounts per quantile bin:")
print(df_income['income_bin_quantile_labeled'].value_counts())
# Counts per quantile bin:
# Low 50
# Medium 50
# High 50
# Very High 50
# Name: income_bin_quantile_labeled, dtype: int64
{"layout": {"title": "Distribution Before and After Quantile Binning (Skewed Data)", "barmode": "overlay", "xaxis": {"title":"Income"}, "yaxis": {"title":"Count"}}, "data": [{"type": "histogram", "x": df_income['income'].tolist(), "name": "Original Income", "marker": {"color": "#74c0fc"}, "nbinsx": 30, "opacity": 0.7}, {"type": "histogram", "x": df_income['income_bin_quantile_labeled'].astype(str).sort_values().tolist(), "name": "Quantile Bins", "marker": {"color": "#20c997"}, "opacity": 0.7}]}
Original skewed 'income' distribution versus the counts within quantile bins. Notice how quantile binning results in roughly equal counts per category, despite the skewness.
Pros: Handles skewed data well and ensures every bin is populated; often reveals patterns tied to rank or relative position.
Cons: Bin widths can differ greatly, so quite distinct numerical values may end up merged into a single bin. The interpretation of the bins rests on rank rather than absolute value ranges (unless labels reflecting the ranges are manually created), and information about absolute magnitude differences is lost.
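To see how uneven the widths become on this skewed data, you can ask pd.qcut to return the computed edges with retbins=True; a short sketch continuing with the df_income frame from above:

# retbins=True returns the quantile edges alongside the binned series
binned, edges = pd.qcut(df_income['income'], q=4, retbins=True)

# Counts per bin are equal by construction, but the widths are not
print("Bin edges: ", np.round(edges))
print("Bin widths:", np.round(np.diff(edges)))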
Sometimes, the best way to bin a feature is based on external information or domain expertise. Standard age groups (0-17, 18-64, 65+), established income tax brackets, or specific thresholds known to be significant in a field (e.g., clinical measurements) are examples. This often leads to the most interpretable and potentially most predictive bins, but requires knowledge beyond the data itself.
# Example: Manual bins based on common age categories
manual_bins = [0, 17, 64, np.inf] # Using np.inf for the upper bound
manual_labels = ['Child/Teen', 'Adult', 'Senior']
df['age_bin_manual'] = pd.cut(df['age'], bins=manual_bins, labels=manual_labels, right=True, include_lowest=True)  # include_lowest keeps age 0 from becoming NaN
print(df[['age', 'age_bin_manual']].head())
# age age_bin_manual
# 0 78 Senior
# 1 12 Child/Teen
# 2 45 Adult
# 3 68 Senior
# 4 29 Adult
Scikit-learn provides the KBinsDiscretizer transformer, which fits neatly into ML pipelines. It supports different strategies for determining the bins:
- strategy='uniform': equivalent to fixed-width binning (what pd.cut does when you specify only a number of bins).
- strategy='quantile': equivalent to quantile-based binning (pd.qcut).
- strategy='kmeans': uses 1D K-means clustering to find bin edges based on data density.

from sklearn.preprocessing import KBinsDiscretizer
# Reshape data for Scikit-learn (expects 2D array)
age_data = df[['age']].values
# Initialize discretizer (e.g., 5 quantile bins, outputting ordinal integers)
kbd = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile', subsample=None) # subsample=None to avoid warning on newer versions
# Fit and transform
df['age_bin_sklearn'] = kbd.fit_transform(age_data).ravel()  # flatten the (n, 1) output for column assignment
print(df[['age', 'age_bin_sklearn']].head())
# age age_bin_sklearn
# 0 78 4.0
# 1 12 0.0
# 2 45 2.0
# 3 68 4.0
# 4 29 1.0
# You can check the calculated bin edges
print("\nSklearn Bin Edges:", kbd.bin_edges_[0])
# Sklearn Bin Edges: [ 0. 17.4 34.8 51.2 68.6 84. ]
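The strategy='kmeans' option was not used above; as a rough sketch reusing age_data and the import from the previous block, it places edges where 1D K-means finds natural groupings, which can differ noticeably from quantile edges on unevenly distributed data:

# K-means strategy: bin edges follow the density of the data
kbd_kmeans = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='kmeans', subsample=None)
df['age_bin_kmeans'] = kbd_kmeans.fit_transform(age_data).ravel()
print("K-means bin edges:", kbd_kmeans.bin_edges_[0])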
With encode='ordinal', as used above, KBinsDiscretizer outputs integer codes for the bins (0, 1, 2, ...). If your downstream model requires categorical features, you can set encode='onehot' or encode='onehot-dense' to one-hot encode the bins directly, or apply a separate encoder after discretization.
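As a minimal sketch of the one-hot option inside a pipeline, reusing df and age_data from above (the logistic-regression model and the age-above-50 target are purely illustrative):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Discretize, one-hot encode the bins, then fit a linear model on the result
pipe = Pipeline([
    ('bin', KBinsDiscretizer(n_bins=5, encode='onehot-dense', strategy='quantile', subsample=None)),
    ('clf', LogisticRegression()),
])

y = (df['age'] > 50).astype(int)  # hypothetical target, purely for illustration
pipe.fit(age_data, y)
print("One-hot binned shape:", pipe.named_steps['bin'].transform(age_data).shape)
# e.g. One-hot binned shape: (100, 5)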
Binning provides a powerful way to transform numerical features, potentially revealing non-linear patterns or making your model more robust. Like other feature engineering techniques, the best approach (fixed-width, quantile, manual) and the optimal number of bins often depend on the data distribution and the goals of your analysis.