You've learned how to prepare categorical data for machine learning models. Now, let's consider numerical features. You might wonder, "Why can't we just feed the raw numerical values directly into our algorithms?" After all, they are already numbers. While some models can handle raw numerical inputs reasonably well, many perform significantly better, or converge much faster, when the numerical features are scaled.
Imagine a dataset containing two features: a person's annual_income (ranging from, say, $20,000 to $200,000) and their years_of_experience (ranging from 0 to 40). If we use these features directly in certain algorithms, the sheer magnitude of the annual_income values could overshadow the contribution of years_of_experience, simply because its numbers are much larger. This isn't necessarily because income is inherently more important, but purely an artifact of the scale. Feature scaling aims to prevent this distortion by bringing all numerical features onto a comparable footing.
Many machine learning algorithms rely on calculating distances between data points. Examples include k-Nearest Neighbors (KNN), K-Means clustering, and Support Vector Machines (SVMs).
Consider the standard Euclidean distance calculation between two points p and q in two dimensions (x and y):
$$\text{Distance}(p, q) = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2}$$
If feature $x$ (e.g., annual_income) has a range of 100,000 while feature $y$ (e.g., years_of_experience) has a range of 40, a difference of 10,000 in income ($p_x - q_x = 10{,}000$) will contribute $(10{,}000)^2 = 100{,}000{,}000$ to the sum inside the square root. A difference of 10 in experience ($p_y - q_y = 10$) contributes only $10^2 = 100$. The income feature completely dominates the distance calculation. Scaling ensures that both features contribute more equitably to the distance measure, reflecting their relative importance more accurately.
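A quick sketch of this effect with NumPy (the two data points and the feature ranges below are made-up values for illustration): the squared income difference swamps the squared experience difference until each feature is divided by its range.

```python
import numpy as np

# Two hypothetical people described by (annual_income, years_of_experience).
p = np.array([80_000.0, 15.0])
q = np.array([70_000.0, 5.0])

# Raw Euclidean distance: the income term dwarfs the experience term.
raw_contributions = (p - q) ** 2                 # [1e8, 100]
raw_distance = np.sqrt(raw_contributions.sum())
print(raw_contributions, raw_distance)           # income contributes ~1,000,000x more

# Divide each feature by its (assumed) range before computing the distance.
feature_range = np.array([180_000.0, 40.0])      # assumed spans: income, experience
scaled_contributions = ((p - q) / feature_range) ** 2
scaled_distance = np.sqrt(scaled_contributions.sum())
print(scaled_contributions, scaled_distance)     # both features now contribute on a comparable scale
```

Dividing by the range here is just a stand-in for the proper scaling techniques (Standardization, Normalization, Robust Scaling) covered in the following sections.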
Algorithms optimized using gradient descent also benefit significantly from feature scaling. This includes linear regression, logistic regression, and neural networks.
Gradient descent updates model parameters (weights) by taking steps proportional to the negative gradient of the cost function. The update rule for a weight $w_j$ often looks like:

$$w_j := w_j - \alpha \frac{\partial J}{\partial w_j}$$

Here, $\alpha$ is the learning rate and $\frac{\partial J}{\partial w_j}$ is the partial derivative of the cost function $J$ with respect to weight $w_j$. If features have vastly different scales, the corresponding gradients can also vary significantly, which forces the learning rate to accommodate the largest gradients and slows progress along the directions associated with smaller ones.
Consider the shape of a hypothetical cost function surface for two parameters. Without scaling, if the features have very different ranges, the contours of the cost surface become elongated, and gradient descent tends to zig-zag across the narrow valley, slowing convergence. With scaling, the contours become more circular, allowing gradient descent to take a more direct path to the minimum.
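The effect on convergence can be sketched in a few lines of NumPy. The data below is synthetic (the income/experience ranges and coefficients are assumptions for illustration), and standardization, covered in the next section, is used as the scaling step. With raw features the learning rate must be kept tiny to avoid divergence, so the loss barely improves; after scaling, a much larger learning rate converges quickly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two features on very different scales (values assumed for illustration).
income = rng.uniform(20_000, 200_000, size=200)      # spans ~180,000
experience = rng.uniform(0, 40, size=200)            # spans ~40
y = 0.00005 * income + 0.8 * experience + rng.normal(0, 1, size=200)

def fit_mse(X, y, lr, steps):
    """Plain batch gradient descent on mean squared error; returns the final MSE."""
    Xb = np.column_stack([np.ones(len(y)), X])        # prepend a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        grad = (2 / len(y)) * Xb.T @ (Xb @ w - y)
        w -= lr * grad
    return np.mean((Xb @ w - y) ** 2)

X_raw = np.column_stack([income, experience])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Raw features: learning rates much above ~1e-11 diverge here, and at that rate
# the bias and experience weights barely move, so the loss stays high.
print("raw features :", fit_mse(X_raw, y, lr=1e-11, steps=2000))

# Standardized features: a far larger learning rate is stable and converges quickly.
print("standardized :", fit_mse(X_std, y, lr=0.1, steps=2000))
```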
Tree-based algorithms, such as Decision Trees, Random Forests, and Gradient Boosting Machines, are generally considered less sensitive to the scale of features. This is because they operate by partitioning the data based on feature thresholds (is feature > threshold?). Since each split compares a single feature against a threshold, rescaling that feature simply shifts the threshold and leaves the resulting partition unchanged; the choice of split is independent of the feature's overall scale.
However, even with tree-based models, scaling might sometimes be indirectly beneficial, for instance when the same preprocessing pipeline also feeds scale-sensitive models (say, a linear model or KNN used alongside the trees for comparison or stacking), where scaling keeps the workflow consistent without harming the trees.
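A small check of this scale-invariance, sketched with scikit-learn's DecisionTreeClassifier (the synthetic features and labeling rule are assumptions for illustration): the tree learns equivalent splits whether it sees raw or standardized values.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical features on very different scales and an arbitrary labeling rule.
income = rng.uniform(20_000, 200_000, size=300)
experience = rng.uniform(0, 40, size=300)
X = np.column_stack([income, experience])
y = (0.00005 * income + 0.8 * experience > 21).astype(int)

# Fit identical trees on raw and on standardized copies of the features.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_std = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_std, y)

# Standardizing is monotonic, so the thresholds shift but the partitions match:
# predictions agree on every training sample (expected output: True).
print(np.array_equal(tree_raw.predict(X), tree_std.predict(X_std)))
```

Random Forests and Gradient Boosting Machines build on the same threshold-based splits, so the conclusion carries over to them as well.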
Feature scaling is a fundamental step in preparing numerical data for many machine learning algorithms. It addresses the issue where features with larger value ranges might disproportionately influence model training, not due to their inherent importance, but simply due to their scale. By bringing features to a similar scale, we help distance-based algorithms make more meaningful comparisons and enable gradient descent-based algorithms to converge faster and more reliably. The following sections will explore specific techniques like Standardization, Normalization, and Robust Scaling to achieve this.