As mentioned in the chapter introduction, the data you feed into machine learning algorithms often needs some adjustment. Think about a dataset containing information about houses: one feature might be the number of bedrooms (typically ranging from 1 to 5), while another might be the house price (ranging from $50,000 to $1,000,000 or more). These features operate on vastly different scales.
Many machine learning algorithms, especially those that rely on calculating distances between data points (like K-Nearest Neighbors) or use gradient descent for optimization (like Linear Regression, Logistic Regression, and Neural Networks), perform much better, or converge faster, when numerical input features are on a similar scale.
Imagine calculating the distance between two houses using the raw features mentioned above. A difference of 2 bedrooms contributes a small amount to the overall distance calculation compared to a difference of $100,000 in price. The price feature, simply because its values are numerically larger, will dominate the distance calculation. This can lead the algorithm to incorrectly perceive price as being much more important than the number of bedrooms, just because of the difference in scale.
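To see this numerically, here is a small sketch (the house values below are made up purely for illustration, and NumPy is assumed to be available) comparing the Euclidean distance between two houses on the raw features versus features rescaled to [0, 1]:

```python
import numpy as np

# Two hypothetical houses: [bedrooms, price]
house_a = np.array([2, 150_000])
house_b = np.array([4, 250_000])

# On raw features, the distance is dominated almost entirely by the price difference
raw_distance = np.linalg.norm(house_a - house_b)
print(raw_distance)  # ~100000.0 -- the 2-bedroom difference is effectively invisible

# After rescaling each feature to [0, 1] (bedrooms 1-5, prices 50k-1M), both features contribute
scaled_a = np.array([(2 - 1) / (5 - 1), (150_000 - 50_000) / (1_000_000 - 50_000)])
scaled_b = np.array([(4 - 1) / (5 - 1), (250_000 - 50_000) / (1_000_000 - 50_000)])
print(np.linalg.norm(scaled_a - scaled_b))  # ~0.51 -- the bedroom difference now matters
```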
Feature scaling is the process of transforming numerical features to a common scale, without changing the underlying distribution or relationships within the data significantly. It ensures that all features contribute more equally to the learning process.
There are two very common techniques for feature scaling: Normalization and Standardization.
Normalization rescales the values of a feature to a fixed range, typically between 0 and 1. It's calculated by subtracting the minimum value of the feature from each data point and then dividing by the range (maximum value minus minimum value).
The formula is:
x′ = (x − min(x)) / (max(x) − min(x))

where x is an original value of the feature, min(x) and max(x) are the feature's minimum and maximum values, and x′ is the resulting normalized value.
After normalization, the minimum value of the feature becomes 0, and the maximum value becomes 1. All other values fall proportionally within this range.
Example: Consider an 'Age' feature with values [20, 25, 40, 60].
Applying the formula: min(x) = 20 and max(x) = 60, so the range is 40. The normalized values are [0.0, 0.125, 0.5, 1.0].
Pros: Guarantees that feature values will be within the [0, 1] range. This can be useful for certain algorithms, especially in image processing or neural networks that expect inputs in this range.
Cons: Normalization is quite sensitive to outliers. If you have a single very high or very low value, it can squash most of the other data points into a very small part of the [0, 1] range, potentially losing some information about their relative differences.
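As a concrete illustration, here is a minimal sketch (assuming NumPy and scikit-learn are installed) that normalizes the 'Age' values from the example above, both manually and with scikit-learn's MinMaxScaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([[20.0], [25.0], [40.0], [60.0]])  # one feature, four samples

# Manual normalization: (x - min) / (max - min)
manual = (ages - ages.min()) / (ages.max() - ages.min())

# Equivalent result using scikit-learn (default feature_range is (0, 1))
scaler = MinMaxScaler()
scaled = scaler.fit_transform(ages)

print(manual.ravel())  # [0.    0.125 0.5   1.   ]
print(scaled.ravel())  # [0.    0.125 0.5   1.   ]
```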
Standardization rescales features so that they have a mean (μ) of 0 and a standard deviation (σ) of 1. It subtracts the mean of the feature from each data point and then divides by the standard deviation.
The formula is:
x′ = (x − μ) / σ

where x is an original value of the feature, μ is the feature's mean, σ is its standard deviation, and x′ is the resulting standardized value.
The resulting standardized values (often called Z-scores) represent how many standard deviations away from the mean the original value was. Unlike normalization, standardization does not bound values to a specific range (like [0, 1]), although values far from 0 will be less common if the original data follows a somewhat normal distribution.
Example: Using the same 'Age' feature [20, 25, 40, 60].
Applying the formula: the mean μ is 36.25 and the (population) standard deviation σ is approximately 15.56, so the standardized values are approximately [-1.04, -0.72, 0.24, 1.53].
Pros: Standardization is much less affected by outliers than normalization. If you have outliers, standardization will generally perform better because the mean and standard deviation are less influenced by extreme values compared to the min and max. It's also preferred by algorithms that assume data is centered around zero.
Cons: The resulting values are not restricted to a specific range, which might be a requirement for some specific algorithms (though this is less common).
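Here is the matching sketch for standardization (again assuming NumPy and scikit-learn), applied to the same 'Age' values, both manually and with scikit-learn's StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([[20.0], [25.0], [40.0], [60.0]])

# Manual standardization: (x - mean) / std (population standard deviation)
manual = (ages - ages.mean()) / ages.std()

# Equivalent result using scikit-learn (StandardScaler also uses the population std)
scaler = StandardScaler()
standardized = scaler.fit_transform(ages)

print(manual.ravel())        # approx. [-1.04 -0.72  0.24  1.53]
print(standardized.ravel())  # approx. [-1.04 -0.72  0.24  1.53]
print(standardized.mean(), standardized.std())  # approx. 0.0 and 1.0
```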
To summarize the worked example: the original 'Age' values [20, 25, 40, 60] become [0.0, 0.125, 0.5, 1.0] after normalization (mapped to [0, 1]) and approximately [-1.04, -0.72, 0.24, 1.53] after standardization (centered around 0 with unit standard deviation).
There's no single best answer; the choice often depends on the algorithm you plan to use and the nature of your data. Normalization is a reasonable choice when an algorithm expects inputs in a fixed range such as [0, 1] (for example, some neural networks or image-processing pipelines) and your data contains few outliers. Standardization is generally preferred when the data contains outliers or when the algorithm assumes features centered around zero.
In practice, standardization is often the default choice unless you have a specific reason to use normalization. It's also worth noting that some algorithms, particularly tree-based methods like Decision Trees and Random Forests, are inherently insensitive to the scale of the features and do not strictly require scaling, although applying it usually doesn't hurt performance.
When you start building models using machine learning libraries (like Scikit-learn in Python), you'll find convenient tools to apply these scaling techniques easily to your datasets. Remember to fit the scaler (calculate min/max or mean/std) only on your training data and then use that same fitted scaler to transform both your training and testing data to avoid data leakage (learning information from the test set during preprocessing).
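The sketch below (with a small made-up feature matrix of bedrooms and prices) shows this fit-on-training, transform-both pattern using scikit-learn:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: [bedrooms, price] for a handful of houses
X = np.array([
    [2,  80_000],
    [3, 120_000],
    [4, 250_000],
    [1,  60_000],
    [5, 900_000],
    [3, 180_000],
])

X_train, X_test = train_test_split(X, test_size=0.33, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit: learns mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # transform: reuses the training statistics, avoiding leakage
```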
Feature scaling is a simple yet significant step in preparing your data, helping many algorithms learn more effectively and preventing features with larger values from unduly influencing the results.