While visualizations like histograms and box plots give us a graphical sense of data spread and potential extreme values, statistical methods provide quantitative ways to flag potential outliers. An outlier is generally defined as an observation that lies an abnormal distance from other values in a random sample from a population. Identifying these points is an important part of EDA for several reasons: they might indicate data entry errors, measurement issues, or perhaps genuinely interesting, unusual occurrences in the data. However, they can also significantly influence statistical summaries (like the mean and standard deviation) and affect the performance of some machine learning models.
Two common statistical techniques for identifying potential outliers in numerical data are the Z-score method and the Interquartile Range (IQR) method.
The Z-score measures how many standard deviations a particular data point is away from the mean of the distribution. Assuming the data follows a roughly normal (bell-shaped) distribution, points with Z-scores falling outside a certain threshold are often considered potential outliers.
The formula for calculating the Z-score of a data point x is:
$$Z = \frac{x - \mu}{\sigma}$$
Where x is the data point, μ (mu) is the mean of the distribution, and σ (sigma) is its standard deviation.
A common rule of thumb is to flag data points with an absolute Z-score greater than 3 as potential outliers. This threshold is based on the properties of the normal distribution, where approximately 99.7% of the data falls within 3 standard deviations of the mean. Therefore, a value outside this range is statistically rare.
Considerations: the Z-score method assumes the data is approximately normally distributed. In addition, because the mean and standard deviation are themselves sensitive to extreme values, strong outliers can inflate these statistics and make other outliers harder to detect.
In Pandas, you could calculate the Z-scores for a Series s using (s - s.mean()) / s.std().
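As a minimal sketch, the snippet below applies this calculation to a synthetic Series; the generated data, the injected extreme value, and the threshold of 3 are illustrative assumptions rather than part of any specific dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical data: values clustered around 50, plus one injected extreme point
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(loc=50, scale=5, size=1000))
s.iloc[0] = 120  # inject an obvious outlier for illustration

# Z-score: how many standard deviations each point lies from the mean
z_scores = (s - s.mean()) / s.std()

# Flag points whose absolute Z-score exceeds the common threshold of 3
potential_outliers = s[z_scores.abs() > 3]
print(potential_outliers)
```

The same pattern works on a DataFrame column; only the flagged values are returned, leaving the decision about what to do with them for later.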
The Interquartile Range (IQR) method provides a more robust approach to outlier detection, particularly for datasets that are not normally distributed or when extreme values might unduly influence the mean and standard deviation. This method focuses on the range between the first quartile (Q1, the 25th percentile) and the third quartile (Q3, the 75th percentile) of the data. The IQR represents the spread of the middle 50% of the data.
The steps are:
1. Calculate the first quartile (Q1) and the third quartile (Q3).
2. Compute the IQR as Q3 - Q1.
3. Define a lower bound of Q1 - 1.5 * IQR and an upper bound of Q3 + 1.5 * IQR.
4. Flag any data point falling below the lower bound or above the upper bound as a potential outlier.
Why 1.5? The multiplier 1.5 is a standard convention, originating from John Tukey's work in exploratory data analysis. It provides a reasonable balance for identifying values that are notably distant from the central bulk of the data. You might sometimes see a multiplier of 3 used for identifying "extreme" outliers. The whiskers in a standard box plot typically extend to the furthest data points within these 1.5 * IQR bounds.
Considerations: because it relies on quartiles rather than the mean and standard deviation, the IQR method is less influenced by the extreme values themselves. For heavily skewed distributions, however, it may flag a large number of legitimate values in the long tail.
In Pandas, you can calculate Q1 and Q3 using the .quantile() method (e.g., s.quantile(0.25) and s.quantile(0.75)), compute the IQR, and then apply the bounds to filter the data.
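As a hedged sketch using the same synthetic Series as in the Z-score example (the data is again an illustrative assumption), the snippet below computes the quartiles with .quantile(), derives the IQR, and filters values outside the 1.5 * IQR fences:

```python
import numpy as np
import pandas as pd

# Same hypothetical Series as in the Z-score example
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(loc=50, scale=5, size=1000))
s.iloc[0] = 120  # injected extreme value for illustration

q1 = s.quantile(0.25)   # first quartile (25th percentile)
q3 = s.quantile(0.75)   # third quartile (75th percentile)
iqr = q3 - q1           # spread of the middle 50% of the data

# Tukey's fences: 1.5 * IQR beyond each quartile
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Flag values falling outside the bounds as potential outliers
potential_outliers = s[(s < lower_bound) | (s > upper_bound)]
print(potential_outliers)
```

Swapping the 1.5 multiplier for 3 in the two bound calculations gives the stricter "extreme outlier" fences mentioned above.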
It's important to remember that these statistical methods only flag potential outliers. They don't automatically mean the data point is incorrect or should be removed. Once identified, potential outliers require further investigation: check whether the value could be a data entry or measurement error, examine the rest of the record for context, and consider whether it might represent a genuine but unusual observation.
The decision on how to handle outliers (e.g., correct, remove, transform the data, or use robust statistical methods) depends heavily on the context, the source of the outlier, and the goals of your analysis. During EDA, the primary goal is often just to identify and understand these unusual points.