While measures like the mean, median, mode, variance, and standard deviation tell us about the center and spread of our data, they don't capture the full picture. Two datasets could have the same mean and standard deviation but look vastly different. This is where measures of shape, specifically skewness and kurtosis, come into play. They help us understand the asymmetry and the "tailedness" of our data distribution, respectively.
Skewness quantifies how much a distribution deviates from perfect symmetry. A symmetric distribution, like the classic bell curve (Normal distribution), has a skewness of zero. Its left and right sides are mirror images around the central peak.
Comparing symmetric (blue), positively skewed (orange), and negatively skewed (purple) distributions. Note the position of the long tail relative to the main peak.
Understanding skewness is important because highly skewed data can violate the assumptions of some statistical tests and machine learning models, especially those that assume normally distributed inputs or residuals, such as linear regression. Transformations, such as a logarithmic transform, are sometimes applied to skewed data to make it more symmetric before modeling.
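As a quick sketch, assuming a synthetic positively skewed sample generated with NumPy, you can measure skewness with SciPy and compare it before and after a log transform:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical positively skewed sample: exponential data has a long right tail
raw = rng.exponential(scale=2.0, size=10_000)
print(f"Skewness before transform: {stats.skew(raw):.2f}")  # about 2 for an exponential sample

# A log transform compresses the long right tail, reducing the skew
transformed = np.log1p(raw)
print(f"Skewness after log1p:      {stats.skew(transformed):.2f}")
```

The transformed skewness should be much closer to zero, which is the point of applying such transforms before fitting models that prefer symmetric data.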
Kurtosis measures the "tailedness" of a distribution – how much of the data is concentrated in the tails compared to the center. It's often described in relation to the Normal distribution, which is considered mesokurtic.
The standard measure is excess kurtosis, which is calculated as:

Excess Kurtosis = Kurtosis − 3

A Normal distribution has a kurtosis of 3, so its excess kurtosis is 0.
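For example, assuming a large synthetic Normal sample, both pandas' `Series.kurt()` and SciPy's `kurtosis` (with its default `fisher=True`) report excess kurtosis, so values near zero are expected:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = pd.Series(rng.normal(size=100_000))

# pandas and SciPy (fisher=True, the default) both report *excess* kurtosis,
# so a large Normal sample should land close to 0
print(f"pandas .kurt():          {normal_sample.kurt():.3f}")
print(f"scipy kurtosis (excess): {stats.kurtosis(normal_sample, fisher=True):.3f}")

# The raw (non-excess) kurtosis of the same sample is close to 3
print(f"scipy kurtosis (raw):    {stats.kurtosis(normal_sample, fisher=False):.3f}")
```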
It's a common misconception that kurtosis only measures the peakedness of a distribution. While peakedness is often related, the primary driver of kurtosis is the weight of the tails. A distribution can have a high peak but light tails, or a lower peak but heavy tails. Kurtosis specifically reflects the impact of extreme values.
Comparing distributions with different kurtosis: leptokurtic (red, heavy tails), mesokurtic (blue, normal tails), and platykurtic (green, light tails).
High kurtosis (leptokurtosis) signals the potential presence of significant outliers or fat-tailed behavior, which is important information for risk management and model selection. Low kurtosis (platykurtosis) might suggest data that is more bounded or uniform than a normal distribution.
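A small sketch with synthetic Laplace (heavy-tailed) and uniform (light-tailed) samples illustrates this contrast; the exact sample values will vary with the random seed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

heavy_tailed = rng.laplace(size=50_000)   # leptokurtic: theoretical excess kurtosis of 3
light_tailed = rng.uniform(size=50_000)   # platykurtic: theoretical excess kurtosis of -1.2

print(f"Laplace excess kurtosis: {stats.kurtosis(heavy_tailed):.2f}")
print(f"Uniform excess kurtosis: {stats.kurtosis(light_tailed):.2f}")
```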
Together, skewness and kurtosis provide a more detailed description of your data's distribution beyond simple center and spread. Calculating these values is a standard step in exploratory data analysis (EDA) and helps inform subsequent analytical choices. Libraries like Pandas make computing these metrics straightforward, as we'll see later in this chapter.