Perhaps the most widely recognized and frequently encountered continuous probability distribution in statistics and machine learning is the Normal distribution, also known as the Gaussian distribution or the bell curve. Its prevalence stems not only from its ability to approximate a vast range of natural phenomena but also from its central role in statistical theory, particularly due to the Central Limit Theorem (which we will explore in Chapter 4).
Many real-world measurements, such as human height, experimental measurement errors, and blood pressure, tend to follow a Normal distribution, at least approximately. This makes it an indispensable tool for modeling continuous data.
Defining the Normal Distribution
A continuous random variable X follows a Normal distribution if its probability density function (PDF) is given by:
f(x | μ, σ²) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))
This distribution is completely characterized by two parameters:
Mean (μ): This parameter represents the center or peak of the distribution. It dictates the location of the bell curve along the horizontal axis.
Variance (σ²): This parameter measures the spread or width of the distribution. A larger variance leads to a shorter, wider curve, while a smaller variance results in a taller, narrower curve. Often, the distribution is parameterized using the standard deviation (σ = √σ²), which has the same units as the random variable X.
We denote a Normal distribution with mean μ and variance σ² as N(μ, σ²).
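As a quick illustration, the PDF can be evaluated either directly from the formula above or with scipy.stats.norm. This is a minimal sketch using arbitrarily chosen parameter values, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy.stats import norm

# Arbitrary example parameters: mean 3, variance 4 (standard deviation 2)
mu, sigma = 3.0, 2.0
x = 4.5

# Evaluate the Normal PDF directly from the formula above
pdf_manual = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# SciPy parameterizes the Normal by its mean (loc) and standard deviation (scale)
pdf_scipy = norm.pdf(x, loc=mu, scale=sigma)

print(pdf_manual, pdf_scipy)  # the two values agree
```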
Properties of the Normal Distribution
The Normal distribution possesses several distinct characteristics:
Bell-Shaped Curve: The graph of its PDF is a symmetric, unimodal bell shape.
Symmetry: The curve is perfectly symmetric around its mean, μ.
Mean, Median, and Mode: Due to its symmetry, the mean, median, and mode of a Normal distribution are all equal (μ).
Total Area: Like any PDF, the total area under the curve is equal to 1.
Asymptotic Tails: The curve approaches the horizontal axis asymptotically, meaning it gets closer and closer but never actually touches it as x moves towards positive or negative infinity.
Normal distributions with different means (μ) and variances (σ²). N(0, 1) is centered at 0 with standard deviation 1. N(3, 1) is shifted right, centered at 3. N(0, 4) is centered at 0 but is wider due to its larger standard deviation of 2 (σ² = 4).
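A short numerical sanity check of the unit-area and symmetry properties, sketched with SciPy (the example parameters are arbitrary):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma = 0.0, 2.0  # arbitrary example: N(0, 4)

# Total area under the PDF is 1 (numerical integration over the whole real line)
area, _ = quad(lambda x: norm.pdf(x, loc=mu, scale=sigma), -np.inf, np.inf)
print(area)  # ~1.0

# Symmetry about the mean: f(mu + d) equals f(mu - d) for any distance d
d = 1.7
print(norm.pdf(mu + d, mu, sigma), norm.pdf(mu - d, mu, sigma))  # equal values
```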
The Empirical Rule (68-95-99.7 Rule)
A useful rule of thumb for Normal distributions relates the standard deviation to the proportion of data falling within certain ranges around the mean:
Approximately 68% of the data falls within one standard deviation of the mean (μ±σ).
Approximately 95% of the data falls within two standard deviations of the mean (μ±2σ).
Approximately 99.7% of the data falls within three standard deviations of the mean (μ±3σ).
This rule provides a quick way to understand the spread of data if it's normally distributed.
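Because the rule is the same for every Normal distribution, it can be verified with the Standard Normal CDF. A minimal check using SciPy (assuming scipy.stats is available):

```python
from scipy.stats import norm

# Probability within k standard deviations of the mean; the answer is the
# same for every Normal distribution, so the Standard Normal suffices.
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sd: {p:.4f}")
# within 1 sd: 0.6827
# within 2 sd: 0.9545
# within 3 sd: 0.9973
```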
The Standard Normal Distribution (Z-Distribution)
A special case of the Normal distribution is the Standard Normal distribution, denoted as Z, which has a mean of 0 and a variance (and standard deviation) of 1, i.e., Z∼N(0,1). Its PDF simplifies to:
ϕ(z) = (1 / √(2π)) · exp(−z² / 2)
The Standard Normal distribution is particularly important because any Normal distribution X ∼ N(μ, σ²) can be transformed into a Standard Normal distribution by a simple linear transformation called standardization, which produces the Z-score:
Z = (X − μ) / σ
The Z-score tells us how many standard deviations a particular value X is away from the mean μ. This transformation is invaluable because:
Comparison: It allows comparing values from different Normal distributions by putting them on the same scale.
Probability Calculation: Probabilities for any Normal distribution can be calculated using the Cumulative Distribution Function (CDF) of the Standard Normal distribution, often denoted Φ(z). Standard Normal probability tables and computational functions (like those in SciPy) are widely available. For X ∼ N(μ, σ²), the probability P(X ≤ x) is equivalent to P(Z ≤ (x − μ)/σ) = Φ((x − μ)/σ).
The Standard Normal distribution N(0,1), illustrating the approximate areas within 1, 2, and 3 standard deviations from the mean (0) according to the Empirical Rule.
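As a worked example with purely illustrative numbers, the same probability can be obtained by standardizing and applying Φ, or by letting SciPy handle the location and scale directly:

```python
from scipy.stats import norm

# Purely illustrative numbers: X ~ N(100, 15^2); find P(X <= 120)
mu, sigma, x = 100.0, 15.0, 120.0

# Standardize, then apply the Standard Normal CDF (Phi)
z = (x - mu) / sigma
p_via_z = norm.cdf(z)

# Equivalently, pass the mean and standard deviation to SciPy directly
p_direct = norm.cdf(x, loc=mu, scale=sigma)

print(z, p_via_z, p_direct)  # z ~ 1.33; both probabilities ~ 0.909
```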
Relevance in Machine Learning
The Normal distribution is foundational in many statistical and machine learning contexts:
Modeling Residuals: In linear regression (covered in Chapter 6), a common assumption is that the errors (residuals) between the predicted and actual values are normally distributed.
Feature Distribution: Some algorithms perform better when input features follow a Normal distribution. Transformations such as log or power transforms can bring skewed features closer to normality.
Parameter Initialization: Weights in neural networks are often initialized using values drawn from a Normal distribution.
Algorithm Component: Certain algorithms, like Gaussian Naive Bayes, explicitly assume features follow a Normal distribution within each class. Linear Discriminant Analysis (LDA) also relies on this assumption.
Central Limit Theorem: As mentioned, this theorem states that the distribution of sample means approaches a Normal distribution as the sample size increases, regardless of the population's original distribution. This justifies the use of Normal-based inference in many situations.
In the upcoming practical sections and later chapters, you'll see how to use Python libraries like SciPy (scipy.stats.norm) to calculate probabilities (PDF, CDF), generate random samples (rvs), and fit Normal distributions to data. Understanding its properties is a significant step towards applying many statistical techniques effectively.
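As a brief preview of that interface, here is a minimal sketch of these operations; the parameter values and random seed are arbitrary choices:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# Draw random samples from N(5, 2^2)
samples = norm.rvs(loc=5, scale=2, size=1000, random_state=rng)

# Evaluate the PDF and CDF at chosen points
print(norm.pdf(5.0, loc=5, scale=2))  # density at the mean
print(norm.cdf(7.0, loc=5, scale=2))  # P(X <= 7), about 0.84 (one sd above the mean)

# Fit a Normal distribution to the data (maximum likelihood estimates)
mu_hat, sigma_hat = norm.fit(samples)
print(mu_hat, sigma_hat)  # estimates close to 5 and 2
```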