After exploring the Uniform distribution, where every outcome in a range is equally likely, we now turn to perhaps the most frequently encountered and significant continuous distribution in probability and statistics: the Normal distribution, also widely known as the Gaussian distribution or the bell curve.
Its prevalence isn't accidental. Many natural phenomena, like human heights, measurement errors, and blood pressure, tend to follow a normal distribution. Furthermore, it plays a foundational role in many statistical theories and machine learning algorithms, partly due to the Central Limit Theorem, which we'll discuss later in this chapter.
The Shape of the Normal Distribution
The Normal distribution is characterized by its symmetric, bell-like shape. The curve is centered around its mean, and its spread or width is determined by its standard deviation.
Parameters: Mean and Standard Deviation
A specific Normal distribution is defined by two parameters:
Mean (μ): This parameter sets the center of the distribution. It marks the location of the peak of the bell curve and is also the average value of the random variable. Changing the mean shifts the entire curve left or right along the number line without changing its shape.
Standard Deviation (σ): This parameter controls the spread or dispersion of the distribution. A smaller standard deviation results in a taller, narrower curve, indicating that data points are tightly clustered around the mean. A larger standard deviation leads to a shorter, wider curve, signifying more variability in the data. The variance (σ²) is also often used in the definition. The short sketch after this list illustrates both effects numerically.
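To make these two effects concrete, here is a minimal sketch (assuming NumPy and SciPy are installed) that evaluates the density at a few points: changing μ shifts where the values peak, while increasing σ lowers and widens the curve.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-4, 4, 5)  # a few evenly spaced evaluation points

# Same sigma, different mu: the density values shift along with the mean.
print(norm.pdf(x, loc=0, scale=1))  # values peak near x = 0
print(norm.pdf(x, loc=2, scale=1))  # same shape, peak shifted toward x = 2

# Same mu, different sigma: a larger sigma gives a shorter, wider curve.
print(norm.pdf(0, loc=0, scale=1))    # ≈ 0.3989 (tall and narrow at the mean)
print(norm.pdf(0, loc=0, scale=1.5))  # ≈ 0.2660 (shorter and wider at the mean)
```

Note that SciPy names the parameters loc (the mean) and scale (the standard deviation).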
The Probability Density Function (PDF)
Recall that for continuous distributions, we use a Probability Density Function (PDF) to describe the likelihood of a variable taking on a value within a specific range (represented by the area under the curve). The PDF for the Normal distribution is given by the formula:
f(x ∣ μ, σ²) = (1 / √(2πσ²)) · e^(−(x−μ)² / (2σ²))
Where:
x is the value of the random variable.
μ is the mean.
σ² is the variance (σ is the standard deviation).
π is the mathematical constant Pi (approximately 3.14159).
e is the base of the natural logarithm (approximately 2.71828).
While the formula might seem intimidating, the main takeaway is that the shape of this curve is entirely determined by the mean μ and the standard deviation σ. The total area under this curve, like any PDF, is always equal to 1.
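As a sanity check on both of these points, the following sketch (assuming NumPy and SciPy are available) implements the formula directly, compares it against scipy.stats.norm, and numerically integrates it to confirm the total area is 1. The helper name normal_pdf is just for illustration.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def normal_pdf(x, mu, sigma):
    """Normal PDF computed straight from the formula above."""
    coeff = 1.0 / np.sqrt(2 * np.pi * sigma**2)
    exponent = -((x - mu) ** 2) / (2 * sigma**2)
    return coeff * np.exp(exponent)

mu, sigma = 2.0, 1.5

# The hand-written formula agrees with SciPy's implementation.
print(np.isclose(normal_pdf(1.0, mu, sigma),
                 norm.pdf(1.0, loc=mu, scale=sigma)))  # True

# The area under the curve integrates to 1 (numerically, over a wide interval).
area, _ = quad(normal_pdf, -50, 50, args=(mu, sigma))
print(round(area, 6))  # 1.0
```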
Visualizing the Normal Distribution
The plot below shows the characteristic bell shape for a standard Normal distribution (where μ=0 and σ=1) and another Normal distribution with a different mean and standard deviation (μ=2, σ=1.5).
Comparison of two Normal distribution PDFs. The blue curve (μ=0, σ=1) is the standard Normal distribution. The green curve (μ=2, σ=1.5) is centered at 2 and is wider due to the larger standard deviation.
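A plot like this can be reproduced with a few lines of matplotlib. The sketch below (assuming matplotlib and SciPy are installed) uses the same parameters as the caption; the axis range and styling are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-5, 8, 400)  # a range wide enough to show both curves

plt.plot(x, norm.pdf(x, loc=0, scale=1), color="blue",
         label="μ = 0, σ = 1 (standard Normal)")
plt.plot(x, norm.pdf(x, loc=2, scale=1.5), color="green",
         label="μ = 2, σ = 1.5")
plt.xlabel("x")
plt.ylabel("Density f(x)")
plt.title("Comparison of two Normal distribution PDFs")
plt.legend()
plt.show()
```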
The Empirical Rule (68-95-99.7 Rule)
A useful guideline for understanding the spread of a Normal distribution is the Empirical Rule:
Approximately 68% of the data falls within one standard deviation of the mean (i.e., between μ−σ and μ+σ).
Approximately 95% of the data falls within two standard deviations of the mean (i.e., between μ−2σ and μ+2σ).
Approximately 99.7% of the data falls within three standard deviations of the mean (i.e., between μ−3σ and μ+3σ).
For normally distributed data, this rule gives a quick estimate of the proportion of observations expected within a given range; the short check below confirms the three percentages.
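The percentages follow directly from the Normal CDF: the probability of landing within k standard deviations of the mean is the CDF evaluated at μ+kσ minus the CDF at μ−kσ. A quick check (assuming SciPy):

```python
from scipy.stats import norm

mu, sigma = 0.0, 1.0  # the rule holds for any mu and sigma
for k in (1, 2, 3):
    # P(mu - k*sigma < X < mu + k*sigma) via the CDF
    prob = (norm.cdf(mu + k * sigma, loc=mu, scale=sigma)
            - norm.cdf(mu - k * sigma, loc=mu, scale=sigma))
    print(f"within {k} standard deviation(s): {prob:.4f}")

# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```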
Why is the Normal Distribution Important in Machine Learning?
The Normal distribution appears frequently in machine learning contexts:
Modeling Residuals: In many regression models (like linear regression), the assumption is that the errors (or residuals, the difference between predicted and actual values) are normally distributed.
Algorithm Assumptions: Some algorithms, like Gaussian Naive Bayes, explicitly assume that features follow a Normal distribution. Linear Discriminant Analysis (LDA) also assumes normally distributed data within each class.
Central Limit Theorem: As mentioned, this theorem (discussed later) states that the distribution of sample means approaches a Normal distribution as the sample size gets larger, regardless of the original population distribution. This is fundamental for statistical inference.
Parameter Initialization: Weights in neural networks are often initialized using values drawn from a Normal distribution (see the sketch after this list).
Natural Processes: When modeling real-world processes or data that results from the sum of many small, independent effects, the Normal distribution often provides a good approximation.
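As one concrete illustration of the initialization point above, the sketch below draws a small weight matrix from a Normal distribution with NumPy. The shape (4, 3), the seed, and the standard deviation of 0.01 are arbitrary values chosen for the example, not a recommended initialization scheme.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seeded only for reproducibility

# Draw a 4x3 weight matrix with mean 0 and a small standard deviation;
# both the shape and the scale are illustrative choices.
weights = rng.normal(loc=0.0, scale=0.01, size=(4, 3))

print(weights.shape)     # (4, 3)
print(weights.round(4))  # small values scattered around 0
```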
Understanding the properties of the Normal distribution is therefore essential for applying and interpreting many statistical and machine learning methods correctly. In the upcoming sections, we will see how to generate data points that follow this distribution using Python.