While numerical summaries like mean, median, and standard deviation give us precise values for central tendency and dispersion, they don't fully capture the shape of a variable's distribution. Is the data clustered around the mean, or is it spread out? Is it symmetric, or skewed to one side? Are there multiple peaks? To answer these questions, we turn to visualizations, and the primary tool for understanding the distribution of a single numerical variable is the histogram.
A histogram groups values into ranges, called "bins" or "intervals," and then plots the frequency (count) of observations that fall into each bin. The horizontal axis represents the range of the variable's values, divided into these bins, while the vertical axis represents the frequency or density of observations within each bin. This creates a bar-like chart that provides a visual approximation of the underlying probability distribution.
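To make the binning step concrete, here is a minimal pure-Python sketch of how equal-width bins and their counts could be computed. The function name histogram_counts and the sample ages are made up for illustration; in practice the plotting libraries do this work for you.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins.

    Assumes the values are not all identical (otherwise the bin width is 0).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Map the value to a bin index; the maximum value goes in the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return counts, edges


ages = [22, 25, 27, 31, 34, 35, 38, 41, 47, 62]  # hypothetical sample
counts, edges = histogram_counts(ages, num_bins=4)

# Density rescales counts so that the bar areas sum to 1.
bin_width = edges[1] - edges[0]
density = [c / (len(ages) * bin_width) for c in counts]

print(counts)
print(edges)
print(density)
```

Plotting counts gives a frequency histogram; plotting density gives bars whose total area is 1, which is the scale on which a histogram approximates a probability distribution.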
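The choice of bins matters. The number of bins (or, equivalently, the width of each bin) can substantially change the histogram's appearance and the conclusions we draw from it: too few bins can hide features such as multiple peaks, while too many can make random noise look like structure.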
Common rules of thumb exist for choosing the number of bins, such as Sturges' formula and the Freedman-Diaconis rule, but the best approach is often to try several bin counts or widths and keep the one that most clearly reveals the data's structure. Library defaults differ: Pandas and Matplotlib default to 10 bins, while Seaborn's histplot picks bins automatically from the data (bins='auto'), so it is worth setting bins explicitly rather than relying on a default.
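As a rough illustration of the two rules named above, the sketch below computes Sturges' bin count, k = ceil(log2(n)) + 1, and the Freedman-Diaconis bin width, h = 2 * IQR / n^(1/3), using only the standard library. The function names and the sample data are assumptions made for this example, and the quartile method used by statistics.quantiles may differ slightly from the one a plotting library uses, so the results are indicative only.

```python
import math
import statistics


def sturges_bins(n):
    """Sturges' formula: number of bins for n observations."""
    return math.ceil(math.log2(n)) + 1


def freedman_diaconis_width(values):
    """Freedman-Diaconis rule: bin width based on the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return 2 * iqr / len(values) ** (1 / 3)


values = list(range(1, 101))  # hypothetical data: the integers 1 to 100

k_sturges = sturges_bins(len(values))
fd_width = freedman_diaconis_width(values)
# Convert the Freedman-Diaconis width into a bin count over the data range.
k_fd = math.ceil((max(values) - min(values)) / fd_width)

print("Sturges bins:", k_sturges)
print("Freedman-Diaconis width:", fd_width)
print("Freedman-Diaconis bins:", k_fd)
```

Either number can be passed as the bins argument of a plotting call and used as a starting point for experimentation.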
Let's see how to generate histograms using popular Python libraries. We'll assume you have a Pandas DataFrame named df and are interested in a numerical column named 'age'.
Using Pandas:
Pandas Series and DataFrames have a built-in .hist() method, which uses Matplotlib under the hood.
import pandas as pd
import matplotlib.pyplot as plt
# Assuming 'df' is your DataFrame and 'age' is the column
df['age'].hist(bins=15, grid=False, figsize=(8, 5)) # Experiment with the number of bins
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
Using Matplotlib:
You can use Matplotlib directly for more control.
import matplotlib.pyplot as plt
# Assuming 'df' is your DataFrame and 'age' is the column
plt.figure(figsize=(8, 5))
plt.hist(df['age'].dropna(), bins=20, color='#339af0', edgecolor='black') # dropna() is important
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)
plt.show()
Note the use of .dropna() to handle potential missing values before plotting.
Using Seaborn:
Seaborn often produces more aesthetically pleasing plots and integrates well with Pandas DataFrames. The histplot function is versatile.
import seaborn as sns
import matplotlib.pyplot as plt
# Assuming 'df' is your DataFrame and 'age' is the column
plt.figure(figsize=(8, 5))
sns.histplot(data=df, x='age', bins=25, kde=True, color='#20c997') # kde=True adds a density curve
plt.title('Distribution of Age with Density Curve')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
Adding kde=True overlays a Kernel Density Estimate, which provides a smoothed representation of the distribution's shape.
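For intuition about what a Kernel Density Estimate computes, here is a minimal sketch of a Gaussian KDE in plain Python: it places a small bell curve on every observation and averages them. The function name gaussian_kde, the sample ages, and the fixed bandwidth of 5 are assumptions for illustration; Seaborn selects a bandwidth automatically, so its curve will not match this one exactly.

```python
import math


def gaussian_kde(values, bandwidth):
    """Return a function that evaluates a Gaussian KDE at a point x."""
    n = len(values)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Sum one Gaussian bump per observation, centred on that observation.
        total = sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        return total / norm

    return density


ages = [22, 25, 27, 31, 34, 35, 38, 41, 47, 62]  # hypothetical sample
density = gaussian_kde(ages, bandwidth=5.0)       # bandwidth chosen by hand

# Evaluate the smoothed curve at a few ages.
for x in (20, 30, 40, 50, 60):
    print(x, density(x))
```

A larger bandwidth produces a smoother curve and a smaller one follows the data more closely, which mirrors the effect of bin width in the histogram itself.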
Interactive Plotly figure: histogram showing the frequency distribution of a sample numerical dataset.
When examining a histogram, look for these characteristics:
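Shape: Is the distribution roughly symmetric, or is it skewed to one side, with a longer tail on the left or the right?
Modality (Peaks): Is there a single peak, or are there multiple peaks? Multiple peaks can point to distinct subgroups in the data.
Spread (Dispersion): Are the values clustered tightly around the center, or spread out over a wide range?
Outliers and Gaps: Are there isolated bars far from the bulk of the data, or empty bins between groups of values?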
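Some of the characteristics in this checklist can be cross-checked numerically. The sketch below is a rough heuristic, not a standard method: it counts local peaks and empty bins in a list of bin counts, and compares the mean with the median as a crude hint of skew direction. The helper names, the bin counts, and the sample values are all made up for illustration.

```python
import statistics


def count_peaks(counts):
    """Count bins strictly taller than both neighbours (the ends compare to 0)."""
    padded = [0] + list(counts) + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )


def empty_bins(counts):
    """Return the indices of bins that contain no observations."""
    return [i for i, c in enumerate(counts) if c == 0]


# Hypothetical bin counts: one main peak, a gap, then an isolated bar.
counts = [2, 5, 9, 4, 1, 0, 0, 1]
print("peaks:", count_peaks(counts))
print("empty bins:", empty_bins(counts))

# Hypothetical raw values with one large observation on the right.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
mean, median = statistics.mean(values), statistics.median(values)
# A mean well above the median often (not always) accompanies right skew.
print("mean:", mean, "median:", median)
```

Note that the isolated bar after the gap is counted as a peak by this heuristic, which is a reminder that such numbers support, rather than replace, looking at the plot.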
Histograms provide an immediate visual summary that complements numerical statistics. They help you quickly grasp the nature of your numerical variables, identify potential issues like skewness or outliers, and guide subsequent analysis steps. For instance, strong skewness might suggest considering data transformations later on.
© 2025 ApX Machine Learning