While numerical summaries like mean, median, and standard deviation give us precise values for central tendency and dispersion, they don't fully capture the shape of a variable's distribution. Is the data clustered around the mean, or is it spread out? Is it symmetric, or skewed to one side? Are there multiple peaks? To answer these questions, we turn to visualizations, and the primary tool for understanding the distribution of a single numerical variable is the histogram.
A histogram groups values into ranges, called "bins" or "intervals," and then plots the frequency (count) of observations that fall into each bin. The horizontal axis represents the range of the variable's values, divided into these bins, while the vertical axis represents the frequency or density of observations within each bin. This creates a bar-like chart that provides a visual approximation of the underlying probability distribution.
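To make the binning step concrete, here is a minimal sketch using NumPy's np.histogram. The eight sample values and the choice of four bins over the range 0 to 10 are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical sample: eight observations of some numerical variable.
values = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 7.0, 9.5])

# Four equal-width bins spanning [0, 10]; the edges are 0, 2.5, 5, 7.5, 10.
counts, edges = np.histogram(values, bins=4, range=(0, 10))

# density=True rescales the bar heights so the total bar area equals 1.
density, _ = np.histogram(values, bins=4, range=(0, 10), density=True)
bin_widths = np.diff(edges)
total_area = (density * bin_widths).sum()

print(counts)      # observations per bin: [2 4 1 1]
print(edges)       # bin boundaries
print(total_area)  # approximately 1.0
```

Each bin is half-open, so the value 2.5 is counted in the second bin, not the first; only the last bin includes its right edge.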
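The choice of bins matters. The number of bins (or, equivalently, the width of each bin) can substantially change how the histogram looks and what we conclude from it. Too few bins can merge distinct features, such as two nearby peaks, into one broad bar; too many bins leave only a handful of observations in each, so random noise can look like structure.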
There are rules of thumb for selecting the number of bins (like Sturges' formula or the Freedman-Diaconis rule), but often the best approach is to experiment with different bin counts or widths until the representation clearly reveals the data's structure. Library defaults differ: Matplotlib's plt.hist and Pandas' .hist() use 10 bins unless told otherwise, while Seaborn's histplot defaults to bins='auto', a data-driven rule (NumPy's 'auto', which combines the Sturges and Freedman-Diaconis estimates).
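As a rough sketch of how the two rules named above translate into bin counts, the snippet below computes them by hand and compares them with NumPy's built-in versions. The sample (1,000 synthetic heights) and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
sample = rng.normal(loc=175, scale=10, size=1000)  # hypothetical heights in cm
n = sample.size

# Sturges' formula: k = ceil(log2(n)) + 1 bins.
sturges_bins = int(np.ceil(np.log2(n))) + 1

# Freedman-Diaconis rule: bin width h = 2 * IQR / n^(1/3).
q75, q25 = np.percentile(sample, [75, 25])
fd_width = 2 * (q75 - q25) / n ** (1 / 3)
fd_bins = int(np.ceil((sample.max() - sample.min()) / fd_width))

# NumPy exposes both rules by name through histogram_bin_edges.
np_sturges = len(np.histogram_bin_edges(sample, bins="sturges")) - 1
np_fd = len(np.histogram_bin_edges(sample, bins="fd")) - 1

print(f"Sturges: {sturges_bins} bins (NumPy: {np_sturges})")
print(f"Freedman-Diaconis: {fd_bins} bins (NumPy: {np_fd})")
```

Sturges depends only on the sample size, whereas Freedman-Diaconis also reacts to the spread of the data, so the two can disagree noticeably on large or heavy-tailed samples.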
Let's see how to generate histograms using popular Python libraries. Each example below builds its own small synthetic DataFrame, so it can be run on its own; with real data, substitute your DataFrame and the numerical column you are interested in.
Using Pandas:
Pandas DataFrames have a built-in .hist() method, which uses Matplotlib under the hood.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Sample DataFrame: 1,000 synthetic heights, normally distributed (mean 175 cm, SD 10 cm)
data = {'height_cm': np.random.normal(175, 10, 1000)}
df_students = pd.DataFrame(data)
# Plot the column as a histogram with 15 bins
df_students['height_cm'].hist(bins=15, grid=False, figsize=(8, 5))
plt.title('Distribution of Student Heights')
plt.xlabel('Height (cm)')
plt.ylabel('Frequency')
plt.show()
Using Matplotlib:
You can use Matplotlib directly for more control.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Sample DataFrame: 500 synthetic exam scores, normally distributed (mean 75, SD 15)
data = {'exam_score': np.random.normal(75, 15, 500)}
df_class = pd.DataFrame(data)
# Plot with plt.hist directly, which exposes styling options such as color and edgecolor
plt.figure(figsize=(8, 5))
plt.hist(df_class['exam_score'].dropna(), bins=20, color='#339af0', edgecolor='black')
plt.title('Distribution of Exam Scores')
plt.xlabel('Exam Score')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)
plt.show()
Note the use of .dropna() to handle potential missing values before plotting.
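If you want to know how many values are being excluded, a small check along the following lines can help. The Series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical exam scores with two missing entries.
scores = pd.Series([62.0, 75.5, np.nan, 88.0, 71.0, np.nan, 93.5])

n_missing = int(scores.isna().sum())  # how many values will be excluded
clean_scores = scores.dropna()        # returns a new Series; the original is unchanged

print(f"Dropping {n_missing} of {len(scores)} values before plotting")
```

Reporting the number of dropped values alongside the plot makes it clear how much of the data the histogram actually represents.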
Using Seaborn:
Seaborn often produces more aesthetically pleasing plots and integrates well with Pandas DataFrames. The histplot function is versatile.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Sample DataFrame: 200 integer ages drawn uniformly from 20 to 59 (randint excludes the upper bound)
data = {'age_in_years': np.random.randint(20, 60, 200)}
df_employees = pd.DataFrame(data)
# Two-year bins aligned to the integer ages, so every bin covers exactly two ages
plt.figure(figsize=(8, 5))
sns.histplot(data=df_employees, x='age_in_years', bins=np.arange(20, 61, 2), kde=True, color='#20c997')
plt.title('Distribution of Employee Ages with Density Curve')
plt.xlabel('Age (Years)')
plt.ylabel('Frequency')
plt.show()
Adding kde=True overlays a Kernel Density Estimate, which provides a smoothed representation of the distribution's shape.
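To see what a kernel density estimate does, here is a minimal NumPy sketch that places a Gaussian bump on every observation and averages them. The helper name gaussian_kde_sketch, the sample, and the bandwidth rule (Scott's rule of thumb) are assumptions for illustration; Seaborn's own implementation and bandwidth handling may differ.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average one Gaussian kernel per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (samples.size * bandwidth)

rng = np.random.default_rng(0)
ages = rng.integers(20, 60, size=200)  # hypothetical integer ages, 20 to 59

# Scott's rule of thumb: bandwidth = sample SD * n^(-1/5).
bandwidth = ages.std(ddof=1) * ages.size ** (-1 / 5)

# Evaluate slightly beyond the data so the tails of the curve are included.
grid = np.linspace(ages.min() - 3 * bandwidth, ages.max() + 3 * bandwidth, 400)
density = gaussian_kde_sketch(ages, grid, bandwidth)

# A density curve should enclose an area close to 1 (simple Riemann sum).
area = float(density.sum() * (grid[1] - grid[0]))
print(f"bandwidth = {bandwidth:.2f}, area under curve = {area:.3f}")
```

A larger bandwidth gives a smoother curve that may hide real features, while a smaller one follows the individual observations more closely, which mirrors the trade-off in choosing bin width.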
Histograms can also be rendered as interactive web visualizations, for example with Plotly:
[Figure: histogram showing the frequency distribution of a sample numerical dataset.]
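When examining a histogram, look for these characteristics:
Shape: Is the distribution roughly symmetric, or is it skewed, with a longer tail to the right (positive skew) or to the left (negative skew)?
Modality (Peaks): Is there a single peak (unimodal), two peaks (bimodal), or several (multimodal)? Multiple peaks often indicate distinct subgroups in the data.
Spread (Dispersion): Are the values concentrated in a narrow range, or spread widely along the horizontal axis?
Outliers and Gaps: Are there isolated bars far from the bulk of the data, or empty bins separating groups of bars?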
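The visual checks in this list can be backed by a few numbers. The following is a minimal sketch, assuming a synthetic right-skewed sample; the 1.5 * IQR fences are one common convention for flagging outliers, not the only one, and the choice of 30 bins is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed sample, e.g. waiting times in minutes.
sample = rng.exponential(scale=10.0, size=1000)

# Shape: sample skewness (mean cubed standardized deviation).
# Positive values indicate a longer right tail, negative a longer left tail.
centered = sample - sample.mean()
skewness = (centered**3).mean() / sample.std() ** 3

# Outliers: points beyond the 1.5 * IQR fences.
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

# Gaps: bins that contain no observations at all.
counts, edges = np.histogram(sample, bins=30)
empty_bins = int((counts == 0).sum())

print(f"skewness = {skewness:.2f}")
print(f"{outliers.size} points outside [{lower_fence:.1f}, {upper_fence:.1f}]")
print(f"{empty_bins} empty bins out of {counts.size}")
```

These numbers support the histogram without replacing it: the same skewness value can arise from quite different shapes, which is why the plot comes first.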
Histograms provide an immediate visual summary that complements numerical statistics. They help you quickly grasp the nature of your numerical variables, identify potential issues like skewness or outliers, and guide subsequent analysis steps. For instance, strong skewness might suggest considering data transformations later on.