While bar charts help compare discrete categories, we often need to understand the distribution of a single continuous numerical variable. How are the values spread out? Are they clustered around a central point, or are they more evenly distributed? This is where histograms come in handy.
A histogram is a graphical representation that organizes a group of data points into specified ranges or intervals, called bins. It looks similar to a bar chart, but there's a significant difference: a histogram visualizes the frequency distribution of continuous or discrete numerical data, whereas a bar chart compares categorical data. The height of each bar in a histogram represents the number of data points (frequency) that fall within that specific bin.
Think of it like sorting values into buckets. If you have a list of student heights, a histogram could show you how many students fall into the 150-160 cm range, how many fall into the 160-170 cm range, and so on. This gives you a visual sense of the data's shape, center, and spread.
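The bucket idea can be sketched without drawing anything. The snippet below is only an illustration: the heights are made-up values, and it uses NumPy's np.histogram() to count how many fall into each range. The variable names (heights, edges, counts) are chosen for this example.

```python
import numpy as np

# Hypothetical heights in cm, made up purely for illustration
heights = np.array([152, 158, 160, 161, 165, 167, 169, 170, 171, 174, 178, 183])

# Bin edges define the buckets: 150-160, 160-170, 170-180, 180-190
edges = np.array([150, 160, 170, 180, 190])

# counts[i] is the number of values in the bucket that starts at edges[i]
counts, _ = np.histogram(heights, bins=edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right} cm: {count} students")
```

Note that a value sitting exactly on an edge, such as 160, is counted in the bucket that starts there (160-170), not the one that ends there. Only the last bucket includes its right edge.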
plt.hist()
Matplotlib provides the plt.hist() function to create histograms easily. Its most basic usage requires just one argument: the sequence of data (like a list or a NumPy array) whose distribution you want to visualize.
Let's generate some sample data representing, for example, the scores of students on a test, and then plot a histogram. We'll use NumPy to create some normally distributed random data for this illustration.
import matplotlib.pyplot as plt
import numpy as np
# Generate some sample data (e.g., test scores)
# Using a normal distribution centered around 70 with a standard deviation of 10
np.random.seed(42) # for reproducible results
scores = np.random.normal(loc=70, scale=10, size=200)
# Create the histogram
plt.figure(figsize=(8, 5)) # Optional: Adjust figure size
plt.hist(scores, color='#339af0') # Pass the data to plt.hist()
# Add labels and title for clarity
plt.xlabel("Test Score")
plt.ylabel("Number of Students (Frequency)")
plt.title("Distribution of Test Scores")
# Display the plot
plt.show()
This code first generates 200 random scores that roughly follow a normal distribution. Then, plt.hist(scores, color='#339af0') takes this data and automatically calculates the bin ranges and frequencies to draw the histogram bars. We've also added labels and a title using functions like plt.xlabel(), plt.ylabel(), and plt.title(), which you learned about previously. The result is a plot showing how many student scores fall into various score intervals.
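To see what plt.hist() actually calculated, you can capture its return values. The sketch below regenerates the same scores and assumes Matplotlib's default settings. The names counts, bin_edges, and patches are chosen here for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Recreate the same sample data as in the example above
np.random.seed(42)
scores = np.random.normal(loc=70, scale=10, size=200)

# plt.hist() returns the bar heights, the bin edges, and the drawn bars
counts, bin_edges, patches = plt.hist(scores, color='#339af0')

print("Number of bins:", len(counts))
print("Bin edges:", np.round(bin_edges, 1))
print("Counts per bin:", counts.astype(int))
print("Total data points:", int(counts.sum()))

plt.close()  # discard the figure; this sketch only inspects the numbers
```

There is always one more edge than there are bins, because each bin needs both a left and a right boundary. The counts add up to the 200 scores that were passed in.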
A histogram showing the distribution of 200 simulated test scores. The x-axis shows the score ranges (bins), and the y-axis shows the number of students whose scores fall into each range. The shape approximates a bell curve (normal distribution).
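Histograms are fundamental tools in exploratory data analysis. They help you quickly grasp where the values are centered, how widely they are spread, and what overall shape the distribution takes.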
By default, plt.hist() automatically determines the number and width of the bins based on the input data. While this automatic selection is often adequate, the choice of bins can significantly influence the histogram's appearance and potentially alter your interpretation of the data's distribution. We will look into how to control the number and width of these bins in the next section.
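As a small preview of why bin choice matters, the sketch below counts the same scores with three different numbers of equal-width bins. It uses np.histogram() rather than a plot so the effect shows up as plain numbers. The bin counts 5, 10, and 40 and the summary dictionary are arbitrary choices for illustration.

```python
import numpy as np

# Same sample data as in the earlier example
np.random.seed(42)
scores = np.random.normal(loc=70, scale=10, size=200)

summary = {}
for n_bins in (5, 10, 40):
    # Split the range from min(scores) to max(scores) into n_bins equal-width bins
    counts, edges = np.histogram(scores, bins=n_bins)
    width = edges[1] - edges[0]
    summary[n_bins] = counts
    print(f"{n_bins:>2} bins: width ~ {width:.1f}, tallest bin holds {counts.max()} scores")
```

With few bins, each bar covers a wide score range and hides detail. With many bins, each bar holds only a few scores and the outline becomes jagged.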
© 2025 ApX Machine Learning