In the previous section, we saw how plt.hist() creates a histogram, giving us a visual sense of how our data is distributed. The bars in a histogram represent the frequency (or count) of data points falling within specific ranges. These ranges are called bins. Think of bins as containers lined up along the number line; each data point gets dropped into the container corresponding to its value. The height of the bar for each container shows how many data points it holds.
Understanding and controlling these bins is fundamental to creating informative histograms. The choice of bins can significantly alter the appearance and, consequently, the interpretation of the data's distribution.
Each bin covers a specific interval along the range of your data. For example, if your data ranges from 0 to 100, you might have bins covering 0-10, 10-20, 20-30, and so on. A data point with a value of 15 would fall into the 10-20 bin, increasing the count (and thus the height of the bar) for that specific bin.
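The 0 to 100 example above can be checked numerically. The following is a minimal sketch using NumPy's np.histogram() to count points per bin; the values array is made-up illustration data, not something from the earlier section.

```python
import numpy as np

# Hypothetical sample values between 0 and 100 (illustration only)
values = np.array([3, 15, 18, 42, 47, 49, 95])

# Edges 0, 10, 20, ..., 100 define ten bins of width 10
edges = np.arange(0, 101, 10)

# counts[i] is the number of values that fall between edges[i] and edges[i + 1]
counts, _ = np.histogram(values, bins=edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>3}-{right:<3}: {count}")
```

The values 15 and 18 both land in the 10-20 bin, so that bin reports a count of 2. In a histogram, this count is the height of the bar.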
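By default, Matplotlib's plt.hist() function splits the range of your data into 10 equal-width bins (the value of the hist.bins setting in rcParams) rather than adapting the count to your data. This default is often a reasonable starting point, but it isn't always optimal.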
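To see what plt.hist() actually does when bins is omitted, you can inspect its return values. The sketch below assumes an unmodified Matplotlib configuration, in which the hist.bins setting is 10. The sample array and the Agg backend selection are illustrative choices so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumed here so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data: 200 points, mean 5, standard deviation 1.5
rng = np.random.default_rng(0)
sample = rng.normal(loc=5, scale=1.5, size=200)

# plt.hist() returns the bar heights, the bin edges, and the bar artists
counts, edges, _ = plt.hist(sample)  # no bins argument

print(len(counts))                 # number of bins actually used
print(plt.rcParams["hist.bins"])   # the setting the default comes from
plt.close()
```

Printing both values side by side makes it easy to confirm where the default comes from on your own installation.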
Let's see how changing the number of bins affects the resulting histogram. We'll use some sample data drawn from a normal distribution.
import matplotlib.pyplot as plt
import numpy as np
# Generate some sample data
np.random.seed(42) # for reproducibility
data = np.random.randn(200) * 1.5 + 5 # 200 points, mean=5, std=1.5
# --- Plotting with different bin counts ---
plt.figure(figsize=(12, 4)) # Create a figure to hold the subplots
# Plot 1: Too few bins
plt.subplot(1, 3, 1) # (rows, columns, panel number)
plt.hist(data, bins=5, color='#228be6', edgecolor='white')
plt.title('Too Few Bins (bins=5)')
plt.xlabel('Value')
plt.ylabel('Frequency')
# Plot 2: Default number of bins (bins argument omitted)
plt.subplot(1, 3, 2)
plt.hist(data, color='#15aabf', edgecolor='white') # No bins argument, so the default is used
plt.title('Default Bins')
plt.xlabel('Value')
# plt.ylabel('Frequency') # Optionally hide y-label for middle plot
# Plot 3: Too many bins
plt.subplot(1, 3, 3)
plt.hist(data, bins=50, color='#40c057', edgecolor='white')
plt.title('Too Many Bins (bins=50)')
plt.xlabel('Value')
# plt.ylabel('Frequency') # Optionally hide y-label
plt.tight_layout() # Adjust layout to prevent overlap
plt.show()
As you can see from the plots generated by the code above:
Too few bins (bins=5): The overall shape is captured, but details are lost.
Default bins: Matplotlib's default number of bins often provides a reasonable starting point.
Too many bins (bins=50): The histogram shows excessive detail and noise, making the underlying pattern harder to discern.
You can control the bins in plt.hist() using the bins argument:
Specify the Number of Bins: Pass an integer to the bins argument. Matplotlib will then create that many bins of equal width spanning the range of your data.
# Create a histogram with exactly 20 bins
plt.hist(data, bins=20, color='#845ef7', edgecolor='black')
plt.title('Histogram with 20 Bins')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
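If you want to confirm that an integer bins value produces equal-width bins spanning the data, you can compute the edges directly. This sketch uses np.histogram_bin_edges(), on the assumption that it mirrors the binning plt.hist() performs for an integer bins argument; the data is regenerated here so the snippet runs on its own.

```python
import numpy as np

# Same recipe as the sample data used in this section
np.random.seed(42)
data = np.random.randn(200) * 1.5 + 5

# Edges for 20 equal-width bins over the range of the data
edges = np.histogram_bin_edges(data, bins=20)
widths = np.diff(edges)

print(len(edges))                          # 21 edges delimit 20 bins
print(np.allclose(widths, widths[0]))      # True when every bin has the same width
print(edges[0], data.min())                # first edge sits at the smallest value
print(edges[-1], data.max())               # last edge sits at the largest value
```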
Specify the Bin Edges: Pass a list or NumPy array defining the exact boundaries (edges) of each bin. This gives you precise control over where the bins start and end. If you provide N edges, you will get N−1 bins.
# Define specific bin edges
bin_edges = [0, 2, 4, 6, 8, 10] # Creates bins: [0,2), [2,4), [4,6), [6,8), [8,10]
plt.hist(data, bins=bin_edges, color='#f76707', edgecolor='black')
plt.title('Histogram with Custom Bin Edges')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.xticks(bin_edges) # Set x-ticks to match bin edges for clarity
plt.show()
Note: The notation [0, 2) means the bin includes 0 but excludes 2. The last bin is the exception: it includes both of its edges.
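The edge rules can be verified with a few hand-picked points. The sketch below uses np.histogram() with the same bin_edges list; the points list is hypothetical and chosen to sit exactly on edges or just outside the range.

```python
import numpy as np

bin_edges = [0, 2, 4, 6, 8, 10]

# Hypothetical test points: on edges, near the top, and outside the range
points = [0, 2, 2, 9.99, 10, 10.5, -1]

counts, _ = np.histogram(points, bins=bin_edges)
print(counts)  # -> [1 2 0 0 2]
```

The value 0 is counted in [0, 2), both 2s are counted in [2, 4) rather than in the first bin, and 10 is counted in the last bin because that bin is closed on the right. The points 10.5 and -1 fall outside every bin and are not counted at all, which is worth remembering when custom edges do not cover the full range of your data.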
So, how many bins should you use? Unfortunately, there's no single perfect answer. It often involves a degree of judgment and depends on:
General Guidelines:
Selecting the right bins is a practical skill gained through experience. Don't be afraid to try different values until the histogram effectively represents your data's underlying pattern.
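One way to get starting points for this kind of experimentation is to let a rule of thumb suggest a bin count. The bins argument also accepts a rule name as a string, such as 'sturges', 'fd' (Freedman-Diaconis), or 'auto'. This goes slightly beyond what the section covers, so treat the sketch below as an optional aid and not a recommendation; it uses np.histogram_bin_edges() to report how many bins each rule proposes for the sample data.

```python
import numpy as np

# Same recipe as the sample data used in this section
np.random.seed(42)
data = np.random.randn(200) * 1.5 + 5

# Ask each rule how many bins it would use for this data
bin_counts = {}
for rule in ("sturges", "fd", "auto"):
    edges = np.histogram_bin_edges(data, bins=rule)
    bin_counts[rule] = len(edges) - 1
    print(f"{rule:>8}: {bin_counts[rule]} bins")
```

Any of these names can be passed straight to the plotting call, for example plt.hist(data, bins='auto'). The suggested counts are only a baseline: compare the result against a few nearby integer values before settling on one.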
© 2025 ApX Machine Learning