Summary statistics like the mean, median, or standard deviation give you valuable numerical summaries of your data, but they don't paint the complete picture. Two datasets could have identical means and standard deviations yet look vastly different in terms of how their values are spread out. This is where data visualization comes in, and the histogram is one of the most fundamental and useful tools for understanding the distribution of numerical data.
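To make this concrete, here is a minimal sketch using two small made-up datasets (the values and the names `two_clusters` and `one_peak` are illustrative assumptions, not data from this lesson). Both have a mean of 3 and a population standard deviation of 2, yet counting how often each value occurs shows very different shapes.

```python
import statistics
from collections import Counter

# Two small made-up datasets (illustrative values only)
two_clusters = [1, 1, 1, 1, 5, 5, 5, 5]   # values split into two groups
one_peak = [-1, 3, 3, 3, 3, 3, 3, 7]      # values piled up at 3, plus two extremes

for name, data in [("two_clusters", two_clusters), ("one_peak", one_peak)]:
    mean = statistics.mean(data)
    std = statistics.pstdev(data)  # population standard deviation
    print(f"{name}: mean={mean}, std={std}")
    # Count how often each value occurs: a crude text histogram
    for value, count in sorted(Counter(data).items()):
        print(f"  {value:>2}: {'#' * count}")
```

The summary numbers match, but the text bars do not: one dataset has two equal clusters, the other has a single tall peak with two isolated points.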
A histogram provides a visual representation of the frequency distribution of a dataset. Think of it as a bar chart, but instead of representing categories, the bars represent ranges of numerical values, often called bins or intervals. The height of each bar shows how many data points fall into that specific range.
Histograms are excellent for getting a quick sense of your data's underlying structure. They help you see the shape, center, and spread of a distribution at a glance.
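Creating a histogram involves dividing the range of the data into bins, counting how many data points fall into each bin, and drawing a bar for each bin whose height equals that count.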
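The binning procedure (split the range into bins, count the values in each bin, draw one bar per bin) can be sketched in plain Python so that each step is visible. The values, the choice of three equal-width bins, and the variable names are assumptions made for illustration; plotting libraries handle all of this for you.

```python
# Made-up illustrative values (not data from this lesson)
values = [2, 3, 5, 7, 8, 8, 9, 12, 13, 15, 18, 20]
num_bins = 3

# Step 1: split the range [min, max] into equal-width bins
lo, hi = min(values), max(values)
width = (hi - lo) / num_bins
edges = [lo + i * width for i in range(num_bins + 1)]

# Step 2: count how many values fall into each bin
counts = [0] * num_bins
for v in values:
    # The maximum value would land one past the last bin, so clamp the index
    idx = min(int((v - lo) / width), num_bins - 1)
    counts[idx] += 1

# Step 3: draw one bar per bin (text bars stand in for a real plot)
for i, c in enumerate(counts):
    print(f"{edges[i]:5.1f} to {edges[i + 1]:5.1f}: {'#' * c}")
```

Here the bins cover 2 to 8, 8 to 14, and 14 to 20, and the counts come out as 4, 5, and 3. A value that sits exactly on an interior edge (such as 8) is counted in the bin to its right.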
Let's create a histogram using Python. We'll use NumPy to generate some sample data representing, for example, the heights (in cm) of a group of people, and Matplotlib (specifically its pyplot module, often imported as plt) to plot the histogram.
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
# Generate some sample data (e.g., heights in cm)
# Normally distributed around 170cm with a standard deviation of 10cm
np.random.seed(42) # for reproducibility
heights = np.random.normal(loc=170, scale=10, size=200)
# --- Using Matplotlib (Common for direct plotting) ---
plt.figure(figsize=(8, 5)) # Set the figure size
plt.hist(heights, bins=10, color='#228be6', edgecolor='white') # Create histogram with 10 bins
plt.title('Distribution of Heights')
plt.xlabel('Height (cm)')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)
# plt.show() # Uncomment this line if running locally to display the plot
# --- Generating Plotly JSON for web embedding ---
# Calculate histogram data using NumPy
counts, bin_edges = np.histogram(heights, bins=10)
bin_centers = 0.5 * (bin_edges[1:] + bin_edges[:-1]) # Center of bins for plotting
# Create Plotly figure object
fig = go.Figure(data=[go.Bar(
    x=bin_centers,
    y=counts,
    width=np.diff(bin_edges)[0]*0.95, # Adjust width slightly less than bin width
    marker_color='#228be6',
    marker_line_color='white',
    marker_line_width=1
)])
fig.update_layout(
    title_text='Distribution of Heights',
    xaxis_title_text='Height (cm)',
    yaxis_title_text='Frequency',
    bargap=0.05, # Small gap for visual clarity if needed, ideally 0 for true histogram
    plot_bgcolor='#e9ecef',
    yaxis=dict(gridcolor='#adb5bd'),
    margin=dict(l=20, r=20, t=40, b=20), # Concise margin
    width=600, # Adjusted width for web display
    height=400 # Adjusted height for web display
)
# Generate the JSON string (in a real web app, you'd pass the fig object)
plotly_json_string = fig.to_json()
# For display purposes here, we'll print it in the required format
# Usually you would embed this JSON directly or use a library function
# print(f"```plotly\n{plotly_json_string}\n```") # How you'd format for output
A histogram showing the frequency distribution of 200 simulated height measurements. The x-axis shows height ranges (bins), and the y-axis shows the number of people in each range.
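As a quick sanity check on what `np.histogram` returns, the sketch below bins 200 simulated heights and inspects the result. The seed and the use of `np.random.default_rng` are arbitrary choices for this illustration, so the individual counts will differ from the figure above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
data = rng.normal(loc=170, scale=10, size=200)

counts, bin_edges = np.histogram(data, bins=10)

# 10 bins are described by 11 edges
print(len(counts), len(bin_edges))

# By default the bins span min(data) to max(data), so every point is counted
print(counts.sum())

# With an integer `bins`, all bins have the same width
widths = np.diff(bin_edges)
print(np.allclose(widths, widths[0]))
```

Because there is one more edge than there are bins, the plotting code averages neighboring edges to get one center per bar.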
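Looking at the histogram generated above, the bars are tallest near 170 cm and get shorter toward both ends, which is what we would expect from data drawn from a normal distribution centered at 170 cm with a standard deviation of 10 cm. The shape is roughly symmetric, with no long tail on either side.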
If the histogram had a long tail extending to the right (higher values), we'd call it right-skewed. If the tail extended to the left (lower values), it would be left-skewed. Skewness can affect the relationship between the mean and median. For instance, in a right-skewed distribution, the mean is typically greater than the median.
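A small worked example shows how a right tail pulls the mean above the median. The values below are made up for illustration: most sit between 1 and 5, with a single large value of 20 forming the tail.

```python
import statistics

# Made-up values with a long right tail (illustrative only)
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]

mean_value = statistics.mean(right_skewed)      # sum is 47, so the mean is 4.7
median_value = statistics.median(right_skewed)  # middle two values are both 3

print(f"mean = {mean_value}, median = {median_value}")
```

The single value of 20 raises the mean to 4.7, while the median stays at 3 because it depends only on the middle of the sorted data.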
Histograms are a powerful first step in exploring your numerical data. They complement the summary statistics learned earlier by providing a visual context, helping you understand the shape, center, and spread in a way that numbers alone cannot convey. They are an essential tool for data exploration and preparing data for machine learning models.
© 2025 ApX Machine Learning