While histograms give us a good sense of the overall shape and frequency of our data, sometimes we need a more compact visual summary that highlights specific statistical measures. This is where box plots, also known as box-and-whisker plots, come in handy. They provide a standardized way to display the distribution of data based on a five-number summary: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. They are particularly useful for comparing distributions across different groups.
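To make the five-number summary concrete, here is a minimal sketch that computes it with NumPy. The helper name `five_number_summary` and the sample values are made up for illustration, and `np.percentile` uses linear interpolation by default, so other tools may report slightly different quartiles for small samples.

```python
import numpy as np

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) as plain floats."""
    data = np.asarray(values, dtype=float)
    # 25th, 50th and 75th percentiles, using NumPy's default linear interpolation
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    return float(data.min()), float(q1), float(median), float(q3), float(data.max())

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up values for illustration
print(five_number_summary(sample))  # (1.0, 3.0, 5.0, 7.0, 9.0)
```

These five numbers are the values a box plot draws, with the caveat that many plotting conventions stop the whiskers short of the minimum and maximum when outliers are present.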
Anatomy of a Box Plot
A standard box plot consists of several key components:
- The Box: The central box represents the middle 50% of the data. Its bottom edge marks the first quartile (Q1, the 25th percentile), and its top edge marks the third quartile (Q3, the 75th percentile). The length of the box thus represents the Interquartile Range (IQR), calculated as IQR=Q3−Q1. This range contains the central half of your data points.
- The Median Line: A line inside the box indicates the median (Q2, the 50th percentile) of the data. The position of this line within the box gives an idea of the data's symmetry. If the median is closer to Q1, the quarter of the data between Q1 and the median is packed into a narrower range than the quarter between the median and Q3, and vice versa.
- The Whiskers: Lines, called whiskers, extend outward from the box. A common convention is for each whisker to end at the most extreme data point that still lies within 1.5×IQR of the box:
  - The lower whisker ends at the smallest data point that is at least Q1−1.5×IQR.
  - The upper whisker ends at the largest data point that is at most Q3+1.5×IQR.
  The values Q1−1.5×IQR and Q3+1.5×IQR are often called fences. If no data point lies beyond a fence, the whisker simply reaches the minimum or maximum of the data. Any data point falling outside the fences is considered a potential outlier.
- Outliers: Data points that fall outside the range defined by the whiskers are plotted individually, often as dots or asterisks. These points are flagged for potential investigation as they are unusually far from the central mass of the data.
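The whisker and outlier rules described above can be sketched in a few lines of NumPy. This is an illustration under the common 1.5×IQR convention, not the exact routine any particular plotting library uses; the function name `box_plot_stats`, the `whisker_factor` parameter, and the sample data are assumptions made for this example.

```python
import numpy as np

def box_plot_stats(values, whisker_factor=1.5):
    """Compute box, whisker and outlier values under the 1.5 x IQR convention."""
    data = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1

    # Fences: the furthest a whisker is allowed to reach
    lower_fence = q1 - whisker_factor * iqr
    upper_fence = q3 + whisker_factor * iqr

    # Whiskers end at actual data points inside the fences, not at the fences
    inside = data[(data >= lower_fence) & (data <= upper_fence)]
    outliers = data[(data < lower_fence) | (data > upper_fence)]

    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "iqr": float(iqr),
        "lower_whisker": float(inside.min()),
        "upper_whisker": float(inside.max()),
        "outliers": sorted(outliers.tolist()),
    }

sample = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # made-up values; 30 is deliberately extreme
stats = box_plot_stats(sample)
print(stats["lower_whisker"], stats["upper_whisker"], stats["outliers"])  # 1.0 8.0 [30.0]
```

Here the upper fence is 7 + 1.5 × 4 = 13, but the upper whisker stops at 8 because that is the largest observation below the fence. The value 30 lies beyond the fence and is reported as an outlier.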
Why Use Box Plots?
Box plots offer several advantages for summarizing data:
- Concise Summary: They effectively display the center (median), spread (IQR), and range (whiskers) of the data.
- Outlier Detection: They provide a standard visual method for identifying potential outliers.
- Comparison: Placing box plots side-by-side is an effective way to compare the distributions of different datasets or subgroups within a dataset. You can quickly compare their medians, IQRs, and the presence of outliers.
- Skewness Indication: The position of the median within the box and the relative lengths of the whiskers can give a visual clue about the skewness of the distribution. A median closer to Q1 and a longer upper whisker suggest positive skew, while a median closer to Q3 and a longer lower whisker suggest negative skew.
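The visual skewness clue from the median's position can also be expressed as a number. One option, which is not part of the box plot itself, is the quartile (Bowley) skewness coefficient, (Q3 + Q1 − 2 × median) / IQR. The sketch below assumes NumPy's default percentile interpolation; the function name and sample data are illustrative. It captures only the median's position within the box, not the whisker lengths.

```python
import numpy as np

def quartile_skewness(values):
    """Bowley's quartile skewness: positive when the median sits closer to Q1."""
    data = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero, so quartile skewness is undefined")
    return float((q3 + q1 - 2 * median) / iqr)

symmetric = [1, 2, 3, 4, 5]              # made-up, evenly spaced values
right_skewed = [1, 2, 2, 3, 10, 20, 30]  # made-up values with a long upper tail

print(quartile_skewness(symmetric))         # 0.0
print(quartile_skewness(right_skewed) > 0)  # True
```

The coefficient ranges from −1 to 1. A value near 0 corresponds to a median line near the middle of the box, and a positive value to a median line closer to Q1.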
Creating Box Plots in Python
Python libraries such as Matplotlib, Seaborn, and Plotly make creating box plots straightforward, especially when working with Pandas DataFrames. The example below uses Plotly Express. Let's generate some sample data representing daily temperatures for two cities (City A and City B) and plot them.
import pandas as pd
import numpy as np
import plotly.express as px
# Generate some sample temperature data
np.random.seed(42) # for reproducibility
city_a_temps = np.random.normal(loc=20, scale=5, size=100) # Avg 20C, StdDev 5C
city_b_temps = np.random.normal(loc=25, scale=8, size=100) # Avg 25C, StdDev 8C
# Add a couple of outliers to City A
city_a_temps = np.append(city_a_temps, [3, 45])
# Create a Pandas DataFrame
df = pd.DataFrame({
    'Temperature': np.concatenate([city_a_temps, city_b_temps]),
    'City': ['City A'] * len(city_a_temps) + ['City B'] * len(city_b_temps)
})
# Create the box plot using Plotly Express
fig = px.box(df, x='City', y='Temperature',
             color='City',  # Color boxes by city
             points="outliers",  # Show outliers
             title="Daily Temperature Distribution by City",
             labels={'Temperature': 'Temperature (°C)', 'City': 'City'},
             color_discrete_map={'City A': '#1f77b4', 'City B': '#ff7f0e'}  # Optional custom colors
             )
# To display the plot (in environments like Jupyter):
# fig.show()
Side-by-side box plots comparing daily temperatures for City A and City B. Note the median line, the extent of the box (IQR), the whiskers, and the individually marked outliers for City A.
Interpreting the Box Plots
Looking at the chart generated above:
- Median Comparison: The median temperature for City B (the line inside the orange box) is visibly higher than the median temperature for City A (the line inside the blue box).
- Spread (IQR) Comparison: The box for City B is taller than the box for City A, indicating that the middle 50% of temperatures for City B have a larger spread (higher IQR) than those for City A. City B has more temperature variability in its central range.
- Whiskers and Overall Range: The whiskers show the range of the typical data points (excluding outliers). City B's whiskers cover a wider range overall, again suggesting more variability.
- Outliers: City A shows points plotted individually beyond both of its whiskers. These include the two temperatures we added (3°C and 45°C), which lie far below the lower whisker and far above the upper whisker respectively and might warrant further investigation. A randomly generated value that happens to fall just outside a fence would be marked the same way. City B does not show outliers according to the 1.5 * IQR rule in this sample.
- Skewness: In City A's plot, the median line is roughly centered within the box, and the whiskers are somewhat symmetric (ignoring outliers), suggesting a relatively symmetric distribution for the bulk of the data. City B's median also appears relatively centered.
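The visual reading above can be cross-checked numerically. The sketch below regenerates the same sample data, with the same seed and draw order as the plotting example, and tabulates the quartiles, IQR, fences, and outlier count per city with pandas. The column names are chosen for this illustration, and pandas' default quantile interpolation may differ slightly from the quartile method used by the plotting library, so the numbers need not match the chart's hover values exactly.

```python
import numpy as np
import pandas as pd

# Recreate the sample data (same seed and draw order as the plotting example)
np.random.seed(42)
city_a_temps = np.append(np.random.normal(loc=20, scale=5, size=100), [3, 45])
city_b_temps = np.random.normal(loc=25, scale=8, size=100)
df = pd.DataFrame({
    "Temperature": np.concatenate([city_a_temps, city_b_temps]),
    "City": ["City A"] * len(city_a_temps) + ["City B"] * len(city_b_temps),
})

# Quartiles per city: one row per city, one column per quantile
summary = df.groupby("City")["Temperature"].quantile([0.25, 0.5, 0.75]).unstack()
summary.columns = ["q1", "median", "q3"]
summary["iqr"] = summary["q3"] - summary["q1"]
summary["lower_fence"] = summary["q1"] - 1.5 * summary["iqr"]
summary["upper_fence"] = summary["q3"] + 1.5 * summary["iqr"]

# Count the points beyond each city's own fences
lower = df["City"].map(summary["lower_fence"])
upper = df["City"].map(summary["upper_fence"])
is_outlier = (df["Temperature"] < lower) | (df["Temperature"] > upper)
summary["n_outliers"] = is_outlier.groupby(df["City"]).sum()

print(summary.round(2))
```

Comparing the `median` and `iqr` columns row by row mirrors the side-by-side comparison of the boxes, and `n_outliers` counts the points that would be drawn individually under the 1.5×IQR rule.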
Box plots provide a powerful way to quickly grasp the essential features of a dataset's distribution and are an indispensable tool for exploratory data analysis, especially when comparing groups. They complement the insights gained from measures like the mean and standard deviation, and from visualizations like histograms.