While histograms give us a detailed view of the frequency distribution, box plots (also known as box-and-whisker plots) offer a concise summary of a numerical variable's distribution, focusing on its central tendency, spread, and potential outliers. They are particularly effective for quickly comparing distributions across different categories. Here we focus on the single-variable case, building on the concepts of central tendency and dispersion discussed earlier.
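A box plot visualizes the following key statistics:
- Median (Q2): the line inside the box, marking the 50th percentile.
- First quartile (Q1) and third quartile (Q3): the lower and upper edges of the box, marking the 25th and 75th percentiles.
- Interquartile range (IQR): Q3 - Q1, the length of the box, which covers the middle 50% of the data.
- Whiskers: lines extending from the box to the most extreme data points that lie within 1.5 × IQR of the box edges (Seaborn's default).
- Potential outliers: points beyond the whiskers, drawn individually.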
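Box plots are excellent for getting a quick sense of:
- Central tendency: where the median lies.
- Spread: how long the box (IQR) and the whiskers are.
- Skewness: whether the median sits off-centre in the box or one whisker is noticeably longer than the other.
- Potential outliers: whether any points fall beyond the whiskers.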
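To make these quantities concrete, the sketch below computes them by hand with NumPy. The helper name box_plot_stats and the small sample list are hypothetical, introduced only for illustration, and the 1.5 × IQR whisker rule is assumed (it is the usual default, but plotting libraries let you change it).

```python
import numpy as np

def box_plot_stats(values, whis=1.5):
    """Hypothetical helper: return the statistics a standard box plot draws."""
    x = np.asarray(values, dtype=float)
    # Quartiles using NumPy's default linear interpolation
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    # Fences: points outside these limits count as potential outliers
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    inside = x[(x >= lower_fence) & (x <= upper_fence)]
    outliers = x[(x < lower_fence) | (x > upper_fence)]
    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "iqr": float(iqr),
        # Whiskers stop at the most extreme points still inside the fences
        "whisker_low": float(inside.min()),
        "whisker_high": float(inside.max()),
        "outliers": sorted(outliers.tolist()),
    }

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # illustrative data, not from the text
stats = box_plot_stats(sample)
print(stats)
```

For this sample the value 100 lies above the upper fence, so it is reported as a potential outlier and the upper whisker stops at 9.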
Seaborn provides a straightforward function, sns.boxplot(), to create informative box plots. It integrates well with Pandas DataFrames.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# In practice you would use your own loaded DataFrame (e.g., 'df').
# Here we create some sample data for demonstration.
np.random.seed(42)
data = {
    'Age': np.random.normal(loc=40, scale=10, size=150).astype(int),
    'Salary': np.random.lognormal(mean=np.log(50000), sigma=0.4, size=150)
}
# Introduce a few outliers
data['Salary'][10] = 180000
data['Salary'][50] = 195000
data['Age'][20] = 85
df = pd.DataFrame(data)
df['Age'] = df['Age'].clip(18, 85) # Ensure ages are reasonable
# Create a box plot for the 'Salary' column
plt.figure(figsize=(6, 4)) # Control figure size
sns.boxplot(y=df['Salary'], color='#74c0fc') # Using a blue color from the palette
plt.title('Distribution of Salary')
plt.ylabel('Salary')
plt.grid(axis='y', linestyle='--', alpha=0.7) # Add horizontal grid lines
plt.show()
# Create a box plot for the 'Age' column
plt.figure(figsize=(6, 4))
sns.boxplot(y=df['Age'], color='#69db7c') # Using a green color
plt.title('Distribution of Age')
plt.ylabel('Age')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
Executing this code will generate box plots for the 'Salary' and 'Age' columns.
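Observe the generated plot for 'Salary':
- The line inside the box marks the median. Because the sample is drawn from a log-normal distribution whose median is 50,000, it should sit near that value.
- The box spans Q1 to Q3 (the IQR), covering the middle 50% of salaries.
- The upper whisker is likely to be longer than the lower one, reflecting the right skew of log-normal data.
- The two values injected in the code (180,000 and 195,000) should appear as individual points above the upper whisker, flagged as potential outliers.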
Similarly, analyze the 'Age' plot to understand its distribution, median, spread, and any potential outliers.
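To read exact numbers off a box plot rather than estimating them by eye, one option is Matplotlib's cbook.boxplot_stats helper, which returns the quartiles, whisker ends, and outlier points ("fliers") for a dataset. The sketch below regenerates the Salary sample the same way as the code above (including the Age draw, so the random stream matches). It assumes the plot uses the default 1.5 × IQR whisker rule.

```python
import numpy as np
from matplotlib import cbook

# Recreate the Salary sample; Age is drawn first to mirror the earlier code
np.random.seed(42)
_ = np.random.normal(loc=40, scale=10, size=150)
salary = np.random.lognormal(mean=np.log(50000), sigma=0.4, size=150)
salary[10] = 180000
salary[50] = 195000

# boxplot_stats returns one dict per dataset; we passed a single 1-D array
summary = cbook.boxplot_stats(salary, whis=1.5)[0]

for key in ("whislo", "q1", "med", "q3", "whishi"):
    print(f"{key:>7}: {summary[key]:,.0f}")
print(" fliers:", np.round(summary["fliers"]))
```

The same call works for the Age data. Comparing the printed values with the plot is a quick check that you are reading the box, whiskers, and outlier points correctly.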
Here is an example representation using Plotly for the 'Salary' data:
Box plot summarizing the Salary distribution, highlighting the median, IQR (box), typical range (whiskers), and potential outliers (individual points).
Box plots provide a compact yet powerful way to grasp the essential features of a numerical variable's distribution and are an indispensable tool in the univariate analysis toolkit. They visually flag potential outliers based on the IQR rule, complementing the statistical methods discussed next.