Alright, let's put the concepts from this chapter into practice. We'll apply the univariate analysis techniques you've learned to a real dataset. This exercise will help solidify your understanding of how to calculate descriptive statistics and create visualizations for both numerical and categorical variables using Python libraries.
We'll use the 'penguins' dataset, which is conveniently available through the Seaborn library. This dataset contains measurements for different penguin species. First, ensure you have Seaborn and Pandas installed and imported. Then, load the dataset:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
penguins_df = sns.load_dataset('penguins')
# Display basic info and first few rows to understand the data
print("Dataset Information:")
penguins_df.info()
print("\nFirst 5 Rows:")
print(penguins_df.head())
# Handle missing values simply for this exercise
# (In a real scenario, you'd use the techniques from Chapter 2)
penguins_df.dropna(inplace=True)
print("\nDataset Information after dropping NaNs:")
penguins_df.info()
This initial output shows us the columns, their data types (like float64 for numerical measurements, object for categorical strings), and confirms the presence of some missing values which we've removed for simplicity in this practice section.
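Before dropping rows, it can help to see how many values are missing in each column and how many rows a blanket dropna() would remove. The sketch below uses a small made-up DataFrame (toy_df and its values are hypothetical, not taken from the penguins data) so that it runs without downloading anything:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few penguin rows; all values are made up.
toy_df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "flipper_length_mm": [181.0, np.nan, 217.0, 195.0],
    "sex": ["Male", None, "Female", None],
})

# Number of missing values in each column, before dropping anything
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Number of rows a blanket dropna() would remove
rows_before = len(toy_df)
rows_after = len(toy_df.dropna())
print(f"Rows dropped: {rows_before - rows_after} of {rows_before}")
```

The same two checks can be run on penguins_df before the dropna() call to judge whether dropping rows is an acceptable simplification.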
Analyzing a Numerical Variable: flipper_length_mm
Let's examine the flipper length of the penguins.
We can get a quick statistical summary using the .describe() method on this specific column (a Series).
# Calculate descriptive statistics for flipper_length_mm
flipper_stats = penguins_df['flipper_length_mm'].describe()
print("\nDescriptive Statistics for Flipper Length (mm):")
print(flipper_stats)
# Calculate median separately (often useful)
flipper_median = penguins_df['flipper_length_mm'].median()
print(f"\nMedian Flipper Length: {flipper_median} mm")
# Calculate skewness
flipper_skew = penguins_df['flipper_length_mm'].skew()
print(f"Skewness of Flipper Length: {flipper_skew:.2f}")
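Interpretation: The output from .describe() provides the count, mean, standard deviation (the sample standard deviation, since pandas divides by n − 1 by default), minimum, maximum, and quartile values (25th percentile or Q1, 50th percentile or median, 75th percentile or Q3). The median is a measure of center that is resistant to outliers; it equals the 50% row of .describe(), so computing it separately is a convenience. The skewness value gives a quick check on the distribution's shape: close to 0 indicates approximate symmetry, positive means right-skewed, and negative means left-skewed. Here, a skewness fairly close to 0 suggests no strong asymmetry in flipper lengths after dropping rows with missing data. Keep in mind that a single skewness number cannot reveal features such as multiple peaks, so it is worth checking a plot as well.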
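To make these summary numbers less of a black box, the following sketch recomputes the standard deviation and skewness by hand and compares them with the pandas results. The ten values are made up for illustration. The skewness formula shown is the adjusted (bias-corrected) sample skewness, which is the form pandas' .skew() is understood to use; treat that as an assumption that the comparison itself checks.

```python
import numpy as np
import pandas as pd

# Made-up flipper lengths in mm, for illustration only
values = pd.Series([172.0, 181.0, 190.0, 193.0, 197.0,
                    201.0, 209.0, 213.0, 222.0, 231.0])
n = len(values)
deviations = values - values.mean()

# Sample standard deviation: divide the squared deviations by n - 1
manual_std = np.sqrt((deviations ** 2).sum() / (n - 1))

# Second and third central moments (divide by n)
m2 = (deviations ** 2).mean()
m3 = (deviations ** 3).mean()

# Adjusted sample skewness: bias-correction factor times m3 / m2^1.5
manual_skew = np.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5

print("std matches pandas: ", np.isclose(manual_std, values.std()))
print("skew matches pandas:", np.isclose(manual_skew, values.skew()))
print("median equals 50% row:", values.median() == values.describe()["50%"])
```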
A histogram is excellent for visualizing the distribution's shape, central tendency, and spread.
# Set plot style (optional, for aesthetics)
sns.set_style("whitegrid")
# Create a histogram for flipper_length_mm
plt.figure(figsize=(8, 5))
sns.histplot(data=penguins_df, x='flipper_length_mm', kde=True, bins=15, color='#4dabf7')
plt.title('Distribution of Penguin Flipper Lengths')
plt.xlabel('Flipper Length (mm)')
plt.ylabel('Frequency')
plt.show()
Histogram showing the frequency distribution of penguin flipper lengths. The curve (KDE) provides a smooth estimate of the distribution.
Interpretation: The histogram shows where flipper lengths concentrate and how widely they spread, and it lets you check the shape suggested by the skewness value. Look at whether the bars form a single bell-shaped peak or more than one peak; a dataset that mixes several species can produce multiple peaks that summary statistics alone do not reveal. The kde=True argument adds a Kernel Density Estimate curve, providing a smoothed outline of the distribution. bins=15 sets the number of bars; experimenting with this number can reveal or hide features of the distribution.
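The effect of the bin count can be explored numerically without drawing anything. This sketch uses NumPy's histogram function on a synthetic two-group sample (the group means, spreads, and sizes are made-up assumptions, not estimates from the penguins data) and also asks NumPy for a rule-based suggestion using the Freedman-Diaconis rule:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sample with two groups; parameters are made up for illustration
lengths = np.concatenate([
    rng.normal(190, 6, 200),
    rng.normal(217, 6, 120),
])

# Same data, three different bin counts
for n_bins in (5, 15, 40):
    counts, edges = np.histogram(lengths, bins=n_bins)
    width = edges[1] - edges[0]
    print(f"bins={n_bins:>2}  bin width={width:.1f}  tallest bar={counts.max()}")

# A rule-based starting point instead of a hand-picked number
fd_edges = np.histogram_bin_edges(lengths, bins="fd")
print(f"Freedman-Diaconis rule suggests {len(fd_edges) - 1} bins")
```

With too few bins, separate groups can merge into one broad bar; with too many, the bars become noisy. Trying a few values, as in the loop above, is a reasonable way to pick one.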
Box plots are effective for comparing the summary statistics (median, quartiles, range) and identifying potential outliers based on the IQR rule.
# Create a box plot for flipper_length_mm
plt.figure(figsize=(6, 4))
sns.boxplot(data=penguins_df, x='flipper_length_mm', color='#96f2d7')
plt.title('Box Plot of Penguin Flipper Lengths')
plt.xlabel('Flipper Length (mm)')
plt.show()
Box plot summarizing the distribution of penguin flipper lengths. The box represents the IQR (Q1 to Q3), the line inside is the median, and each whisker extends to the most extreme data point within 1.5 × IQR of the box. Points beyond the whiskers are potential outliers.
Interpretation: The box plot clearly shows the median (around 197 mm), the IQR (the box itself, roughly 190 mm to 213 mm based on .describe()), and the overall range indicated by the whiskers. After dropping NaNs, no points are drawn beyond the whiskers, so no values count as outliers under the standard IQR rule (above Q3 + 1.5 × IQR or below Q1 − 1.5 × IQR).
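The IQR rule that the box plot applies visually can also be computed directly. The sketch below uses a small made-up sample with one deliberately extreme value (290) so that the rule has something to flag; the variable names are illustrative:

```python
import pandas as pd

# Made-up sample in mm; 290 is planted as an extreme value
sample = pd.Series([172, 181, 190, 193, 197, 201, 209, 213, 222, 231, 290],
                   dtype=float)

q1 = sample.quantile(0.25)
q3 = sample.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside them is flagged as a potential outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

# Whiskers end at the most extreme points that are still inside the fences
inside = sample[(sample >= lower_fence) & (sample <= upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Whiskers reach from {inside.min()} to {inside.max()}")
print(f"Potential outliers: {outliers.tolist()}")
```

Running the same steps on penguins_df['flipper_length_mm'] is a way to confirm numerically what the box plot shows.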
Analyzing a Categorical Variable: species
Now let's investigate the species column.
For categorical data, we want to know how many observations fall into each category.
# Calculate frequency counts for species
species_counts = penguins_df['species'].value_counts()
print("\nFrequency Counts for Species:")
print(species_counts)
# Calculate proportions (percentages)
species_proportions = penguins_df['species'].value_counts(normalize=True) * 100
print("\nProportions (%) for Species:")
print(species_proportions)
Interpretation: The .value_counts() method lists each unique species and the number of penguins belonging to it. Setting normalize=True converts these counts into proportions (or percentages when multiplied by 100), showing the relative frequency of each species in the dataset. We see the Adelie species is the most common in this cleaned dataset.
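One detail worth knowing is that value_counts() leaves out missing values unless told otherwise, which affects both counts and proportions. A minimal sketch on a made-up Series (the labels are real species names, but the data is invented for illustration):

```python
import pandas as pd

# Made-up categorical column with one missing entry
species = pd.Series(["Adelie", "Gentoo", "Adelie", None,
                     "Chinstrap", "Adelie", "Gentoo"])

# Default: missing values are excluded from the tally
counts = species.value_counts()

# dropna=False adds a row for the missing values
counts_with_missing = species.value_counts(dropna=False)

# Proportions are computed over the non-missing values and sum to 1
shares = species.value_counts(normalize=True)

print(counts)
print(counts_with_missing)
print((shares * 100).round(1))
```

Because the practice dataset had its missing rows dropped earlier, the two variants would agree there; the difference matters when working with uncleaned data.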
Bar charts are ideal for comparing the frequencies of different categories.
# Create a bar chart for species counts
plt.figure(figsize=(7, 5))
# Use countplot for direct counting and plotting
sns.countplot(data=penguins_df, x='species', palette=['#ff8787', '#74c0fc', '#74b816'], order=species_counts.index)
plt.title('Number of Penguins per Species')
plt.xlabel('Species')
plt.ylabel('Count')
plt.show()
Bar chart displaying the count of penguins for each species found in the dataset.
Interpretation: The bar chart provides an immediate visual comparison of the species counts, reinforcing that the Adelie species has the highest representation, followed by Gentoo, and then Chinstrap in our processed data. The order=species_counts.index argument ensures the bars are plotted in descending order of frequency, which is often helpful for readability. Using a specific palette allows controlling the colors.
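The bar order comes entirely from the list of category labels, so other orderings are easy to build from the same counts. This sketch, on a made-up Series, produces three candidate lists that could each be passed as the order argument of a count plot:

```python
import pandas as pd

# Made-up species labels for illustration
species = pd.Series(["Gentoo", "Adelie", "Chinstrap",
                     "Adelie", "Gentoo", "Adelie"])

counts = species.value_counts()

# value_counts() sorts by frequency, largest first
descending_order = counts.index.tolist()

# Smallest first
ascending_order = counts.sort_values().index.tolist()

# Alphabetical, useful when the same categories appear in several charts
alphabetical_order = sorted(counts.index)

print(descending_order)
print(ascending_order)
print(alphabetical_order)
```

Descending order makes ranking easy to read, while a fixed alphabetical order keeps categories in the same position across multiple plots.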
In this practice session, you applied univariate analysis techniques to both numerical (flipper_length_mm) and categorical (species) variables from the penguins dataset. You calculated essential descriptive statistics and generated histograms, box plots, and bar charts to visualize their distributions and frequencies. This process of examining variables one by one is fundamental to understanding your data's basic characteristics before looking at relationships between variables.
© 2025 ApX Machine Learning