Let's put the concepts from this chapter into practice. We'll take a small dataset, load it using Python's Pandas library, calculate the key descriptive statistics we've discussed, and create visualizations to understand the data's characteristics.
Imagine we have collected data on the daily temperatures (in Celsius) recorded over two weeks for a particular city.
Our Sample Dataset:
Daily Temperatures (°C): [22, 25, 19, 21, 24, 26, 23, 20, 22, 25, 28, 24, 21, 23]
1. Setting Up and Loading Data
First, ensure you have Pandas installed (pip install pandas). We'll use it to manage and analyze our data efficiently. Let's load our temperature data into a Pandas Series, which is like a single column of data.
import pandas as pd
import numpy as np # Often useful alongside Pandas
# Our temperature data
temperatures_c = [22, 25, 19, 21, 24, 26, 23, 20, 22, 25, 28, 24, 21, 23]
# Create a Pandas Series
temp_series = pd.Series(temperatures_c, name="Daily Temperature (C)")
# Display the series
print(temp_series)
2. Calculating Measures of Central Tendency
Now, let's find the typical temperature using mean, median, and mode. Pandas provides straightforward methods for this.
# Calculate Mean
mean_temp = temp_series.mean()
print(f"Mean Temperature: {mean_temp:.2f} °C")
# Calculate Median
median_temp = temp_series.median()
print(f"Median Temperature: {median_temp:.2f} °C")
# Calculate Mode
# Note: .mode() returns a Series because several values can tie for most frequent.
# We print all of them rather than only the first.
mode_temp = temp_series.mode()
if not mode_temp.empty:
    print(f"Mode Temperature(s): {list(mode_temp)} °C")
else:
    print("No mode found.")
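To see what these methods compute, the same three measures can be reproduced with plain Python. This is only an illustrative sketch. The names `manual_mean`, `manual_median`, and `manual_modes` are made up for this example and are not part of Pandas.

```python
from collections import Counter

temperatures_c = [22, 25, 19, 21, 24, 26, 23, 20, 22, 25, 28, 24, 21, 23]
n = len(temperatures_c)

# Mean: sum of all values divided by the number of values (323 / 14)
manual_mean = sum(temperatures_c) / n

# Median: middle of the sorted values; with an even count,
# average the two middle values
ordered = sorted(temperatures_c)
mid = n // 2
if n % 2 == 0:
    manual_median = (ordered[mid - 1] + ordered[mid]) / 2
else:
    manual_median = ordered[mid]

# Mode(s): every value that reaches the highest frequency
counts = Counter(temperatures_c)
highest = max(counts.values())
manual_modes = sorted(value for value, count in counts.items() if count == highest)

print(f"Mean: {manual_mean:.2f} °C")
print(f"Median: {manual_median:.2f} °C")
print(f"Mode(s): {manual_modes} °C")
```

In this dataset five values (21, 22, 23, 24, and 25) each occur twice, so the mode is not very informative here. The mean and median are the more useful summaries.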
3. Calculating Measures of Dispersion
How much do the temperatures vary day-to-day? Let's calculate the range, variance, and standard deviation.
# Calculate Range
temp_range = temp_series.max() - temp_series.min()
print(f"Temperature Range: {temp_range} °C")
# Calculate Variance
variance_temp = temp_series.var() # Uses N-1 denominator by default (sample variance)
print(f"Temperature Variance: {variance_temp:.2f} °C^2")
# Calculate Standard Deviation
std_dev_temp = temp_series.std() # Uses N-1 denominator by default (sample standard deviation)
print(f"Temperature Standard Deviation: {std_dev_temp:.2f} °C")
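The N-1 denominator mentioned in the comments is controlled by the `ddof` ("delta degrees of freedom") argument of `.var()` and `.std()`. The sketch below compares the default sample statistics with their population counterparts, assuming you wanted to treat these 14 days as the entire population rather than a sample. The variable names are chosen for this example only.

```python
import pandas as pd

temp_series = pd.Series(
    [22, 25, 19, 21, 24, 26, 23, 20, 22, 25, 28, 24, 21, 23],
    name="Daily Temperature (C)",
)

# Default: ddof=1, i.e. divide by N-1 (sample statistics)
sample_var = temp_series.var()
sample_std = temp_series.std()

# ddof=0: divide by N (population statistics)
population_var = temp_series.var(ddof=0)
population_std = temp_series.std(ddof=0)

print(f"Sample variance:     {sample_var:.2f} °C^2, std: {sample_std:.2f} °C")
print(f"Population variance: {population_var:.2f} °C^2, std: {population_std:.2f} °C")
```

With only 14 observations the two versions differ noticeably, which is why it is worth stating which one you report.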
4. Calculating Percentiles and Quartiles
Let's find the quartiles to better understand the data's distribution.
# Calculate Quartiles (25th, 50th, 75th percentiles)
quartiles = temp_series.quantile([0.25, 0.50, 0.75])
print("\nQuartiles:")
print(quartiles)
# Calculate the Interquartile Range (IQR)
q1 = quartiles[0.25]
q3 = quartiles[0.75]
iqr = q3 - q1
print(f"\nInterquartile Range (IQR): {iqr:.2f} °C")
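The quartiles above are not always values that appear in the data, because `.quantile()` interpolates linearly between neighboring sorted values by default. The sketch below reproduces that calculation by hand. `linear_quantile` is a hypothetical helper written for this illustration, not a Pandas function.

```python
import pandas as pd

temp_series = pd.Series(
    [22, 25, 19, 21, 24, 26, 23, 20, 22, 25, 28, 24, 21, 23],
    name="Daily Temperature (C)",
)
ordered = temp_series.sort_values().to_list()


def linear_quantile(values, q):
    """Quantile of pre-sorted values using linear interpolation."""
    pos = q * (len(values) - 1)          # fractional position in the sorted list
    lower = int(pos)                     # index just below that position
    upper = min(lower + 1, len(values) - 1)
    frac = pos - lower                   # how far we are toward the upper value
    return values[lower] + frac * (values[upper] - values[lower])


for q in (0.25, 0.50, 0.75):
    by_hand = linear_quantile(ordered, q)
    by_pandas = temp_series.quantile(q)
    print(f"q={q:.2f}: by hand = {by_hand:.2f}, pandas = {by_pandas:.2f}")
```

For the first quartile the position is 0.25 × 13 = 3.25, which lies a quarter of the way between the sorted values 21 and 22, giving 21.25. If you only need a quick overview, `temp_series.describe()` reports the same quartiles alongside the count, mean, standard deviation, minimum, and maximum.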
5. Visualizing the Data
Visualizations often provide insights that raw numbers alone cannot. Let's create a histogram and a box plot. We'll use Plotly for interactive web-based plots.
Histogram
A histogram shows the frequency of temperatures falling into different bins.
Histogram showing the counts of days within specific temperature ranges.
Box Plot
A box plot provides a compact summary of the distribution, showing the median, quartiles, and potential outliers.
Box plot illustrating the median (orange line), quartiles (box edges), range (whiskers), and individual data points.
Summary
By applying these descriptive statistics and visualizations, we've turned a simple list of temperatures into a meaningful summary. We now know the typical temperature (a mean of about 23.1 °C and a median of 23 °C), the variability (a sample standard deviation of about 2.46 °C and an interquartile range of 3.5 °C), and the overall shape of the distribution. Summarizing data in this way is a fundamental first step in any data analysis or machine learning task. As you encounter larger and more complex datasets, these foundational techniques will remain essential for gaining initial insights.
© 2025 ApX Machine Learning