Let's put the concepts from this chapter into practice. We'll take a small dataset, load it using Python's Pandas library, calculate the key descriptive statistics we've discussed, and create visualizations to understand the data's characteristics.

Imagine we have collected data on the daily temperatures (in Celsius) recorded over two weeks for a particular city.

Our Sample Dataset:

Daily Temperatures (°C): [22, 25, 19, 21, 24, 26, 23, 20, 22, 25, 28, 24, 21, 23]

1. Setting Up and Loading Data

First, ensure you have Pandas installed (pip install pandas). We'll use it to manage and analyze our data efficiently. Let's load our temperature data into a Pandas Series, which is like a single column of data.

import pandas as pd
import numpy as np # Often useful alongside Pandas

# Our temperature data
temperatures_c = [22, 25, 19, 21, 24, 26, 23, 20, 22, 25, 28, 24, 21, 23]

# Create a Pandas Series
temp_series = pd.Series(temperatures_c, name="Daily Temperature (C)")

# Display the series
print(temp_series)

2. Calculating Measures of Central Tendency

Now, let's find the typical temperature using mean, median, and mode. Pandas provides straightforward methods for this.

# Calculate Mean
mean_temp = temp_series.mean()
print(f"Mean Temperature: {mean_temp:.2f} °C")

# Calculate Median
median_temp = temp_series.median()
print(f"Median Temperature: {median_temp:.2f} °C")

# Calculate Mode
# Note: .mode() returns a Series because several values can tie for most frequent.
# We print every mode rather than only the first one.
mode_temp = temp_series.mode()
if not mode_temp.empty:
    print(f"Mode Temperature(s): {list(mode_temp)} °C")
else:
    # .mode() is empty only when the Series has no non-missing values
    print("No mode found: the series has no values.")

Interpretation: The mean gives the average temperature ( $\approx 23.07^\circ C$ ). The median ( $23.0^\circ C$ ) is the middle value when the data is sorted, and it is less affected by extreme values than the mean. The modes ( $[21, 22, 23, 24, 25]^\circ C$ ) are the most frequently occurring temperatures. In this case, five different temperatures each appear twice, so no single value dominates and the distribution is relatively flat in its central part. The mean and median are very close, suggesting the data distribution is fairly symmetric.
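To see where these three numbers come from, here is a minimal sketch that recomputes them without Pandas, using only plain Python and the standard library's statistics module. The variable names are illustrative, and this is one possible way to check the results by hand rather than how Pandas computes them internally.

```python
import statistics

temperatures_c = [22, 25, 19, 21, 24, 26, 23, 20, 22, 25, 28, 24, 21, 23]

# Mean: sum of the values divided by how many there are
mean_temp = sum(temperatures_c) / len(temperatures_c)

# Median: middle of the sorted values
# (for an even count, the average of the two middle values)
ordered = sorted(temperatures_c)
n = len(ordered)
mid = n // 2
median_temp = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

# Modes: every value tied for the highest count, sorted for readability
modes = sorted(statistics.multimode(temperatures_c))

print(f"Mean: {mean_temp:.2f} °C")
print(f"Median: {median_temp} °C")
print(f"Modes: {modes} °C")
```

Because the dataset has an even number of values, the median is the average of the 7th and 8th sorted values, which are both 23 here.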

3. Calculating Measures of Dispersion

How much do the temperatures vary day-to-day? Let's calculate the range, variance, and standard deviation.

# Calculate Range
temp_range = temp_series.max() - temp_series.min()
print(f"Temperature Range: {temp_range} °C")

# Calculate Variance
variance_temp = temp_series.var() # Uses N-1 denominator by default (sample variance)
print(f"Temperature Variance: {variance_temp:.2f} °C^2")

# Calculate Standard Deviation
std_dev_temp = temp_series.std() # Uses N-1 denominator by default (sample standard deviation)
print(f"Temperature Standard Deviation: {std_dev_temp:.2f} °C")

Interpretation: The range ( $9^\circ C$ ) tells us the difference between the hottest and coldest days in our sample. The variance ( $\approx 6.07^\circ C^2$ ) and standard deviation ( $\approx 2.46^\circ C$ ) quantify how far the data points typically spread around the mean. A standard deviation of 2.46 suggests that most daily temperatures fall roughly within $23.07 \pm 2.46^\circ C$ ; here, 10 of the 14 days do.
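The N-1 denominator mentioned in the code comments is easy to overlook, so the following sketch recomputes the variance from its definition and compares it with Pandas' ddof parameter (ddof=1 for the sample variance, ddof=0 for the population variance). The intermediate variable names are illustrative only.

```python
import pandas as pd

temp_series = pd.Series([22, 25, 19, 21, 24, 26, 23, 20, 22, 25, 28, 24, 21, 23])

n = len(temp_series)
mean_temp = temp_series.mean()
squared_deviations = (temp_series - mean_temp) ** 2

# Sample variance divides by N-1; population variance divides by N
sample_var = squared_deviations.sum() / (n - 1)
population_var = squared_deviations.sum() / n

# The manual results should agree with Pandas' ddof settings
assert abs(sample_var - temp_series.var(ddof=1)) < 1e-9
assert abs(population_var - temp_series.var(ddof=0)) < 1e-9

# Share of days within one sample standard deviation of the mean
sample_std = sample_var ** 0.5
within_one_std = ((temp_series - mean_temp).abs() <= sample_std).mean()

print(f"Sample variance (N-1): {sample_var:.2f} °C^2")
print(f"Population variance (N): {population_var:.2f} °C^2")
print(f"Sample standard deviation: {sample_std:.2f} °C")
print(f"Share of days within one standard deviation: {within_one_std:.0%}")
```

With only 14 observations the two denominators give noticeably different answers, which is why it matters to know which one a library uses by default.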

4. Calculating Percentiles and Quartiles

Let's find the quartiles to better understand the data's distribution.

# Calculate Quartiles (25th, 50th, 75th percentiles)
quartiles = temp_series.quantile([0.25, 0.50, 0.75])
print("\nQuartiles:")
print(quartiles)

# Calculate the Interquartile Range (IQR)
q1 = quartiles[0.25]
q3 = quartiles[0.75]
iqr = q3 - q1
print(f"\nInterquartile Range (IQR): {iqr:.2f} °C")

Interpretation:
- Q1 (25th percentile) is $21.25^\circ C$ . About 25% of the days had temperatures at or below this value.
- Q2 (50th percentile) is $23.0^\circ C$ , which is the same as the median, as expected.
- Q3 (75th percentile) is $24.75^\circ C$ . About 75% of the days had temperatures at or below this value.
- The IQR ( $3.5^\circ C$ ) represents the spread of the middle 50% of the data.
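Values such as 21.25 do not appear in the dataset because Pandas' quantile method interpolates linearly between neighboring sorted values by default. The sketch below reproduces that calculation in plain Python; linear_quantile is a hypothetical helper written for this illustration, not a Pandas function.

```python
import math

def linear_quantile(values, q):
    """Quantile by linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q  # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

temperatures_c = [22, 25, 19, 21, 24, 26, 23, 20, 22, 25, 28, 24, 21, 23]

q1 = linear_quantile(temperatures_c, 0.25)
q2 = linear_quantile(temperatures_c, 0.50)
q3 = linear_quantile(temperatures_c, 0.75)

print(f"Q1: {q1} °C, Q2: {q2} °C, Q3: {q3} °C, IQR: {q3 - q1} °C")
```

For Q1, the fractional index is 13 × 0.25 = 3.25, which lies a quarter of the way between the 4th sorted value (21) and the 5th (22), giving 21.25.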

5. Visualizing the Data

Visualizations often provide insights that raw numbers alone cannot. Let's create a histogram and a box plot. We'll use Plotly for interactive web-based plots.

Histogram

A histogram shows the frequency of temperatures falling into different bins.

Histogram showing the counts of days within specific temperature ranges.

Interpretation: The histogram visually confirms our earlier observations. We see peaks around the 21-25°C range, matching the modes we calculated. The distribution looks roughly bell-shaped, though slightly spread out, consistent with the mean and median being close.
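The counts behind a histogram can also be inspected directly. As a rough sketch, the snippet below bins the temperatures with NumPy and prints a text version of the bars. The 2 °C bin edges are an assumption chosen for this illustration; the figure described above may use different bins.

```python
import numpy as np

temperatures_c = [22, 25, 19, 21, 24, 26, 23, 20, 22, 25, 28, 24, 21, 23]

# Assumed bin edges: 2 °C wide bins. Each bin includes its left edge;
# the last bin also includes its right edge.
bin_edges = [19, 21, 23, 25, 27, 29]
counts, _ = np.histogram(temperatures_c, bins=bin_edges)
counts = counts.tolist()  # plain Python ints for easy printing

# Print one text "bar" per bin
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left}-{right} °C | {'#' * count} ({count})")
```

Changing the bin edges changes the shape of the histogram, so it is worth trying a few bin widths before drawing conclusions about the distribution.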

Box Plot

A box plot provides a compact summary of the distribution, showing the median, quartiles, and potential outliers.

Box plot illustrating the median (orange line), quartiles (box edges), range (whiskers), and individual data points.

Interpretation: The box plot clearly shows the median ( $23^\circ C$ ), the IQR (the box itself spans from $21.25^\circ C$ to $24.75^\circ C$ ), and the overall range via the whiskers (extending from the minimum $19^\circ C$ to the maximum $28^\circ C$ ). The individual points are also plotted. Under the standard 1.5 * IQR rule, a point counts as an outlier if it falls below $21.25 - 1.5 \times 3.5 = 16^\circ C$ or above $24.75 + 1.5 \times 3.5 = 30^\circ C$ . No temperature in our sample does, so the whiskers extend all the way to the minimum and maximum values.
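The 1.5 * IQR rule can be checked numerically. The sketch below computes the outlier fences and the whisker ends from the Pandas quartiles; the variable names are illustrative, and a plotting library may use a slightly different quartile convention than Pandas' default.

```python
import pandas as pd

temp_series = pd.Series([22, 25, 19, 21, 24, 26, 23, 20, 22, 25, 28, 24, 21, 23])

q1 = temp_series.quantile(0.25)
q3 = temp_series.quantile(0.75)
iqr = q3 - q1

# Fences: points beyond these limits are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = temp_series[(temp_series < lower_fence) | (temp_series > upper_fence)]

# Whiskers end at the most extreme data points that are still inside the fences
inside = temp_series[(temp_series >= lower_fence) & (temp_series <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

print(f"Fences: {lower_fence} °C to {upper_fence} °C")
print(f"Outliers: {list(outliers)}")
print(f"Whiskers: {whisker_low} °C to {whisker_high} °C")
```

Note that the whiskers stop at actual data points, not at the fences themselves, which is why they end at the minimum and maximum when no outliers are present.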

Summary

By applying these descriptive statistics techniques and visualizations, we've transformed a simple list of temperatures into a meaningful summary. We understand the typical temperature (around $23^\circ C$ ), the variability (a standard deviation of $\approx 2.46^\circ C$ ), and the overall distribution shape. This process of summarizing data is a fundamental first step in any data analysis or machine learning task. As you encounter larger and more complex datasets, these foundational techniques will remain essential for gaining initial insights.
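As a quick recap, most of the statistics computed step by step in this section can also be obtained in a single call with Pandas' describe method. This is a convenience sketch: describe does not report the mode, variance, or IQR, so those still need the separate calls shown earlier.

```python
import pandas as pd

temp_series = pd.Series(
    [22, 25, 19, 21, 24, 26, 23, 20, 22, 25, 28, 24, 21, 23],
    name="Daily Temperature (C)",
)

# count, mean, std (N-1 denominator), min, quartiles, and max in one call
summary = temp_series.describe()
print(summary)

# Individual entries can be looked up by label
print(f"Median from describe(): {summary['50%']} °C")
```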