Analyzing numerical data often involves summarizing its main characteristics using statistical measures. NumPy provides a suite of functions designed for fast computation of common statistics directly on ndarray objects. These functions are highly optimized and form the basis for many analytical tasks in data science and machine learning, so knowing how to use them effectively is an important part of building your data analysis toolkit.
Many of these functions can aggregate data across the entire array or along a specific axis, which is particularly useful when working with multi-dimensional data like matrices representing datasets or feature maps.
Let's start with fundamental aggregations that summarize data, such as sums, minimums, and maximums.
Consider a simple array:
import numpy as np
data = np.array([1, 5, 2, 8, 3, 9, 4, 7, 6])
# Calculate the sum of all elements
total_sum = np.sum(data)
print(f"Sum: {total_sum}") # Output: Sum: 45
# Find the minimum and maximum values
min_val = np.min(data)
max_val = np.max(data)
print(f"Min: {min_val}, Max: {max_val}") # Output: Min: 1, Max: 9
These functions work as expected on 1D arrays, but their utility becomes more apparent with multi-dimensional arrays, where you can specify the axis of operation. The axis parameter dictates the dimension along which the function operates: axis=0 aggregates down the columns (producing one result per column), and axis=1 aggregates across the rows (producing one result per row).
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
# Sum of all elements in the matrix
print(f"Total Sum: {np.sum(matrix)}") # Output: Total Sum: 45
# Sum along columns (axis=0)
print(f"Column Sums: {np.sum(matrix, axis=0)}") # Output: Column Sums: [12 15 18]
# Find the maximum value in each row (axis=1)
print(f"Row Maximums: {np.max(matrix, axis=1)}") # Output: Row Maximums: [3 6 9]
# Find the minimum value in each column (axis=0)
print(f"Column Minimums: {np.min(matrix, axis=0)}") # Output: Column Minimums: [1 2 3]
Using the axis parameter allows you to condense information along specific dimensions, a common requirement when summarizing features or samples in a dataset.
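For instance, here is a minimal sketch using a small made-up array (the values and the 4x3 shape are arbitrary, chosen purely for illustration), with rows as samples and columns as features:
# Hypothetical dataset: 4 samples (rows), 3 features (columns)
dataset = np.array([[2.0, 150.0, 0.5],
                    [3.0, 160.0, 0.7],
                    [2.5, 155.0, 0.6],
                    [3.5, 165.0, 0.8]])
# One summary value per feature: aggregate over the samples (axis=0)
feature_means = np.mean(dataset, axis=0)
print(feature_means) # feature means: 2.75, 157.5, 0.65
# One summary value per sample: aggregate over the features (axis=1)
sample_totals = np.sum(dataset, axis=1)
print(sample_totals) # per-sample totals: 152.5, 163.7, 158.1, 169.3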
Beyond simple sums and extremes, NumPy offers functions for calculating measures of central tendency (like mean and median) and dispersion (like standard deviation and variance).
- np.mean(): Computes the arithmetic mean.
- np.median(): Computes the median (the middle value of sorted data), which is more robust to outliers than the mean.
- np.std(): Computes the standard deviation, measuring the spread of data around the mean.
- np.var(): Computes the variance, which is the square of the standard deviation.

scores = np.array([75, 82, 88, 91, 65, 95, 88, 78])
# Calculate mean and median
mean_score = np.mean(scores)
median_score = np.median(scores)
print(f"Mean Score: {mean_score:.2f}") # Output: Mean Score: 82.75
print(f"Median Score: {median_score:.2f}") # Output: Median Score: 85.00
# Calculate variance and standard deviation
variance = np.var(scores)
std_dev = np.std(scores)
print(f"Variance: {variance:.2f}") # Output: Variance: 97.69
print(f"Standard Deviation: {std_dev:.2f}") # Output: Standard Deviation: 9.88
Like the aggregation functions, these also accept the axis parameter for multi-dimensional arrays:
# Using the 'matrix' from the previous example
print(f"Mean of each column: {np.mean(matrix, axis=0)}") # Output: Mean of each column: [4. 5. 6.]
print(f"Median of each row: {np.median(matrix, axis=1)}") # Output: Median of each row: [2. 5. 8.]
print(f"Std Dev of each column: {np.std(matrix, axis=0)}") # Output: Std Dev of each column: [2.44948974 2.44948974 2.44948974]
Percentiles provide insight into the distribution of data by indicating the value below which a certain percentage of observations fall. np.percentile() is the function for this. The median is, in fact, the 50th percentile.
data = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
# Calculate the 25th percentile (first quartile)
q1 = np.percentile(data, 25)
print(f"25th Percentile (Q1): {q1}") # Output: 25th Percentile (Q1): 32.5
# Calculate the 75th percentile (third quartile)
q3 = np.percentile(data, 75)
print(f"75th Percentile (Q3): {q3}") # Output: 75th Percentile (Q3): 77.5
# Calculate multiple percentiles at once
percentiles = np.percentile(data, [10, 50, 90])
print(f"10th, 50th, 90th Percentiles: {percentiles}") # Output: 10th, 50th, 90th Percentiles: [19. 55. 91.]
Percentiles are frequently used in exploratory data analysis to understand data spread and identify potential outliers, often visualized using box plots (which rely on quartiles).
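For example, here is a minimal sketch of the common 1.5 x IQR rule (one heuristic among several): values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as potential outliers.
# Flag potential outliers using the quartiles computed above
iqr = q3 - q1 # 77.5 - 32.5 = 45.0
lower_bound = q1 - 1.5 * iqr # -35.0
upper_bound = q3 + 1.5 * iqr # 145.0
outliers = data[(data < lower_bound) | (data > upper_bound)]
print(f"Outliers: {outliers}") # Output: Outliers: [] (this evenly spaced data has none)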
Understanding the relationship between different variables (features) is fundamental in machine learning. Correlation measures the linear association between two variables. NumPy's np.corrcoef() function computes the Pearson correlation coefficient matrix.
The Pearson correlation coefficient ranges from -1 to +1:
- +1 indicates a perfect positive linear relationship.
- 0 indicates no linear relationship.
- -1 indicates a perfect negative linear relationship.
The function expects an array where each row represents a variable and each column represents an observation.
# Data representing two variables (e.g., height and weight) for 5 individuals
# Variable 1 (Height): row 0
# Variable 2 (Weight): row 1
measurements = np.array([[1.7, 1.8, 1.6, 1.9, 1.5],  # Height (m)
                         [ 70,  85,  65,  90,  60]]) # Weight (kg)
# Calculate the correlation matrix
correlation_matrix = np.corrcoef(measurements)
print("Correlation Matrix:")
print(correlation_matrix)
# Output:
# Correlation Matrix:
# [[1.         0.97735555]  <- Correlation of Height with Height (1) and with Weight
#  [0.97735555 1.        ]] <- Correlation of Weight with Height and with Weight (1)
The resulting matrix is symmetric, and its diagonal elements are always 1 (the correlation of a variable with itself). The off-diagonal elements show the correlation between different variables. In this case, the value of roughly 0.977 indicates a strong positive linear relationship between height and weight in this sample dataset.
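Datasets are often laid out the other way around, with observations in rows and variables in columns. In that case you can either transpose the array or pass rowvar=False, which tells np.corrcoef() to treat each column as a variable:
# Same data, reorganized so each row is one observation (5 rows x 2 columns)
observations = measurements.T
print(np.corrcoef(observations, rowvar=False)) # same matrix as above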
Real-world datasets often contain missing values, represented in NumPy as np.nan (Not a Number). Standard statistical functions typically propagate NaN values, meaning the result will also be NaN if any input element is NaN.
data_with_nan = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
print(f"Sum with NaN: {np.sum(data_with_nan)}") # Output: Sum with NaN: nan
print(f"Mean with NaN: {np.mean(data_with_nan)}") # Output: Mean with NaN: nan
To handle this, NumPy provides NaN-safe versions of many statistical functions (e.g., np.nansum(), np.nanmean(), np.nanmedian(), np.nanstd(), np.nanvar(), np.nanpercentile()). These functions perform the calculation while ignoring any NaN values.
# Using the nan-safe functions
print(f"NaN-ignored Sum: {np.nansum(data_with_nan)}") # Output: NaN-ignored Sum: 12.0
print(f"NaN-ignored Mean: {np.nanmean(data_with_nan)}") # Output: NaN-ignored Mean: 3.0
print(f"NaN-ignored Max: {np.nanmax(data_with_nan)}") # Output: NaN-ignored Max: 5.0
Using these NaN-safe functions is often essential when performing initial statistical analysis on datasets, before more sophisticated imputation techniques are applied.
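As a small illustration of that workflow, here is a minimal sketch of mean imputation (one simple strategy among many), replacing each NaN with the NaN-ignored mean:
# Replace each NaN with the mean of the observed values
imputed = np.where(np.isnan(data_with_nan), np.nanmean(data_with_nan), data_with_nan)
print(imputed) # Output: [1. 2. 3. 4. 5.]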
These NumPy statistical functions provide a powerful and efficient way to compute descriptive statistics, understand data distributions, and explore relationships between variables, forming a critical part of the data exploration and preprocessing stages in machine learning.