Now that we understand the concepts behind metrics like accuracy, precision, recall, F1-score, and the confusion matrix, let's see how to compute them efficiently using Scikit-learn's metrics module. This module provides optimized functions to evaluate your classification models based on the true labels and the predictions generated by your model.
Assuming you have already trained a classifier (like LogisticRegression, KNeighborsClassifier, or SVC, as discussed earlier) and obtained predictions on your test set, you will typically have two arrays:

- y_true: The ground truth labels for the test data.
- y_pred: The labels predicted by your classifier for the test data.

Let's explore how to use these arrays to calculate the standard classification metrics.
First, ensure you import the necessary functions from sklearn.metrics. For demonstration purposes, let's define some sample true labels and predicted labels. In a real scenario, these would come from your model's evaluation on a test set.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
recall_score, f1_score, classification_report)
# Example ground truth labels (binary classification)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
# Example predicted labels from a hypothetical model
y_pred = np.array([1, 1, 1, 0, 0, 1, 1, 0, 0, 1])
# Example labels for multi-class classification
y_true_multi = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
y_pred_multi = np.array([0, 2, 1, 0, 0, 2, 0, 1, 2])
Accuracy is the simplest metric, representing the proportion of correct predictions. It's calculated as:

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$

In Scikit-learn, you use the accuracy_score function:
# Calculate accuracy
acc = accuracy_score(y_true, y_pred)
print(f"Accuracy: {acc:.4f}")
# Expected Output: Accuracy: 0.7000
While easy to understand, remember accuracy can be misleading, especially for datasets with imbalanced classes.
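As a quick, made-up illustration (continuing with the imports above), a model that always predicts the majority class can score high accuracy while never identifying the minority class:

# Hypothetical imbalanced labels: nine negatives, one positive
y_true_imb = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
# A "classifier" that always predicts the majority class
y_pred_imb = np.zeros(10, dtype=int)

print(f"Accuracy: {accuracy_score(y_true_imb, y_pred_imb):.2f}")  # 0.90, looks strong
print(f"Recall:   {recall_score(y_true_imb, y_pred_imb):.2f}")    # 0.00, misses every positive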
The confusion matrix provides a more detailed breakdown of prediction performance, showing counts of true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP).
Use the confusion_matrix function:
# Calculate confusion matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
# Expected Output:
# Confusion Matrix:
# [[3 2]
# [1 4]]
The output is typically arranged as:

[[TN, FP],
 [FN, TP]]

So, in our example: TN = 3, FP = 2, FN = 1, and TP = 4.
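If you want the four counts as separate variables, a common pattern is to flatten the matrix with ravel():

# Unpack the binary confusion matrix into its four counts
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
# Expected Output: TN=3, FP=2, FN=1, TP=4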
Visualizing the confusion matrix can often make it easier to interpret. Here's how you might generate a heatmap using Plotly:
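One possible sketch uses plotly.express.imshow on the cm array computed above (this assumes a recent Plotly release where imshow supports text_auto; the class labels are just illustrative):

import plotly.express as px

# Annotated heatmap of the binary confusion matrix
fig = px.imshow(
    cm,
    text_auto=True,                     # print the count inside each cell
    labels=dict(x="Predicted label", y="True label", color="Count"),
    x=["Class 0", "Class 1"],           # predicted class names
    y=["Class 0", "Class 1"],           # actual class names
    color_continuous_scale="Blues",
    title="Confusion Matrix",
)
fig.show()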
The resulting heatmap gives a visual breakdown of correct and incorrect predictions for each class.
For multi-class problems, the matrix expands accordingly, showing counts for each actual vs. predicted class pair.
# Confusion matrix for multi-class example
cm_multi = confusion_matrix(y_true_multi, y_pred_multi)
print("\nMulti-class Confusion Matrix:")
print(cm_multi)
# Expected Output:
# Multi-class Confusion Matrix:
# [[3 0 0]
# [1 1 1]
# [0 1 2]]
These metrics provide insights into specific aspects of performance, especially when class imbalance is a concern. Scikit-learn provides a dedicated function for each:

- precision_score: computes precision, TP / (TP + FP).
- recall_score: computes recall, TP / (TP + FN).
- f1_score: computes the F1-score, the harmonic mean of precision and recall.

# Calculate Precision, Recall, F1 for the positive class (label 1)
precision = precision_score(y_true, y_pred) # Default: pos_label=1
recall = recall_score(y_true, y_pred) # Default: pos_label=1
f1 = f1_score(y_true, y_pred) # Default: pos_label=1
print(f"\nBinary Classification Metrics (for class 1):")
print(f"Precision: {precision:.4f}") # TP / (TP + FP) = 4 / (4 + 2) = 0.6667
print(f"Recall: {recall:.4f}") # TP / (TP + FN) = 4 / (4 + 1) = 0.8000
print(f"F1-Score: {f1:.4f}") # 2 * (Prec * Rec) / (Prec + Rec) = 0.7273
# Expected Output:
# Binary Classification Metrics (for class 1):
# Precision: 0.6667
# Recall: 0.8000
# F1-Score: 0.7273
Handling Multi-class Metrics:

For multi-class problems, you need to specify how to average these metrics across classes using the average parameter:

- average='micro': Calculate metrics globally by counting total TP, FN, FP.
- average='macro': Calculate metrics for each label, and find their unweighted mean. Does not take label imbalance into account.
- average='weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). Accounts for label imbalance.
- average=None: Returns the scores for each class individually (demonstrated after the example below).

# Calculate multi-class metrics with different averaging
precision_macro = precision_score(y_true_multi, y_pred_multi, average='macro')
recall_weighted = recall_score(y_true_multi, y_pred_multi, average='weighted')
f1_micro = f1_score(y_true_multi, y_pred_multi, average='micro')
print(f"\nMulti-class Metrics:")
print(f"Macro Precision: {precision_macro:.4f}")
print(f"Weighted Recall: {recall_weighted:.4f}")
print(f"Micro F1-Score: {f1_micro:.4f}")
# Expected Output:
# Multi-class Metrics:
# Macro Precision: 0.6389
# Weighted Recall: 0.6667
# Micro F1-Score: 0.6667
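And as mentioned above, average=None returns one score per class (ordered by label), which makes it easy to see exactly where the model struggles:

# Per-class precision for the multi-class example (one value per label)
precision_per_class = precision_score(y_true_multi, y_pred_multi, average=None)
print(f"Per-class Precision: {precision_per_class}")
# Classes 0, 1, and 2 score roughly 0.75, 0.50, and 0.67 respectively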
Often, you'll want a summary of precision, recall, and F1-score for each class, along with the support (number of true instances per class). The classification_report function provides exactly this in a convenient text format.
# Generate the classification report for binary case
report_binary = classification_report(y_true, y_pred, target_names=['Class 0', 'Class 1'])
print("\nBinary Classification Report:")
print(report_binary)
# Expected Output:
# Binary Classification Report:
#               precision    recall  f1-score   support
#
#      Class 0       0.75      0.60      0.67         5
#      Class 1       0.67      0.80      0.73         5
#
#     accuracy                           0.70        10
#    macro avg       0.71      0.70      0.70        10
# weighted avg       0.71      0.70      0.70        10
# Generate the classification report for multi-class case
report_multi = classification_report(y_true_multi, y_pred_multi, target_names=['Class 0', 'Class 1', 'Class 2'])
print("\nMulti-class Classification Report:")
print(report_multi)
# Expected Output:
# Multi-class Classification Report:
#               precision    recall  f1-score   support
#
#      Class 0       0.75      1.00      0.86         3
#      Class 1       0.50      0.33      0.40         3
#      Class 2       0.67      0.67      0.67         3
#
#     accuracy                           0.67         9
#    macro avg       0.64      0.67      0.64         9
# weighted avg       0.64      0.67      0.64         9
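If you need these numbers programmatically rather than as formatted text, classification_report also accepts output_dict=True (available in recent Scikit-learn versions), which returns a nested dictionary instead of a string:

# Get the binary report as a nested dictionary for programmatic access
report_dict = classification_report(y_true, y_pred, output_dict=True)
print(report_dict["1"]["recall"])   # recall for class 1 -> 0.8
print(report_dict["accuracy"])      # overall accuracy -> 0.7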
In either format, the report includes, for each class, the precision, recall, F1-score, and support (the number of true instances of that class in y_true), along with the overall accuracy and the macro and weighted averages.

Scikit-learn's metrics module offers a straightforward way to quantify the performance of your classification models. Using functions like accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, and the comprehensive classification_report allows you to move beyond simple accuracy and gain a deeper understanding of how your model behaves across different classes, which is essential for building effective classification systems.