Now that you have your model's predictions for the test set, the data the model hasn't seen before, it's time to measure how well it actually performed. This step involves taking the predictions generated in the previous stage (`y_pred`) and comparing them directly against the true, known values (`y_true`) from your test set. The evaluation metrics you selected earlier provide the mathematical tools for this comparison.
Think of it like grading an exam. The student (your model) has provided answers (predictions) to questions they haven't seen before (the test set). You have the answer key (the true values). Calculating performance metrics is the process of applying the grading rubric (the chosen metrics like accuracy or MAE) to quantify how many answers were right or how close the answers were.
If you're working on a classification problem (like predicting spam vs. not spam, or identifying different types of objects), you'll use metrics designed for categorical outcomes. Let's revisit the common ones:
Confusion Matrix Components (TP, FP, TN, FN): The first step is often to categorize each prediction on the test set by comparing each value in `y_pred` with its corresponding value in `y_true`. A true positive (TP) is a positive instance correctly predicted as positive, a false positive (FP) is a negative instance incorrectly predicted as positive, a true negative (TN) is a negative instance correctly predicted as negative, and a false negative (FN) is a positive instance incorrectly predicted as negative.
Accuracy: This is the most straightforward metric. It's the proportion of total predictions that were correct.
$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} = \frac{TP + TN}{TP + FP + TN + FN}$$
You calculate this by summing the TP and TN counts and dividing by the total number of predictions in your test set.
Precision: Measures the accuracy of positive predictions. Of all the times the model predicted positive, how many were actually positive?
$$\text{Precision} = \frac{TP}{TP + FP}$$
Use the TP and FP counts derived from comparing `y_pred` and `y_true`.
Recall (Sensitivity): Measures how many of the actual positive cases the model correctly identified.
$$\text{Recall} = \frac{TP}{TP + FN}$$
Use the TP and FN counts.
F1-Score: The harmonic mean of Precision and Recall, providing a single score that balances both.
$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Calculate Precision and Recall first, then plug them into this formula.
Example Calculation:
Imagine your test set had 10 instances. After running your model, you compare predictions (`y_pred`) to actuals (`y_true`):
| Instance | Actual (`y_true`) | Predicted (`y_pred`) | Outcome |
|---|---|---|---|
| 1 | Positive | Positive | TP |
| 2 | Positive | Negative | FN |
| 3 | Negative | Negative | TN |
| 4 | Positive | Positive | TP |
| 5 | Negative | Positive | FP |
| 6 | Negative | Negative | TN |
| 7 | Positive | Positive | TP |
| 8 | Negative | Negative | TN |
| 9 | Positive | Negative | FN |
| 10 | Negative | Negative | TN |
From this comparison: TP = 3, FN = 2, TN = 4, FP = 1.
Now you can calculate the metrics:
Accuracy = (3 + 4) / 10 = 0.70
Precision = 3 / (3 + 1) = 0.75
Recall = 3 / (3 + 2) = 0.60
F1 = 2 × (0.75 × 0.60) / (0.75 + 0.60) ≈ 0.67
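To make this concrete, here is a small illustrative sketch (not part of the original example) that reproduces these numbers with scikit-learn. It assumes the labels are stored as the strings "Positive" and "Negative", with "Positive" treated as the positive class via the `pos_label` argument:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Labels from the 10-instance example above
y_true = ["Positive", "Positive", "Negative", "Positive", "Negative",
          "Negative", "Positive", "Negative", "Positive", "Negative"]
y_pred = ["Positive", "Negative", "Negative", "Positive", "Positive",
          "Negative", "Positive", "Negative", "Negative", "Negative"]

# pos_label tells scikit-learn which class counts as "positive"
print(accuracy_score(y_true, y_pred))                         # 0.7
print(precision_score(y_true, y_pred, pos_label="Positive"))  # 0.75
print(recall_score(y_true, y_pred, pos_label="Positive"))     # 0.6
print(f1_score(y_true, y_pred, pos_label="Positive"))         # ~0.67
```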
For regression problems, where the model predicts continuous numerical values (like house prices or temperature), you use metrics that measure the magnitude of the errors between the predicted values (`y_pred`) and the actual values (`y_true`).
Calculate Errors: The foundation for most regression metrics is the error, or residual, for each prediction: $e_i = y_{\text{true},i} - y_{\text{pred},i}$. You calculate this difference for every data point in your test set.
Mean Absolute Error (MAE): The average of the absolute errors. It tells you, on average, how far off your predictions are in the original units of your target variable.
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_{\text{true},i} - y_{\text{pred},i}|$$
Calculate the absolute error $|e_i|$ for each point, sum them up, and divide by the number of test points ($n$).
Mean Squared Error (MSE): The average of the squared errors. Squaring the errors penalizes larger mistakes more heavily than smaller ones.
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_{\text{true},i} - y_{\text{pred},i})^2$$
Calculate the squared error $e_i^2$ for each point, sum them up, and divide by $n$. Note that the units are squared (e.g., squared dollars if predicting price).
Root Mean Squared Error (RMSE): The square root of the MSE. Taking the square root brings the metric back into the original units of the target variable, making it easier to interpret than MSE, while still penalizing large errors.
$$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_{\text{true},i} - y_{\text{pred},i})^2}$$
Calculate MSE first, then take its square root.
Coefficient of Determination (R-squared, $R^2$): Represents the proportion of the variance in the dependent variable (the actual values) that is predictable from the independent variables (the features used by the model). It ranges from negative infinity to 1. A value of 1 means the model perfectly predicts the data. A value of 0 means the model performs no better than simply predicting the mean of the actual values. It's often calculated using the MSE:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_{\text{true},i} - y_{\text{pred},i})^2}{\sum_{i=1}^{n}(y_{\text{true},i} - \bar{y}_{\text{true}})^2} = 1 - \frac{\text{MSE}}{\text{Variance of } y_{\text{true}}}$$
where $\bar{y}_{\text{true}}$ is the mean of the true values in the test set.
Example Calculation:
Suppose your test set has 5 house price predictions:
| Instance | Actual Price (`y_true`) | Predicted Price (`y_pred`) | Error ($e_i$) | Absolute Error ($|e_i|$) | Squared Error ($e_i^2$) |
| :------- | :---------------------- | :------------------------- | :------------ | :----------------------- | :---------------------- |
| 1 | $250k | $260k | -$10k | $10k | 100M |
| 2 | $300k | $295k | $5k | $5k | 25M |
| 3 | $210k | $225k | -$15k | $15k | 225M |
| 4 | $450k | $440k | $10k | $10k | 100M |
| 5 | $320k | $310k | $10k | $10k | 100M |
| Sum | | | | $50k | 550M |
Now calculate the metrics (n = 5):
MAE = $50k / 5 = $10k
MSE = 550M / 5 = 110M (squared dollars)
RMSE = √110M ≈ $10.5k
Calculating R-squared additionally requires the variance of the actual prices, but the process amounts to comparing the sum of squared errors (550M) to the total variance in the actual prices.
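As a rough sketch of the arithmetic above (the variable names and the use of plain Python rather than a library are just for illustration), the regression metrics can be computed directly from the example prices:

```python
import math

# Illustrative house prices from the table above (in dollars)
y_true = [250_000, 300_000, 210_000, 450_000, 320_000]
y_pred = [260_000, 295_000, 225_000, 440_000, 310_000]
n = len(y_true)

# Residuals: e_i = y_true_i - y_pred_i
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n   # 10,000 dollars
mse = sum(e ** 2 for e in errors) / n   # 110,000,000 squared dollars
rmse = math.sqrt(mse)                   # ~10,488 dollars

# R^2: compare MSE to the variance of the actual values
mean_true = sum(y_true) / n
variance_true = sum((t - mean_true) ** 2 for t in y_true) / n
r_squared = 1 - mse / variance_true     # ~0.98

print(mae, mse, rmse, r_squared)
```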
While understanding the formulas is important, you won't typically calculate these metrics manually for large datasets. Programming libraries like scikit-learn in Python provide functions that compute these metrics efficiently. You simply provide your test set's true values (`y_true`) and your model's predictions (`y_pred`), and the library handles the calculations. For example, you might use functions like `accuracy_score(y_true, y_pred)`, `precision_score(y_true, y_pred)`, `mean_squared_error(y_true, y_pred)`, or `r2_score(y_true, y_pred)`.
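For instance, a minimal sketch reusing the house price numbers from the earlier example; the only addition beyond the functions named above is `mean_absolute_error`, also from scikit-learn:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Same illustrative house prices as the regression example above (in dollars)
y_true = [250_000, 300_000, 210_000, 450_000, 320_000]
y_pred = [260_000, 295_000, 225_000, 440_000, 310_000]

print(mean_absolute_error(y_true, y_pred))  # 10000.0
print(mean_squared_error(y_true, y_pred))   # 110000000.0
print(r2_score(y_true, y_pred))             # ~0.98
```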
This step gives you concrete numbers that summarize your model's performance on unseen data. The next crucial step is to interpret what these numbers mean in the context of your specific problem.