When a recommendation system's goal is to predict the specific rating a user might give an item, we need metrics that directly measure the accuracy of these predictions. This is often the case for systems that display predicted scores to users, such as "Based on your history, you might rate this movie 4.5 stars." These situations call for prediction accuracy metrics, which evaluate how close a model's predicted ratings are to the actual ratings provided by users.
Two of the most common and fundamental metrics for this task are Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). They both quantify the average error in a set of predictions, but they do so in slightly different ways, leading to different interpretations and sensitivities.
Mean Absolute Error is the most straightforward error metric. It measures the average magnitude of the errors in a set of predictions, without considering their direction. It is the average, over the test sample, of the absolute differences between predicted and actual ratings, where every individual difference carries equal weight.
The formula for MAE is:

$$ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| r_i - \hat{r}_i \right| $$

Where:

- $n$ is the number of rating predictions in the test set
- $r_i$ is the actual rating the user gave for the $i$-th prediction
- $\hat{r}_i$ is the rating the model predicted
The interpretation is simple and direct. An MAE of 0.5 means that, on average, your model's prediction is off by 0.5 stars. This makes it an easily communicable metric for business stakeholders.
Let's see how to calculate it in Python. Assuming you have a pandas DataFrame with actual ratings and predicted ratings:
import pandas as pd
import numpy as np
# Sample data
data = {
    'user_id': [1, 1, 2, 2, 3],
    'item_id': [101, 102, 101, 103, 104],
    'actual_rating': [4, 3, 5, 2, 4],
    'predicted_rating': [3.8, 3.5, 4.5, 2.8, 3.9]
}
df = pd.DataFrame(data)
# Calculate MAE from scratch
df['absolute_error'] = abs(df['actual_rating'] - df['predicted_rating'])
mae = df['absolute_error'].mean()
print(f"Calculated MAE: {mae:.4f}")
# Using scikit-learn for convenience
from sklearn.metrics import mean_absolute_error
mae_sklearn = mean_absolute_error(df['actual_rating'], df['predicted_rating'])
print(f"Scikit-learn MAE: {mae_sklearn:.4f}")
Both methods will yield the same result, but using established libraries like scikit-learn is generally preferred as it's less error-prone and more efficient on large datasets.
Root Mean Squared Error is another widely used metric for evaluating rating prediction accuracy. While MAE averages the absolute errors, RMSE takes a different approach: it squares the errors before averaging them and then takes the square root of the result.
The formula for RMSE is:

$$ \text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( r_i - \hat{r}_i \right)^2 } $$

The steps are:

1. Compute the error $(r_i - \hat{r}_i)$ for each prediction.
2. Square each error.
3. Average the squared errors (this intermediate value is the Mean Squared Error, MSE).
4. Take the square root of that average.
Here's the corresponding Python implementation:
# Continuing with the previous DataFrame
# Calculate RMSE from scratch
df['squared_error'] = (df['actual_rating'] - df['predicted_rating'])**2
mse = df['squared_error'].mean()
rmse = np.sqrt(mse)
print(f"Calculated RMSE: {rmse:.4f}")
# Using scikit-learn
from sklearn.metrics import mean_squared_error
# Note: sklearn provides mean_squared_error, so we take the square root
rmse_sklearn = np.sqrt(mean_squared_error(df['actual_rating'], df['predicted_rating']))
print(f"Scikit-learn RMSE: {rmse_sklearn:.4f}")
The primary difference between MAE and RMSE lies in how they treat errors of different magnitudes. Because RMSE squares the errors, it penalizes large prediction errors more severely than MAE does. An RMSE value will always be greater than or equal to the MAE value for the same set of predictions. The greater the difference between them, the more variance there is in the individual errors in your sample. A large difference suggests that your model is making a few very large errors.
Let's illustrate this with an example. Consider two scenarios for a model's predictions. In Scenario A, the errors are small and consistent. In Scenario B, most errors are small, but there is one significant outlier.
- Scenario A: Actuals are [4, 5, 3], Predictions are [3.5, 4.5, 3.5]. Errors are [-0.5, -0.5, 0.5].
- Scenario B: Actuals are [4, 5, 3], Predictions are [3.5, 4.5, 1.0]. Errors are [-0.5, -0.5, -2.0].

For Scenario A:

- MAE = (0.5 + 0.5 + 0.5) / 3 = 0.5
- RMSE = sqrt((0.25 + 0.25 + 0.25) / 3) = sqrt(0.25) = 0.5

For Scenario B:

- MAE = (0.5 + 0.5 + 2.0) / 3 = 1.0
- RMSE = sqrt((0.25 + 0.25 + 4.0) / 3) = sqrt(1.5) ≈ 1.22
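These results are easy to verify in code. Here is a minimal NumPy sketch that reproduces the numbers for both scenarios:

import numpy as np

actual = np.array([4.0, 5.0, 3.0])
scenarios = {
    'A (consistent errors)': np.array([3.5, 4.5, 3.5]),
    'B (one large outlier)': np.array([3.5, 4.5, 1.0]),
}

for name, predicted in scenarios.items():
    # Errors follow the predicted-minus-actual convention used above
    errors = predicted - actual
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    print(f"Scenario {name}: MAE = {mae:.2f}, RMSE = {rmse:.2f}")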
Notice that while the single large error in Scenario B doubled the MAE (from 0.5 to 1.0), it increased the RMSE by a larger factor (from 0.5 to 1.22). The chart below visualizes this sensitivity.
The introduction of a single large prediction error in Scenario B causes a much sharper increase in RMSE compared to MAE, highlighting its sensitivity to outliers.
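If you want to reproduce a chart like the one described, a minimal matplotlib sketch could look like the following (the exact styling of the original figure is an assumption):

import matplotlib.pyplot as plt
import numpy as np

metrics = ['MAE', 'RMSE']
scenario_a = [0.5, 0.5]   # small, consistent errors
scenario_b = [1.0, 1.22]  # one large outlier error

x = np.arange(len(metrics))
width = 0.35

fig, ax = plt.subplots()
ax.bar(x - width / 2, scenario_a, width, label='Scenario A')
ax.bar(x + width / 2, scenario_b, width, label='Scenario B')
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.set_ylabel('Error value')
ax.set_title('RMSE is more sensitive to a single large error')
ax.legend()
plt.show()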
The choice between MAE and RMSE depends on your application's tolerance for large errors. If a few badly wrong predictions are disproportionately harmful, such as confidently predicting a high rating for an item a user will strongly dislike, RMSE's heavier penalty on large errors makes it the more informative metric. If every unit of error is equally costly, MAE is the more natural choice.
In many practical settings, RMSE is the default metric for evaluating rating prediction models, as large errors can significantly harm the user experience. However, it's always good practice to report both, as the difference between them can provide useful information about the distribution of your model's errors.
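In practice, a small helper that reports both metrics side by side makes this diagnostic routine; the function below is an illustrative sketch, not a standard API:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def report_prediction_errors(actual, predicted):
    """Print MAE, RMSE, and their ratio as a rough error-spread diagnostic."""
    mae = mean_absolute_error(actual, predicted)
    rmse = np.sqrt(mean_squared_error(actual, predicted))
    # RMSE/MAE equals 1.0 when every error has the same magnitude;
    # larger ratios indicate that a few large errors dominate.
    print(f"MAE:  {mae:.4f}")
    print(f"RMSE: {rmse:.4f}")
    print(f"RMSE/MAE ratio: {rmse / mae:.2f}")

# Using the DataFrame from the earlier examples
report_prediction_errors(df['actual_rating'], df['predicted_rating'])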
While MAE and RMSE are foundational for evaluating prediction accuracy, remember that they are not the whole story. Many recommendation systems are not judged by how well they predict ratings, but by how well they rank items. For that, we need a different class of metrics, which we will explore next.
For implementation details, see the documentation for scikit-learn's sklearn.metrics module, which includes definitions and implementation details for mean_absolute_error and mean_squared_error.