When a recommendation system's goal is to predict the specific rating a user might give an item, we need metrics that directly measure the accuracy of these predictions. This is often the case for systems that display predicted scores to users, such as "Based on your history, you might rate this movie 4.5 stars." These situations call for prediction accuracy metrics, which evaluate how close a model's predicted ratings are to the actual ratings provided by users.
Two of the most common and fundamental metrics for this task are Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). They both quantify the average error in a set of predictions, but they do so in slightly different ways, leading to different interpretations and sensitivities.
Mean Absolute Error is the most straightforward error metric. It measures the average magnitude of the errors in a set of predictions, without considering their direction. It is the average, over the test sample, of the absolute differences between predicted and actual ratings, where every individual difference carries equal weight.
The formula for MAE is:

$$ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| r_i - \hat{r}_i \right| $$

Where:

- $n$ is the number of rating predictions in the test set
- $r_i$ is the actual rating the user gave for the $i$-th prediction
- $\hat{r}_i$ is the rating the model predicted
The interpretation is simple and direct. An MAE of 0.5 means that, on average, your model's prediction is off by 0.5 stars. This makes it an easily communicable metric for business stakeholders.
Let's see how to calculate it in Python. Assuming you have a pandas DataFrame with actual ratings and predicted ratings:
import pandas as pd
import numpy as np
# Sample data
data = {
    'user_id': [1, 1, 2, 2, 3],
    'item_id': [101, 102, 101, 103, 104],
    'actual_rating': [4, 3, 5, 2, 4],
    'predicted_rating': [3.8, 3.5, 4.5, 2.8, 3.9]
}
df = pd.DataFrame(data)
# Calculate MAE from scratch
df['absolute_error'] = abs(df['actual_rating'] - df['predicted_rating'])
mae = df['absolute_error'].mean()
print(f"Calculated MAE: {mae:.4f}")
# Using scikit-learn for convenience
from sklearn.metrics import mean_absolute_error
mae_sklearn = mean_absolute_error(df['actual_rating'], df['predicted_rating'])
print(f"Scikit-learn MAE: {mae_sklearn:.4f}")
Both methods will yield the same result, but using established libraries like scikit-learn is generally preferred as it's less error-prone and more efficient on large datasets.
Root Mean Squared Error is another widely used metric for evaluating rating prediction accuracy. While MAE averages the absolute errors, RMSE takes a different approach: it squares the errors before averaging them and then takes the square root of the result.
The formula for RMSE is:

$$ \text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( r_i - \hat{r}_i \right)^2 } $$

The steps are:

1. Compute the error $(r_i - \hat{r}_i)$ for each prediction.
2. Square each error.
3. Average the squared errors (this intermediate value is the Mean Squared Error, MSE).
4. Take the square root of that average.
Here's the corresponding Python implementation:
# Continuing with the previous DataFrame
# Calculate RMSE from scratch
df['squared_error'] = (df['actual_rating'] - df['predicted_rating'])**2
mse = df['squared_error'].mean()
rmse = np.sqrt(mse)
print(f"Calculated RMSE: {rmse:.4f}")
# Using scikit-learn
from sklearn.metrics import mean_squared_error
# Note: sklearn provides mean_squared_error, so we take the square root
rmse_sklearn = np.sqrt(mean_squared_error(df['actual_rating'], df['predicted_rating']))
print(f"Scikit-learn RMSE: {rmse_sklearn:.4f}")
The primary difference between MAE and RMSE lies in how they treat errors of different magnitudes. Because RMSE squares the errors, it penalizes large prediction errors more severely than MAE does. An RMSE value will always be greater than or equal to the MAE value for the same set of predictions. The greater the difference between them, the more variance there is in the individual errors in your sample. A large difference suggests that your model is making a few very large errors.
Let's illustrate this with an example. Consider two scenarios for a model's predictions. In Scenario A, the errors are small and consistent. In Scenario B, most errors are small, but there is one significant outlier.
- Scenario A: Actuals are [4, 5, 3], Predictions are [3.5, 4.5, 3.5]. Errors are [-0.5, -0.5, 0.5].
- Scenario B: Actuals are [4, 5, 3], Predictions are [3.5, 4.5, 1.0]. Errors are [-0.5, -0.5, -2.0].

For Scenario A:

- MAE = (0.5 + 0.5 + 0.5) / 3 = 0.5
- RMSE = sqrt((0.25 + 0.25 + 0.25) / 3) = sqrt(0.25) = 0.5

For Scenario B:

- MAE = (0.5 + 0.5 + 2.0) / 3 = 1.0
- RMSE = sqrt((0.25 + 0.25 + 4.0) / 3) = sqrt(1.5) ≈ 1.22
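These results are easy to verify in code. Here is a minimal NumPy sketch that reproduces the numbers for both scenarios:

import numpy as np

actual = np.array([4.0, 5.0, 3.0])
scenarios = {
    'A (consistent errors)': np.array([3.5, 4.5, 3.5]),
    'B (one large outlier)': np.array([3.5, 4.5, 1.0]),
}

for name, predicted in scenarios.items():
    # Errors follow the predicted-minus-actual convention used above
    errors = predicted - actual
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    print(f"Scenario {name}: MAE = {mae:.2f}, RMSE = {rmse:.2f}")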
Notice that while the single large error in Scenario B doubled the MAE (from 0.5 to 1.0), it increased the RMSE by a larger factor (from 0.5 to 1.22). The chart below visualizes this sensitivity.
The introduction of a single large prediction error in Scenario B causes a much sharper increase in RMSE compared to MAE, highlighting its sensitivity to outliers.
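If you want to reproduce a chart like the one described, a minimal matplotlib sketch could look like the following (the exact styling of the original figure is an assumption):

import matplotlib.pyplot as plt
import numpy as np

metrics = ['MAE', 'RMSE']
scenario_a = [0.5, 0.5]   # small, consistent errors
scenario_b = [1.0, 1.22]  # one large outlier error

x = np.arange(len(metrics))
width = 0.35

fig, ax = plt.subplots()
ax.bar(x - width / 2, scenario_a, width, label='Scenario A')
ax.bar(x + width / 2, scenario_b, width, label='Scenario B')
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.set_ylabel('Error value')
ax.set_title('RMSE is more sensitive to a single large error')
ax.legend()
plt.show()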
The choice between MAE and RMSE depends on your application's tolerance for large errors. If a few badly wrong predictions are disproportionately harmful, such as confidently predicting a high rating for an item a user will strongly dislike, RMSE's heavier penalty on large errors makes it the more informative metric. If every unit of error is equally costly, MAE is the more natural choice.
In many practical settings, RMSE is the default metric for evaluating rating prediction models, as large errors can significantly harm the user experience. However, it's always good practice to report both, as the difference between them can provide useful information about the distribution of your model's errors.
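In practice, a small helper that reports both metrics side by side makes this diagnostic routine; the function below is an illustrative sketch, not a standard API:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def report_prediction_errors(actual, predicted):
    """Print MAE, RMSE, and their ratio as a rough error-spread diagnostic."""
    mae = mean_absolute_error(actual, predicted)
    rmse = np.sqrt(mean_squared_error(actual, predicted))
    # RMSE/MAE equals 1.0 when every error has the same magnitude;
    # larger ratios indicate that a few large errors dominate.
    print(f"MAE:  {mae:.4f}")
    print(f"RMSE: {rmse:.4f}")
    print(f"RMSE/MAE ratio: {rmse / mae:.2f}")

# Using the DataFrame from the earlier examples
report_prediction_errors(df['actual_rating'], df['predicted_rating'])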
While MAE and RMSE are foundational for evaluating prediction accuracy, remember that they are not the whole story. Many recommendation systems are not judged by how well they predict ratings, but by how well they rank items. For that, we need a different class of metrics, which we will explore next.
For implementation details, see the documentation for scikit-learn's sklearn.metrics module, which includes definitions and implementation details for mean_absolute_error and mean_squared_error.