Once you've trained a forecasting model, such as ARIMA or SARIMA, you need to quantify its performance. Simply looking at plots isn't enough for objective comparison or reporting. Evaluation metrics provide a standardized way to measure how close your model's forecasts are to the actual observed values in your test dataset. These metrics focus on the prediction errors, which are the differences between the actual values ($\text{Actual}_i$) and the forecasted values ($\text{Forecast}_i$) at each time step $i$.
Let's examine four common metrics used in time series forecasting.
Mean Absolute Error (MAE)
The Mean Absolute Error, or MAE, represents the average absolute difference between the forecasts and the actual values. It tells you, on average, how far off your predictions are from the real outcomes, ignoring the direction of the error (whether you predicted too high or too low).
The formula is:
$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \text{Actual}_i - \text{Forecast}_i \right|$$
where n is the number of time points in your evaluation period (e.g., the test set).
Interpretation: The MAE is expressed in the same units as your original time series data. For example, if you are forecasting daily sales in dollars, an MAE of 50 means your forecasts are off by $50, on average.
Pros:
It's easy to understand and explain.
The units are interpretable in the context of the original data.
It's less sensitive to large individual errors (outliers) compared to metrics that square the errors.
Cons:
It treats all errors linearly, meaning an error of 10 is considered exactly twice as bad as an error of 5. This might not reflect the business impact in situations where large errors are disproportionately costly.
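As a quick illustration of the MAE formula above, here is a minimal sketch using NumPy on a small, made-up test set (the actual and forecast arrays are placeholder numbers, not output from a real model). Libraries such as scikit-learn also provide a mean_absolute_error function if you prefer not to compute it by hand.

```python
import numpy as np

# Hypothetical actual values and forecasts for a five-step test period
actual = np.array([112.0, 118.0, 132.0, 129.0, 121.0])
forecast = np.array([110.0, 123.0, 128.0, 135.0, 119.0])

# MAE: average of the absolute errors, in the same units as the data
mae = np.mean(np.abs(actual - forecast))
print(f"MAE: {mae:.2f}")  # 3.80
```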
Mean Squared Error (MSE)
The Mean Squared Error, or MSE, calculates the average of the squared differences between forecasts and actual values. By squaring the errors, it gives significantly more weight to larger errors than smaller ones.
The formula is:
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \text{Actual}_i - \text{Forecast}_i \right)^2$$
Interpretation: The MSE is measured in the square of the original data units (e.g., dollars squared if forecasting sales). This makes direct interpretation less intuitive compared to MAE. A lower MSE indicates a better fit, but the magnitude itself is harder to relate directly to the scale of the data.
Pros:
It strongly penalizes large errors, which can be desirable if large mistakes are particularly problematic.
The squaring makes it mathematically convenient for some optimization algorithms (though this is more relevant during model training than evaluation).
Cons:
The units (squared units of the original data) make it difficult to interpret directly.
It is highly sensitive to outliers. A few large errors can dominate the MSE value.
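A matching sketch for MSE, reusing the same placeholder arrays as the MAE example above:

```python
import numpy as np

actual = np.array([112.0, 118.0, 132.0, 129.0, 121.0])
forecast = np.array([110.0, 123.0, 128.0, 135.0, 119.0])

# MSE: average of the squared errors, in squared data units
mse = np.mean((actual - forecast) ** 2)
print(f"MSE: {mse:.2f}")  # 17.00
```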
Root Mean Squared Error (RMSE)
The Root Mean Squared Error, or RMSE, is simply the square root of the MSE. Taking the square root addresses the primary interpretation issue of MSE by returning the error metric to the original data units.
The formula is:
$$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \text{Actual}_i - \text{Forecast}_i \right)^2}$$
Interpretation: Like MAE, the RMSE is expressed in the same units as the original time series data. An RMSE of 55 for daily sales forecasts means the typical magnitude of the error is around $55. Because it's derived from squared errors, the RMSE will always be greater than or equal to the MAE. The difference between RMSE and MAE can indicate the variance in the individual errors; a larger difference suggests the presence of some larger errors that disproportionately influence the RMSE.
Pros:
It has interpretable units (same as the original data).
It still penalizes larger errors more heavily than smaller ones, similar to MSE.
Cons:
It remains sensitive to outliers, although the square root moderates this slightly compared to MSE.
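RMSE is just one extra step on top of MSE; continuing the same placeholder example:

```python
import numpy as np

actual = np.array([112.0, 118.0, 132.0, 129.0, 121.0])
forecast = np.array([110.0, 123.0, 128.0, 135.0, 119.0])

# RMSE: square root of the mean squared error, back in the original units
rmse = np.sqrt(np.mean((actual - forecast) ** 2))
print(f"RMSE: {rmse:.2f}")  # 4.12, slightly above the MAE of 3.80
```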
Mean Absolute Percentage Error (MAPE)
The Mean Absolute Percentage Error, or MAPE, calculates the average absolute error as a percentage of the actual values. This makes it a relative measure, independent of the scale of the data.
The formula is:
$$\text{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\text{Actual}_i - \text{Forecast}_i}{\text{Actual}_i} \right| \times 100\%$$
Interpretation: MAPE expresses the average error in percentage terms. A MAPE of 10% means that, on average, the forecast is off by 10% of the actual value. This can be very useful for comparing forecast accuracy across time series with different scales (e.g., forecasting sales for a high-volume product vs. a low-volume product).
Pros:
It's scale-independent, allowing for comparisons across different datasets or items.
It's expressed as a percentage, which is often intuitive for business stakeholders.
Cons:
It produces infinite or undefined values if any actual value is zero ($\text{Actual}_i = 0$).
It can be unreliable or heavily skewed when actual values are very close to zero.
It penalizes over-forecasts more heavily than under-forecasts. Because each error is divided by the actual value, an under-forecast (with a non-negative forecast) can contribute at most 100% per time step, while an over-forecast can contribute an arbitrarily large percentage. For example, with Actual = 100 and Forecast = 150 the error term is 50%, but with Actual = 150 and Forecast = 100 it is only about 33%, even though the absolute error is the same. As a result, MAPE tends to favor models that systematically under-forecast.
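Here is a minimal MAPE sketch on the same placeholder data; the zero-handling shown (dropping time steps where the actual value is zero) is just one possible convention, chosen only to keep the example self-contained.

```python
import numpy as np

actual = np.array([112.0, 118.0, 132.0, 129.0, 121.0])
forecast = np.array([110.0, 123.0, 128.0, 135.0, 119.0])

# MAPE: average absolute error as a percentage of the actual values.
# Drop zero actuals to avoid division by zero (one of several conventions).
nonzero = actual != 0
mape = np.mean(np.abs((actual[nonzero] - forecast[nonzero]) / actual[nonzero])) * 100
print(f"MAPE: {mape:.2f}%")  # 3.07%
```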
Visualizing Errors
Metrics give you a single number summary, but visualizing the forecasts against the actual values helps understand where the model is performing poorly.
A simple plot comparing actual values with model forecasts over the test period. The vertical distance between the lines at each point represents the error that the metrics summarize.
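A minimal matplotlib sketch of such a comparison plot, again using placeholder arrays in place of real test data and forecasts:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical test-period values; in practice these come from your hold-out set
actual = np.array([112.0, 118.0, 132.0, 129.0, 121.0])
forecast = np.array([110.0, 123.0, 128.0, 135.0, 119.0])
steps = np.arange(1, len(actual) + 1)

plt.plot(steps, actual, marker="o", label="Actual")
plt.plot(steps, forecast, marker="x", linestyle="--", label="Forecast")
plt.xlabel("Test period step")
plt.ylabel("Value")
plt.title("Actual vs. forecast over the test period")
plt.legend()
plt.show()
```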
Choosing the Right Metric
There's no single "best" metric for all situations. The choice depends on your specific goals and the characteristics of your data:
Use MAE if you want an easily interpretable metric in the original units and if large errors are not disproportionately worse than smaller ones.
Use RMSE if you want an interpretable metric in the original units but need to penalize larger errors more heavily. It's a very common metric for regression and forecasting tasks.
Use MSE primarily if its mathematical properties are advantageous (e.g., in optimization contexts), but be mindful of its interpretability issues and outlier sensitivity.
Use MAPE if you need a scale-independent measure for comparing across different series or if percentage errors are most relevant to stakeholders. However, use it with caution if your data contains zeros or values close to zero.
Often, it's useful to report multiple metrics (e.g., MAE and RMSE) to provide a more complete picture of the forecast accuracy and error characteristics. When comparing different models on the same test set, consistently using the chosen metric(s) allows for objective assessment of which model performs better according to that specific criterion.
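As a sketch of that workflow, the helper below computes MAE, RMSE, and MAPE for each candidate model on the same test set. The arrays forecast_arima and forecast_sarima are hypothetical placeholders standing in for the output of whatever models you are comparing.

```python
import numpy as np

def evaluate(actual, forecast):
    """Return MAE, RMSE, and MAPE (in percent) for one model's forecasts."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    errors = actual - forecast
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    nonzero = actual != 0  # skip zero actuals so MAPE stays defined
    mape = np.mean(np.abs(errors[nonzero] / actual[nonzero])) * 100
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape}

# Placeholder test-set values and forecasts from two hypothetical models
actual = [112.0, 118.0, 132.0, 129.0, 121.0]
forecast_arima = [110.0, 123.0, 128.0, 135.0, 119.0]
forecast_sarima = [113.0, 117.0, 130.0, 131.0, 124.0]

for name, fc in [("ARIMA", forecast_arima), ("SARIMA", forecast_sarima)]:
    print(name, {k: round(v, 2) for k, v in evaluate(actual, fc).items()})
```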