After fitting several candidate models, such as different ARIMA configurations or an ARIMA alongside a SARIMA, the objective is to select the one that performs best on unseen data. This comparison relies heavily on the evaluation metrics and criteria introduced earlier, applied consistently across all contenders.
The fundamental process involves these steps (a code sketch of the split and metric calculations follows the list):
- Consistent Data Splits: Ensure all models are trained on the exact same training dataset and evaluated on the exact same test dataset. Using different data subsets for different models invalidates the comparison. Remember to use appropriate time series splitting techniques discussed previously.
- Metric Calculation: For each model, generate forecasts for the period covered by the test set. Calculate the chosen error metrics (MAE, MSE, RMSE, MAPE) by comparing these forecasts against the actual values in the test set.
- Information Criteria Review: Recall the AIC and BIC values calculated during the model fitting phase (on the training data). While error metrics on the test set measure predictive accuracy, AIC and BIC help compare models based on their goodness-of-fit penalized by complexity.
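As a minimal sketch of the first two steps, assume a monthly pandas Series `y` (a placeholder standing in for your actual data); the time-ordered split and the error-metric calculations might look like this:

```python
import numpy as np
import pandas as pd

# Placeholder monthly series standing in for the real data.
rng = np.random.default_rng(0)
y = pd.Series(
    100 + np.arange(120) + rng.normal(0, 5, size=120),
    index=pd.date_range("2015-01-01", periods=120, freq="MS"),
)

# Keep the split time-ordered (no shuffling): the last 24 observations
# form the held-out test set.
train, test = y.iloc[:-24], y.iloc[-24:]

def mae(actual, forecast):
    return np.mean(np.abs(actual - forecast))

def rmse(actual, forecast):
    return np.sqrt(np.mean((actual - forecast) ** 2))

def mape(actual, forecast):
    # Undefined for zero actuals and unstable near zero; use with care.
    return np.mean(np.abs((actual - forecast) / actual)) * 100
```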
Let's imagine we have fitted two models to our training data: a non-seasonal ARIMA(1,1,1) and a SARIMA(1,1,1)(1,1,0,12) designed to capture yearly seasonality (m = 12). We now evaluate them on the test set.
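A sketch of how these two candidates could be fitted and scored, assuming statsmodels and reusing the hypothetical `train`/`test` split and metric helpers from the snippet above:

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Non-seasonal ARIMA(1,1,1), expressed through SARIMAX with no seasonal part.
arima_res = SARIMAX(train, order=(1, 1, 1)).fit(disp=False)

# SARIMA(1,1,1)(1,1,0,12): the same non-seasonal orders plus a yearly
# seasonal structure (m = 12).
sarima_res = SARIMAX(
    train, order=(1, 1, 1), seasonal_order=(1, 1, 0, 12)
).fit(disp=False)

# AIC and BIC are computed on the training data during fitting.
print("ARIMA  AIC/BIC:", arima_res.aic, arima_res.bic)
print("SARIMA AIC/BIC:", sarima_res.aic, sarima_res.bic)

# Forecast over the test horizon and compute the error metrics.
arima_fc = arima_res.forecast(steps=len(test))
sarima_fc = sarima_res.forecast(steps=len(test))
for name, fc in [("ARIMA", arima_fc), ("SARIMA", sarima_fc)]:
    a, f = test.to_numpy(), fc.to_numpy()
    print(name, mae(a, f), rmse(a, f), mape(a, f))
```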
Comparing Metrics on the Test Set
After generating forecasts from both models for the test period, we compute the error metrics. Suppose we obtain the following results:
| Model | MAE | RMSE | MAPE (%) | AIC (from training) | BIC (from training) |
|---|---|---|---|---|---|
| ARIMA(1,1,1) | 15.2 | 19.8 | 8.5 | 950.5 | 962.1 |
| SARIMA(1,1,1)(1,1,0,12) | 9.8 | 12.5 | 5.1 | 885.2 | 902.6 |
Interpreting these results:
- Error Metrics (MAE, RMSE, MAPE): The SARIMA model shows lower values for all error metrics on the test set than the ARIMA model, indicating that its forecasts were, on average, closer to the actual values during the test period. The lower RMSE suggests it was also better at avoiding large individual errors, and the lower MAPE indicates a smaller average percentage error, which is useful for relative comparisons.
- Information Criteria (AIC, BIC): The SARIMA model also has lower AIC and BIC values. This suggests that, even considering its higher complexity (more parameters), the significant improvement in fit on the training data (which these criteria measure) justifies the additional parameters. Lower AIC/BIC values generally point towards a better model in terms of balancing fit and parsimony.
In this scenario, both the test set performance and the information criteria point towards the SARIMA(1,1,1)(1,1,0,12) model as the superior choice for this particular dataset, likely because the underlying data exhibited seasonality that the simple ARIMA model couldn't capture effectively.
Visual Comparison
Besides numerical metrics, plotting the forecasts from different models against the actual values in the test set provides valuable visual insight.
Comparison of actual test data against forecasts from an ARIMA and a SARIMA model. The SARIMA forecast follows the actual data more closely.
This visualization allows you to see where each model performs well or poorly. Does one model consistently overestimate or underestimate? Does it capture turning points better? The SARIMA forecast in the plot above appears to track the actual data more closely than the simpler ARIMA forecast, reinforcing the conclusion drawn from the metrics.
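As an illustration, a quick matplotlib version of such a plot, reusing the hypothetical `test`, `arima_fc`, and `sarima_fc` objects from the earlier snippets, might look like this:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(test.index, test, color="black", label="Actual")
ax.plot(test.index, arima_fc.to_numpy(), linestyle="--",
        label="ARIMA(1,1,1) forecast")
ax.plot(test.index, sarima_fc.to_numpy(), linestyle=":",
        label="SARIMA(1,1,1)(1,1,0,12) forecast")
ax.set_title("Test-set forecasts vs. actual values")
ax.set_xlabel("Date")
ax.legend()
plt.show()
```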
Choosing the "Best" Model
Often, one model will clearly outperform others across most metrics, making the choice straightforward. However, sometimes you might face trade-offs:
- Model A has lower MAE, but Model B has lower RMSE.
- Model C has the best test set metrics, but Model D has significantly lower AIC/BIC and is much simpler.
In such cases, consider:
- Application Context: Is it more important to minimize the average error (favor MAE) or avoid large, infrequent errors (favor RMSE)? If percentage errors are more meaningful to stakeholders, focus on MAPE (being mindful of its limitations with zero or near-zero actual values).
- Parsimony: Simpler models (fewer parameters) are often preferred if their performance is only marginally worse than more complex models. They are generally easier to understand, faster to compute, and potentially more robust to minor changes in the data pattern (less prone to overfitting). AIC and BIC explicitly penalize complexity.
- Residual Analysis: Ensure the chosen model's residuals (on the training data) satisfy the model assumptions (e.g., appear to be white noise). A model might have good forecast metrics but fail diagnostic checks, indicating potential underlying issues; see the sketch after this list for one way to run such a check.
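For instance, a quick residual check on the hypothetical `sarima_res` fit from the earlier snippet could use the Ljung-Box test and the built-in diagnostic plots from statsmodels:

```python
import matplotlib.pyplot as plt
from statsmodels.stats.diagnostic import acorr_ljungbox

# Ljung-Box test on the training residuals: large p-values are consistent
# with white-noise residuals (no remaining autocorrelation).
print(acorr_ljungbox(sarima_res.resid, lags=[12, 24]))

# Combined residual diagnostics: standardized residuals, histogram,
# Q-Q plot, and correlogram.
sarima_res.plot_diagnostics(figsize=(10, 8))
plt.show()
```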
Comparing models is an iterative process. By systematically calculating metrics, reviewing information criteria, and visually inspecting forecasts on a held-out test set, you can make an informed decision about which model provides the most reliable and accurate predictions for your specific time series problem.