Let's put the evaluation techniques discussed in this chapter into practice. We'll use Python to calculate performance metrics and compare different forecasting models applied to a time series dataset. This hands-on exercise assumes you have already prepared your time series data and potentially fitted a couple of models (like an ARIMA and a SARIMA model from previous chapters) that you now wish to evaluate.
First, ensure you have the necessary libraries imported. We'll need pandas for data handling, statsmodels for potential model fitting (though we'll assume models are already fitted for this exercise), sklearn.metrics for calculating error metrics, and matplotlib or plotly for visualization.
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
import matplotlib.pyplot as plt
# Assume statsmodels is used for model fitting and forecasting:
# from statsmodels.tsa.arima.model import ARIMA
# from statsmodels.tsa.statespace.sarimax import SARIMAX
# Load your time series data (replace with your actual data loading)
# Example: data = pd.read_csv('your_time_series.csv', index_col='Date', parse_dates=True)
# For demonstration, let's create placeholder data and forecasts
dates = pd.date_range(start='2022-01-01', periods=100, freq='D')
data = pd.Series(np.random.randn(100).cumsum() + 50, index=dates)
# Assume you have fitted models from previous steps, e.g.:
# arima_fit = ARIMA(train_data, order=(p, d, q)).fit()
# sarima_fit = SARIMAX(train_data, order=(p, d, q), seasonal_order=(P, D, Q, m)).fit()
# Placeholder fitted models and forecasts for demonstration
# In a real scenario, these would come from your model fitting process
train_size = 80
train_data = data[:train_size]
test_data = data[train_size:]
# Placeholder forecasts - replace with your actual model predictions
arima_forecast = test_data + np.random.randn(len(test_data)) * 2 # Noisy forecast 1
sarima_forecast = test_data + np.random.randn(len(test_data)) * 1 # Noisy forecast 2 (better)
# Ensure forecasts are pandas Series with the same index as test_data
arima_forecast.index = test_data.index
sarima_forecast.index = test_data.index
# Placeholder AIC/BIC values (replace with actual values from model summaries)
arima_aic = 310.5
arima_bic = 320.1
sarima_aic = 305.2
sarima_bic = 318.8
print("Setup complete. Test data and placeholder forecasts are ready.")
print(f"Test data length: {len(test_data)}")
As covered earlier, splitting time series data requires care to maintain temporal order. We typically select a point in time and use all data before it for training and all data after it for testing. Our placeholder code above simulates this by defining train_data and test_data.
# Code for splitting was shown in the setup block
print("Original Data:")
print(data.head())
print("\nTraining Data Head:")
print(train_data.head())
print("\nTesting Data Head:")
print(test_data.head())
print(f"\nTraining size: {len(train_data)}, Test size: {len(test_data)}")
With your models fitted on the train_data, the next step is to generate forecasts for the time periods covered by test_data. The statsmodels library provides methods like .predict() or .forecast() for this. The start and end points for prediction should align with the index of your test_data.
# Placeholder forecasts were generated in the Setup section.
# In a real scenario, you would use your fitted models:
# arima_forecast = arima_fit.predict(start=test_data.index[0], end=test_data.index[-1])
# sarima_forecast = sarima_fit.predict(start=test_data.index[0], end=test_data.index[-1])
print("ARIMA Forecasts (Placeholder):")
print(arima_forecast.head())
print("\nSARIMA Forecasts (Placeholder):")
print(sarima_forecast.head())
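If you want to see the full round trip on the placeholder data itself, the sketch below fits a small ARIMA directly on train_data and forecasts the test horizon. The (1, 1, 1) order is arbitrary and chosen only for illustration, not tuned to this series.
from statsmodels.tsa.arima.model import ARIMA

# Fit an illustrative ARIMA(1, 1, 1) on the training portion
demo_fit = ARIMA(train_data, order=(1, 1, 1)).fit()

# Forecast as many steps ahead as there are test observations
demo_forecast = demo_fit.forecast(steps=len(test_data))
print(demo_forecast.head())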
Now we apply the metrics discussed earlier (MAE, MSE, RMSE, MAPE) to compare the forecasts against the actual values in test_data.
# Ensure actuals and forecasts align
actual_values = test_data
# Calculate metrics for the ARIMA model
arima_mae = mean_absolute_error(actual_values, arima_forecast)
arima_mse = mean_squared_error(actual_values, arima_forecast)
arima_rmse = np.sqrt(arima_mse) # Calculate RMSE from MSE
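# Note: sklearn's mean_absolute_percentage_error returns a fraction, so 0.05 means a 5% average error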
arima_mape = mean_absolute_percentage_error(actual_values, arima_forecast)
print("--- ARIMA Model Evaluation ---")
print(f"MAE: {arima_mae:.4f}")
print(f"MSE: {arima_mse:.4f}")
print(f"RMSE: {arima_rmse:.4f}")
print(f"MAPE: {arima_mape:.4f}")
# Calculate metrics for the SARIMA model
sarima_mae = mean_absolute_error(actual_values, sarima_forecast)
sarima_mse = mean_squared_error(actual_values, sarima_forecast)
sarima_rmse = np.sqrt(sarima_mse) # Calculate RMSE from MSE
sarima_mape = mean_absolute_percentage_error(actual_values, sarima_forecast)
print("\n--- SARIMA Model Evaluation ---")
print(f"MAE: {sarima_mae:.4f}")
print(f"MSE: {sarima_mse:.4f}")
print(f"RMSE: {sarima_rmse:.4f}")
print(f"MAPE: {sarima_mape:.4f}")
Interpreting the Metrics:
Based on these metrics, the SARIMA model appears to perform better than the ARIMA model on the test set, consistently showing lower error values across MAE, MSE, RMSE, and MAPE. (With our placeholder data this is expected, since the SARIMA forecast was generated with less noise; with real fitted models, the comparison would reflect genuine differences in forecast quality.)
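To keep the comparison readable, especially once you evaluate more than two candidate models, you can collect the metrics into a small table. This is a minimal sketch using the variables computed above; the layout is just one convenient choice.
# Side-by-side comparison of the error metrics
results = pd.DataFrame(
    {
        "ARIMA": [arima_mae, arima_mse, arima_rmse, arima_mape],
        "SARIMA": [sarima_mae, sarima_mse, sarima_rmse, sarima_mape],
    },
    index=["MAE", "MSE", "RMSE", "MAPE"],
)
print(results.round(4))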
AIC and BIC are calculated on the training data during the model fitting process. They help compare models by balancing goodness of fit with model complexity. Lower AIC and BIC values are generally preferred.
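As a reminder, both criteria combine the maximized likelihood L of the fitted model with a penalty on the number of estimated parameters k, where n is the number of training observations:
AIC = 2k - 2 ln(L)
BIC = k ln(n) - 2 ln(L)
BIC penalizes extra parameters more heavily than AIC whenever ln(n) > 2, which is why it tends to favor simpler models.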
# Access AIC/BIC (using placeholder values from Setup)
# In practice, get these from your model fit summary:
# arima_aic = arima_fit.aic
# arima_bic = arima_fit.bic
# sarima_aic = sarima_fit.aic
# sarima_bic = sarima_fit.bic
print("--- Information Criteria (from Training Fit) ---")
print(f"ARIMA - AIC: {arima_aic:.2f}, BIC: {arima_bic:.2f}")
print(f"SARIMA - AIC: {sarima_aic:.2f}, BIC: {sarima_bic:.2f}")
# Interpretation
if sarima_aic < arima_aic:
    print("\nSARIMA has a lower AIC, suggesting a better balance of fit and complexity.")
else:
    print("\nARIMA has a lower AIC, suggesting a better balance of fit and complexity.")
if sarima_bic < arima_bic:
    print("SARIMA has a lower BIC, suggesting a better balance of fit and complexity (with a stronger penalty for parameters).")
else:
    print("ARIMA has a lower BIC, suggesting a better balance of fit and complexity (with a stronger penalty for parameters).")
In our placeholder example, the SARIMA model also shows lower AIC and BIC values, aligning with the findings from the error metrics.
A plot comparing the actual values from the test set against the forecasts from different models provides an intuitive way to assess performance.
plt.figure(figsize=(12, 6))
plt.plot(train_data.index, train_data, label='Training Data', color='#adb5bd')
plt.plot(test_data.index, actual_values, label='Actual Values (Test)', color='#1c7ed6', linewidth=2)
plt.plot(arima_forecast.index, arima_forecast, label=f'ARIMA Forecast (RMSE: {arima_rmse:.2f})', color='#ff922b', linestyle='--')
plt.plot(sarima_forecast.index, sarima_forecast, label=f'SARIMA Forecast (RMSE: {sarima_rmse:.2f})', color='#51cf66', linestyle=':')
plt.title('Actual vs. Forecasted Values')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.grid(True, linestyle=':', alpha=0.6)
plt.tight_layout()
plt.show()
The plot visually confirms how well each model's forecast tracks the actual data points in the test period. You can see which forecast (ARIMA or SARIMA) generally stays closer to the actual blue line.
Here is a similar comparison using Plotly for an interactive visualization:
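A minimal sketch of how such a figure could be built with plotly.graph_objects follows; the trace names, dash styles, and layout settings are illustrative choices rather than fixed requirements.
import plotly.graph_objects as go

fig = go.Figure()
# Actual test values and the two placeholder forecasts as line traces
fig.add_trace(go.Scatter(x=test_data.index, y=actual_values, mode='lines',
                         name='Actual Values (Test)'))
fig.add_trace(go.Scatter(x=arima_forecast.index, y=arima_forecast, mode='lines',
                         name=f'ARIMA Forecast (RMSE: {arima_rmse:.2f})',
                         line=dict(dash='dash')))
fig.add_trace(go.Scatter(x=sarima_forecast.index, y=sarima_forecast, mode='lines',
                         name=f'SARIMA Forecast (RMSE: {sarima_rmse:.2f})',
                         line=dict(dash='dot')))
fig.update_layout(title='Actual vs. Forecasted Values (Test Period)',
                  xaxis_title='Date', yaxis_title='Value')
fig.show()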
Comparison of actual test data against placeholder ARIMA and SARIMA forecasts. Lower errors typically correspond to forecasts that track the actuals more closely.
In this practice session, we applied various techniques to evaluate time series forecasts. We calculated common error metrics (MAE, MSE, RMSE, MAPE), considered information criteria (AIC, BIC) obtained during model fitting, and visualized the performance by plotting forecasts against actual values. By synthesizing these different pieces of information, you can make an informed decision about which model provides the most accurate and reliable forecasts for your specific time series problem. Remember that model selection often involves trade-offs, and the "best" model depends on your specific goals and the characteristics of your data. Consistent evaluation is an essential part of the time series forecasting workflow.