While metrics like MAE, MSE, and RMSE tell us how well our model's forecasts match the actual values on the test set, they don't inherently penalize model complexity. A very complex model (e.g., an ARIMA model with many parameters) might fit the training data extremely well, perhaps even too well, leading to low errors on that data. However, this complexity might mean the model has captured noise rather than the underlying signal, resulting in poor generalization to new, unseen data. This phenomenon is known as overfitting.
Information criteria, like the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), offer a way to select models by balancing goodness of fit with model simplicity. They provide a relative measure of information lost when a given model is used to represent the process that generates the data.
The AIC was developed by Hirotugu Akaike and provides an estimate of the prediction error and thereby the relative quality of statistical models for a given set of data. AIC estimates the Kullback-Leibler divergence between the model and the true underlying process.
The formula for AIC is generally given as:
AIC = 2k − 2ln(L^)

Where:

- k is the number of parameters estimated by the model.
- L^ is the maximized value of the model's likelihood function.
The first term, 2k, penalizes the model for having more parameters. The more parameters a model has, the higher this term, and thus the higher the AIC value. The second term, −2ln(L^), rewards the model for goodness of fit. A higher likelihood value L^ (indicating a better fit to the data) results in a smaller value for this term, thus decreasing the AIC.
The goal is to find a model that minimizes the AIC. Lower AIC values suggest a better balance between model fit and complexity. It's important to remember that AIC values are relative; an AIC value for a specific model is not meaningful on its own but becomes useful when comparing different candidate models fitted to the same dataset.
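To make the trade-off concrete, here is a minimal sketch of how AIC could be computed by hand from a model's maximized log-likelihood. The helper function aic and the numbers below are purely illustrative, not part of any library.

def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2*ln(L^)."""
    return 2 * k - 2 * log_likelihood

# Illustrative (made-up) numbers: a simpler and a more complex model.
# The complex model fits slightly better (higher log-likelihood)...
print(aic(log_likelihood=-120.0, k=3))  # 246.0
print(aic(log_likelihood=-119.5, k=5))  # 249.0
# ...but its extra parameters outweigh the small gain in fit,
# so the simpler model has the lower (better) AIC.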
The Bayesian Information Criterion (BIC), also known as the Schwarz Criterion (SIC), is another criterion for model selection based on the likelihood function, but it imposes a stronger penalty for model complexity than AIC.
The formula for BIC is:
BIC = ln(n)·k − 2ln(L^)

Where:

- k is the number of parameters estimated by the model.
- L^ is the maximized value of the model's likelihood function.
- n is the number of observations (the sample size) used to fit the model.
Comparing BIC to AIC, you can see the penalty term for the number of parameters k is ln(n)·k instead of 2k. Since the natural logarithm of the sample size, ln(n), is greater than 2 for most practical time series (n > e² ≈ 7.4), BIC generally penalizes complex models more heavily than AIC. This often leads BIC to select simpler models compared to AIC.
Like AIC, BIC aims to balance fit and complexity, and the model with the lowest BIC value is preferred. It's also a relative measure used for comparing models fitted to the same data.
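A similar sketch, again with a hypothetical helper function, makes the stronger penalty visible: the per-parameter cost is ln(n) rather than 2, so it grows with the length of the series.

import numpy as np

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: ln(n)*k - 2*ln(L^)."""
    return np.log(n) * k - 2 * log_likelihood

# Per-parameter penalty: 2 for AIC vs. ln(n) for BIC.
for n in [10, 100, 1000]:
    print(f"n={n:4d}  AIC penalty per parameter: 2.00  "
          f"BIC penalty per parameter: {np.log(n):.2f}")
# n=  10 ... BIC penalty per parameter: 2.30
# n= 100 ... BIC penalty per parameter: 4.61
# n=1000 ... BIC penalty per parameter: 6.91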
When fitting ARIMA or SARIMA models using libraries like statsmodels in Python, the summary output usually includes the AIC and BIC values.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA

# Generate sample data for demonstration: an ARMA(1,1) process that is
# then integrated (cumulatively summed), so the resulting series needs
# one difference (d=1).
np.random.seed(42)
n_samples = 100
ar = np.r_[1, -0.6]  # AR lag polynomial: 1 - 0.6B
ma = np.r_[1, -0.4]  # MA lag polynomial: 1 - 0.4B
y = sm.tsa.arima_process.arma_generate_sample(ar=ar, ma=ma, nsample=n_samples)
y = np.cumsum(y)  # integrate to simulate a non-stationary series
time_series = pd.Series(y)

# Fit candidate ARIMA(p,1,q) models to the original (non-stationary) series.
# In practice, the candidate orders would come from ACF/PACF analysis of
# the differenced series.
model_111 = ARIMA(time_series, order=(1, 1, 1)).fit()
model_211 = ARIMA(time_series, order=(2, 1, 1)).fit()
model_112 = ARIMA(time_series, order=(1, 1, 2)).fit()
# Get AIC and BIC values
print("ARIMA(1,1,1):")
print(f" AIC: {model_111.aic:.2f}")
print(f" BIC: {model_111.bic:.2f}")
print("\nARIMA(2,1,1):")
print(f" AIC: {model_211.aic:.2f}")
print(f" BIC: {model_211.bic:.2f}")
print("\nARIMA(1,1,2):")
print(f" AIC: {model_112.aic:.2f}")
print(f" BIC: {model_112.bic:.2f}")
# --- Example Output ---
# ARIMA(1,1,1):
# AIC: 278.20
# BIC: 285.99
#
# ARIMA(2,1,1):
# AIC: 279.98
# BIC: 290.36
#
# ARIMA(1,1,2):
# AIC: 280.02
# BIC: 290.40
In this hypothetical example, the ARIMA(1,1,1) model has the lowest AIC and BIC values among the three candidates. This suggests it provides the best balance of fit and parsimony for this specific dataset, according to these criteria.
We can visualize this comparison:
Comparison of AIC and BIC values for three candidate ARIMA models. Lower values indicate a better trade-off between model fit and complexity.
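A simple grouped bar chart is one way to produce such a figure. The sketch below uses matplotlib and the hypothetical AIC/BIC values from the example output above.

import matplotlib.pyplot as plt
import numpy as np

models = ["ARIMA(1,1,1)", "ARIMA(2,1,1)", "ARIMA(1,1,2)"]
aic_values = [278.20, 279.98, 280.02]  # from the example output above
bic_values = [285.99, 290.36, 290.40]

x = np.arange(len(models))
width = 0.35

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(x - width / 2, aic_values, width, label="AIC")
ax.bar(x + width / 2, bic_values, width, label="BIC")
ax.set_xticks(x)
ax.set_xticklabels(models)
ax.set_ylabel("Criterion value")
ax.set_title("AIC and BIC for candidate ARIMA models (lower is better)")
ax.legend()
plt.tight_layout()
plt.show()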
There's no definitive answer on whether AIC or BIC is universally "better." AIC is oriented toward minimizing prediction error and tends to favor slightly larger models, while BIC's stronger penalty makes it more likely to select a simpler model and, when the true model is among the candidates, to identify it as the sample size grows.
In practice, it's often useful to look at both AIC and BIC. If they agree on the best model, it increases confidence in the choice. If they disagree, it highlights the trade-off: the AIC-preferred model might offer slightly better fit, while the BIC-preferred model is more parsimonious. The final choice might depend on your specific goals and potentially further analysis, such as examining the residuals (covered in previous chapters) and evaluating forecast accuracy on a hold-out set (using metrics like MAE, RMSE).
Remember, AIC and BIC are tools to guide model selection, particularly helpful when comparing multiple plausible ARIMA/SARIMA orders identified through ACF/PACF analysis. They should be used alongside, not instead of, residual diagnostics and out-of-sample forecast evaluation.
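As a rough sketch of that complementary check, assuming the time_series and candidate orders from the earlier example, the candidates can be refit on a training portion and compared on a small hold-out set:

# Hold out the last 20 observations for out-of-sample evaluation.
train, test = time_series[:-20], time_series[-20:]

for order in [(1, 1, 1), (2, 1, 1), (1, 1, 2)]:
    fit = ARIMA(train, order=order).fit()
    forecast = fit.forecast(steps=len(test))
    rmse = np.sqrt(np.mean((test.values - forecast.values) ** 2))
    print(f"ARIMA{order}: AIC={fit.aic:.2f}  hold-out RMSE={rmse:.3f}")

If the model preferred by AIC/BIC also achieves competitive hold-out error and shows well-behaved residuals, the case for selecting it is much stronger.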