Okay, you've successfully fitted an ARIMA model to your time series data using statsmodels. That's a significant step! However, before you rush off to generate forecasts, it's essential to pause and evaluate how well your model actually fits the data. Fitting a model is relatively easy; ensuring it's a good model requires some diagnostic work. This is where residual analysis comes in.
Residuals are simply the differences between the observed values and the values predicted by your model for the training data. For a time t, the residual e_t is calculated as:

e_t = y_t − ŷ_t

where y_t is the actual observed value at time t, and ŷ_t is the value fitted by your ARIMA model at time t.
Think of residuals as the "leftovers" – the part of the data that the model couldn't explain. If your ARIMA model has effectively captured the underlying structure (the autoregressive and moving average components, and addressed non-stationarity through differencing), the residuals should ideally behave like random noise. Specifically, they should resemble white noise.
A well-fitted ARIMA model should produce residuals that satisfy these conditions:

- Zero mean: the residuals should be centered around zero, with no systematic bias.
- Constant variance: the spread of the residuals should not change over time.
- No autocorrelation: residuals should be uncorrelated with their own past values.
- Approximate normality: not strictly required for point forecasts, but normally distributed residuals make prediction intervals more reliable.
The statsmodels library makes performing these diagnostic checks straightforward. When you fit an ARIMA model, the results object contains valuable information and methods for diagnostics.
Let's assume you have fitted a model like this:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA
# Assume 'data' is your pandas Series (stationary or differenced)
# Assume 'p', 'd', 'q' are your chosen orders
# Example fitting process (replace with your actual data and orders)
# model = ARIMA(your_endog_data, order=(p, d, q)) # Use if d > 0
# model_fit = model.fit()
# Let's create some dummy results for demonstration
# Replace this with your actual model fitting
np.random.seed(42)
# Note: arma_generate_sample expects full lag polynomials, including the zero-lag 1
dummy_data = sm.tsa.arma_generate_sample(ar=[1, -0.75], ma=[1], nsample=200)
dummy_series = pd.Series(dummy_data)
model = ARIMA(dummy_series, order=(1, 0, 0)) # Fit an AR(1)
model_fit = model.fit()
# The model_fit object now holds the results and diagnostics tools
print(model_fit.summary())
The model_fit.summary() output itself provides some initial diagnostic information, often including:

- Ljung-Box test (Prob(Q)): tests the residuals for autocorrelation. The p-value should ideally be greater than 0.05, suggesting no significant autocorrelation at that lag.
- Jarque-Bera test (Prob(JB)): tests the residuals for normality. The p-value should ideally be greater than 0.05, supporting the null hypothesis that residuals are normally distributed.

While the summary is useful, visual diagnostics are often more informative. statsmodels provides a convenient method called plot_diagnostics.
import matplotlib.pyplot as plt
# Generate standard diagnostic plots
fig = model_fit.plot_diagnostics(figsize=(12, 8))
plt.tight_layout() # Adjust layout to prevent overlap
plt.show()
This command typically generates a 2x2 grid of plots:

- Standardized Residuals Plot: shows the residuals plotted over time. Look for: no obvious trends or patterns, and a roughly constant spread around zero.
- Histogram plus Estimated Density: shows the distribution of the residuals compared to a Normal distribution (often with a KDE - Kernel Density Estimate). Look for: a roughly bell-shaped histogram whose density estimate closely tracks the N(0, 1) reference curve.
- Normal Q-Q Plot: compares the quantiles of the residual distribution to the quantiles of a standard normal distribution. Look for: points falling close to the reference line; systematic departures at the ends indicate heavy tails or skew.
- Correlogram (ACF Plot): shows the Autocorrelation Function (ACF) of the residuals. Look for: no significant spikes outside the confidence bands at any lag beyond lag 0.
If the diagnostics reveal problems, common remedies include:

- If there is a significant spike at lag k in the ACF/PACF of the residuals, try increasing the AR or MA order corresponding to that lag (e.g., add an AR(k) or MA(k) term).
- If significant spikes appear at seasonal lags (m, 2m, etc.), you likely need a SARIMA model (covered in the next chapter).

Model diagnostics is an iterative process. You fit a model, check the residuals, and if they indicate problems, you adjust the model (e.g., change the order (p, d, q), apply transformations) and repeat the process until the residuals appear sufficiently random and structureless. Only then can you have confidence in the forecasts generated by your model.
© 2025 ApX Machine Learning