Alright, let's put the theory into practice. We'll walk through a complete example of building an ARIMA model, starting with data preparation and ending with generating forecasts. We'll use a synthetic dataset generated to follow a known ARIMA process, which helps verify our steps.
First, ensure you have the necessary libraries installed and import them:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller
# Configure plot style (optional)
plt.style.use('seaborn-v0_8-whitegrid')
For this exercise, we'll generate data that follows an ARIMA(1,1,1) process. This means the first difference of the series follows an ARMA(1,1) process. Knowing the underlying process allows us to check if our analysis leads us back to the correct model structure.
# Define ARIMA(1,1,1) parameters
np.random.seed(42) # for reproducibility
n_sample = 250
ar_params = np.array([0.6]) # AR(1) coefficient for the ARMA part
ma_params = np.array([0.4]) # MA(1) coefficient for the ARMA part
ar = np.r_[1, -ar_params] # Lag polynomial representation for AR
ma = np.r_[1, ma_params] # Lag polynomial representation for MA
# Generate the stationary ARMA(1,1) component
arma_process = ArmaProcess(ar, ma)
arma_data = arma_process.generate_sample(nsample=n_sample)
# Integrate the ARMA data to get ARIMA(1,1,1)
# Integration (d=1) is achieved by taking the cumulative sum
y = np.cumsum(arma_data) + 100 # Add a constant to simulate a starting level
# Create a pandas Series with a DatetimeIndex
timestamps = pd.date_range(start='2021-01-01', periods=n_sample, freq='D')
ts_data = pd.Series(y, index=timestamps, name='Value')
print("First 5 data points:\n", ts_data.head())
Let's visualize the generated series and check for stationarity.
# Plot the original time series
plt.figure(figsize=(10, 4))
plt.plot(ts_data, color='#1c7ed6')
plt.title('Generated ARIMA(1,1,1) Time Series')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
The generated time series data exhibits fluctuations but doesn't show a clear constant mean, suggesting non-stationarity.
The plot shows a wandering behavior, characteristic of a random walk or integrated process. Let's confirm this with the Augmented Dickey-Fuller (ADF) test. The null hypothesis (H0) for the ADF test is that the series has a unit root (is non-stationary).
# Perform ADF test on the original series
adf_result = adfuller(ts_data)
print(f'ADF Statistic: {adf_result[0]:.4f}')
print(f'p-value: {adf_result[1]:.4f}')
print('Critical Values:')
for key, value in adf_result[4].items():
print(f'\t{key}: {value:.4f}')
if adf_result[1] > 0.05:
print("\nResult: The series is likely non-stationary (fail to reject H0).")
else:
print("\nResult: The series is likely stationary (reject H0).")
The ADF test result likely shows a high p-value (e.g., > 0.05), confirming our visual suspicion of non-stationarity. We need to difference the series to make it stationary. Since we suspect d=1 (based on how we generated the data and the plot), let's take the first difference.
# Calculate the first difference
ts_diff = ts_data.diff().dropna() # dropna() removes the first NaN value
# Plot the differenced series
plt.figure(figsize=(10, 4))
plt.plot(ts_diff, color='#40c057')
plt.title('First Difference of the Time Series')
plt.xlabel('Date')
plt.ylabel('Differenced Value')
plt.show()
# Perform ADF test on the differenced series
adf_result_diff = adfuller(ts_diff)
print("\nADF Test on Differenced Series:")
print(f'ADF Statistic: {adf_result_diff[0]:.4f}')
print(f'p-value: {adf_result_diff[1]:.4f}')
if adf_result_diff[1] <= 0.05:
print("Result: The differenced series is likely stationary (reject H0).")
else:
print("Result: The differenced series is likely non-stationary (fail to reject H0).")
The plot of the differenced series should now look more stationary, fluctuating around a constant mean (near zero) with constant variance. The ADF test on ts_diff
should yield a very small p-value (e.g., < 0.01), strongly suggesting stationarity. This confirms that d=1 is appropriate for our ARIMA model.
Now we analyze the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots of the stationary differenced series (ts_diff
) to estimate the AR order (p) and MA order (q).
# Plot ACF and PACF of the differenced series
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# ACF Plot
plot_acf(ts_diff, ax=axes[0], lags=20, color='#228be6', vlines_kwargs={"colors": '#228be6'})
axes[0].set_title('Autocorrelation Function (ACF)')
axes[0].set_xlabel('Lag')
axes[0].set_ylabel('ACF')
# PACF Plot
plot_pacf(ts_diff, ax=axes[1], lags=20, method='ywm', color='#15aabf', vlines_kwargs={"colors": '#15aabf'})
axes[1].set_title('Partial Autocorrelation Function (PACF)')
axes[1].set_xlabel('Lag')
axes[1].set_ylabel('PACF')
plt.tight_layout()
plt.show()
Interpreting the Plots:
Based on this analysis, a plausible model order is (p,d,q)=(1,1,1). This matches the parameters we used to generate the data.
Now we fit an ARIMA(1,1,1) model to the original time series data (ts_data
). The statsmodels
library handles the differencing internally when you specify d=1.
# Define the ARIMA model order
p, d, q = 1, 1, 1
# Create and fit the ARIMA model
# Note: We fit on the original ts_data, specifying d=1
model = ARIMA(ts_data, order=(p, d, q))
model_fit = model.fit()
# Print the model summary
print(model_fit.summary())
The summary provides a wealth of information:
ar_params
(0.6) and ma_params
(0.4) we used for generation. They should be reasonably close.A good ARIMA model should have residuals that resemble white noise: zero mean, constant variance, and no autocorrelation.
# Get model residuals
residuals = model_fit.resid
# Plot residuals
plt.figure(figsize=(10, 4))
plt.plot(residuals, color='#be4bdb')
plt.title('ARIMA(1,1,1) Model Residuals')
plt.xlabel('Date')
plt.ylabel('Residual Value')
plt.show()
# Plot ACF and PACF of residuals
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(residuals, ax=axes[0], lags=20, color='#f76707', vlines_kwargs={"colors": '#f76707'})
axes[0].set_title('Residuals ACF')
plot_pacf(residuals, ax=axes[1], lags=20, method='ywm', color='#f59f00', vlines_kwargs={"colors": '#f59f00'})
axes[1].set_title('Residuals PACF')
plt.tight_layout()
plt.show()
The residuals plot should hover around zero without obvious patterns. The ACF and PACF plots of the residuals should ideally show no significant spikes outside the confidence intervals for lags > 0. The Ljung-Box test result (Prob(Q)) in the model summary should have a p-value greater than 0.05, indicating no significant autocorrelation left in the residuals. The Jarque-Bera test (Prob(JB)) checks normality; if the p-value is low, the residuals might not be normally distributed, but ARIMA can sometimes be robust to mild deviations.
If the diagnostics look good, we can proceed to forecasting. If not, you might need to reconsider the model order (p, d, q) or check if other factors (like seasonality, discussed in the next chapter) are present.
Let's use the fitted model to make predictions. We can generate forecasts for future time points beyond the end of our original data.
# Define the number of steps to forecast ahead
n_forecast_steps = 30
# Generate forecasts
forecast_result = model_fit.get_forecast(steps=n_forecast_steps)
forecast_mean = forecast_result.predicted_mean
forecast_ci = forecast_result.conf_int(alpha=0.05) # 95% confidence interval
# Create date index for the forecast period
last_date = ts_data.index[-1]
forecast_index = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=n_forecast_steps, freq='D')
# Combine original data and forecast for plotting
plt.figure(figsize=(12, 5))
plt.plot(ts_data.index, ts_data, label='Observed Data', color='#495057')
plt.plot(forecast_index, forecast_mean, label='Forecast', color='#f03e3e')
plt.fill_between(forecast_index,
forecast_ci.iloc[:, 0], # Lower CI bound
forecast_ci.iloc[:, 1], # Upper CI bound
color='#ffc9c9', alpha=0.5, label='95% Confidence Interval')
plt.title('ARIMA(1,1,1) Forecast')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.show()
Forecasted values using the ARIMA(1,1,1) model, along with the 95% confidence interval, extending beyond the observed data period.
The plot displays the original data followed by the forecasted values. Notice how the confidence interval widens as the forecast horizon increases, reflecting growing uncertainty about the future. For an ARIMA(p,d,q) model with d≥1, long-term forecasts will tend towards a constant level (if d=0) or follow a linear trend (if d=1 and a constant/drift term is included).
This concludes our hands-on practice for building a basic ARIMA model. You've seen the workflow: check stationarity, difference if needed, use ACF/PACF to identify p and q, fit the model, check residuals, and finally generate forecasts. Remember that real-world data is often more complex, potentially requiring consideration of seasonality (covered next) or other external factors.
© 2025 ApX Machine Learning