Alright, let's put the theory into practice. In this section, we'll walk through the process of building a SARIMA model using Python. We'll use a classic dataset that exhibits strong seasonality to demonstrate the steps involved, from initial data inspection and preparation to model fitting, diagnostics, and forecasting.
This hands-on exercise assumes you have a working Python environment with libraries like pandas
, numpy
, matplotlib
, and statsmodels
installed.
First, let's import the necessary libraries and load our dataset. We'll use the well-known "Airline Passengers" dataset, which contains monthly totals of international airline passengers from 1949 to 1960. This dataset is readily available through statsmodels
.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
# Load the dataset
# Note: You might need to install statsmodels datasets if you haven't already
# Example using a common source or assuming data is loaded into a DataFrame 'df'
# For demonstration, let's assume we have a DataFrame `df` with a 'Date' column
# and a 'Passengers' column.
# Example data loading (replace with your actual data loading)
data = pd.read_csv('https://raw.githubusercontent.com/AileenNielsen/TimeSeriesAnalysisWithPython/master/data/AirPassengers.csv',
parse_dates=['Month'], index_col='Month')
data.rename(columns={'#Passengers': 'Passengers'}, inplace=True)
df = data.copy()
print(df.head())
print(f"\nData shape: {df.shape}")
# Visualize the data
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Passengers'], label='Monthly Passengers')
plt.title('Airline Passengers over Time')
plt.xlabel('Date')
plt.ylabel('Number of Passengers')
plt.legend()
plt.grid(True)
plt.show();
The plot clearly shows an upward trend (more passengers over time) and a strong seasonal pattern (peaks and troughs repeating annually). The increasing variance over time also suggests a multiplicative seasonality, though we'll proceed with standard differencing for this example.
As discussed in Chapter 2, ARIMA-based models require the time series to be stationary. Let's check for stationarity using the Augmented Dickey-Fuller (ADF) test.
# ADF Test Function
def adf_test(series, title=''):
"""Performs ADF test and prints results"""
print(f'Augmented Dickey-Fuller Test: {title}')
result = adfuller(series.dropna(), autolag='AIC') # .dropna() handles differenced series
labels = ['ADF Statistic', 'p-value', '# Lags Used', 'Number of Observations Used']
out = pd.Series(result[0:4], index=labels)
for key, val in result[4].items():
out[f'Critical Value ({key})'] = val
print(out.to_string())
if result[1] <= 0.05:
print("=> Conclusion: Reject the null hypothesis (Data is stationary)")
else:
print("=> Conclusion: Fail to reject the null hypothesis (Data is non-stationary)")
# Test original data
adf_test(df['Passengers'], 'Original Data')
The ADF test on the original data will likely show a high p-value (>> 0.05), indicating non-stationarity. We need differencing. Given the trend and seasonality, we'll likely need both non-seasonal differencing (d=1) and seasonal differencing (D=1). The seasonal period m for monthly data is 12.
# Apply non-seasonal differencing (d=1)
df['Passengers_diff'] = df['Passengers'].diff(1)
# Apply seasonal differencing (D=1, m=12) on the non-seasonally differenced data
df['Passengers_diff_seasonal'] = df['Passengers_diff'].diff(12)
# Plot the differenced data
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Passengers_diff_seasonal'], label='Differenced (d=1, D=1, m=12)')
plt.title('Differenced Airline Passengers Data')
plt.xlabel('Date')
plt.ylabel('Differenced Value')
plt.legend()
plt.grid(True)
plt.show();
# Test the differenced data for stationarity
adf_test(df['Passengers_diff_seasonal'], 'Seasonally Differenced Data (d=1, D=1, m=12)')
After applying both first-order non-seasonal differencing (d=1) and first-order seasonal differencing (D=1,m=12), the plot should look much more stationary (mean-reverting around zero), and the ADF test should now yield a p-value less than or equal to 0.05, confirming stationarity. This suggests we'll likely use d=1 and D=1 in our SARIMA model.
Now, we plot the ACF and PACF of the stationary (differenced) series to help determine the non-seasonal orders (p,q) and seasonal orders (P,Q).
# Plot ACF and PACF on the differenced data
fig, axes = plt.subplots(1, 2, figsize=(16, 5))
# ACF Plot
plot_acf(df['Passengers_diff_seasonal'].dropna(), ax=axes[0], lags=40)
axes[0].set_title('ACF of Differenced Data')
# PACF Plot
plot_pacf(df['Passengers_diff_seasonal'].dropna(), ax=axes[1], lags=40, method='ols') # 'ols' method recommended for PACF
axes[1].set_title('PACF of Differenced Data')
plt.tight_layout()
plt.show()
Interpreting these plots:
Non-Seasonal Orders (p, q): Look at the first few lags (1, 2, 3...).
Seasonal Orders (P, Q): Look at the lags corresponding to the seasonal period (12, 24, 36...).
Based on this analysis, a reasonable starting point for our model is SARIMA(p=1,d=1,q=1)(P=1,D=1,Q=1)m=12ā.
We can now fit the SARIMA model using statsmodels
. Remember to fit the model to the original data, specifying the differencing orders (d and D) within the model parameters.
# Define model order
# Non-seasonal order (p, d, q)
order = (1, 1, 1)
# Seasonal order (P, D, Q, m)
seasonal_order = (1, 1, 1, 12)
# Create and fit the SARIMA model
# Note: Using enforce_stationarity=False and enforce_invertibility=False
# can sometimes help convergence, but use with caution. Default is True.
model = SARIMAX(df['Passengers'],
order=order,
seasonal_order=seasonal_order,
enforce_stationarity=False,
enforce_invertibility=False)
results = model.fit(disp=False) # disp=False hides convergence messages
# Print model summary
print(results.summary())
The summary provides valuable information, including coefficient estimates, standard errors, p-values, and information criteria like AIC and BIC. Check if the coefficients are statistically significant (p-value < 0.05).
It's important to check if the model residuals approximate white noise. statsmodels
provides a convenient plotting function for this.
# Plot diagnostics
results.plot_diagnostics(figsize=(12, 8))
plt.tight_layout()
plt.show()
Examine the diagnostic plots:
If the diagnostics look reasonable (residuals appear to be white noise), the model might be adequate. If not, you might need to revisit the model orders (p,d,q,P,D,Q) or consider other model variations.
Finally, let's use our fitted model to forecast future values. We'll forecast the next 36 months (3 years).
# Define forecast steps
forecast_steps = 36
# Get forecast object
forecast_obj = results.get_forecast(steps=forecast_steps)
# Extract predicted mean values
forecast_mean = forecast_obj.predicted_mean
# Extract confidence intervals
confidence_intervals = forecast_obj.conf_int(alpha=0.05) # 95% confidence interval
# Create date range for the forecast
last_date = df.index[-1]
forecast_index = pd.date_range(start=last_date + pd.DateOffset(months=1), periods=forecast_steps, freq='MS') # 'MS' for month start frequency
# --- Create Plotly Chart ---
# Prepare data for Plotly
plot_data = [
{"x": df.index.strftime('%Y-%m-%d').tolist(), "y": df['Passengers'].tolist(), "mode": "lines", "name": "Observed", "line": {"color": "#228be6"}},
{"x": forecast_index.strftime('%Y-%m-%d').tolist(), "y": forecast_mean.tolist(), "mode": "lines", "name": "Forecast", "line": {"color": "#f03e3e", "dash": "dash"}},
{"x": forecast_index.strftime('%Y-%m-%d').tolist() + forecast_index.strftime('%Y-%m-%d').tolist()[::-1], # x coordinates for fill
"y": confidence_intervals.iloc[:, 1].tolist() + confidence_intervals.iloc[:, 0].tolist()[::-1], # Upper bound + Lower bound reversed
"fill": "toself", "fillcolor": "rgba(250, 82, 82, 0.2)", "line": {"color": "rgba(255, 255, 255, 0)"}, "name": "95% Confidence Interval", "hoverinfo": "skip"} # Light red fill, no border line
]
plot_layout = {
"title": "Airline Passengers Forecast with SARIMA",
"xaxis": {"title": "Date"},
"yaxis": {"title": "Number of Passengers"},
"legend": {"x": 0.01, "y": 0.99},
"hovermode": "x unified"
}
```plotly
{"layout": {"title": "Airline Passengers Forecast with SARIMA", "xaxis": {"title": "Date", "type": "date"}, "yaxis": {"title": "Number of Passengers"}, "legend": {"x": 0.01, "y": 0.99}, "hovermode": "x unified"}, "data": [{"x": ["1949-01-01", "1949-02-01", "1949-03-01", "1949-04-01", "1949-05-01", "1949-06-01", "1949-07-01", "1949-08-01", "1949-09-01", "1949-10-01", "1949-11-01", "1949-12-01", "1950-01-01", "1950-02-01", "1950-03-01", "1950-04-01", "1950-05-01", "1950-06-01", "1950-07-01", "1950-08-01", "1950-09-01", "1950-10-01", "1950-11-01", "1950-12-01", "1951-01-01", "1951-02-01", "1951-03-01", "1951-04-01", "1951-05-01", "1951-06-01", "1951-07-01", "1951-08-01", "1951-09-01", "1951-10-01", "1951-11-01", "1951-12-01", "1952-01-01", "1952-02-01", "1952-03-01", "1952-04-01", "1952-05-01", "1952-06-01", "1952-07-01", "1952-08-01", "1952-09-01", "1952-10-01", "1952-11-01", "1952-12-01", "1953-01-01", "1953-02-01", "1953-03-01", "1953-04-01", "1953-05-01", "1953-06-01", "1953-07-01", "1953-08-01", "1953-09-01", "1953-10-01", "1953-11-01", "1953-12-01", "1954-01-01", "1954-02-01", "1954-03-01", "1954-04-01", "1954-05-01", "1954-06-01", "1954-07-01", "1954-08-01", "1954-09-01", "1954-10-01", "1954-11-01", "1954-12-01", "1955-01-01", "1955-02-01", "1955-03-01", "1955-04-01", "1955-05-01", "1955-06-01", "1955-07-01", "1955-08-01", "1955-09-01", "1955-10-01", "1955-11-01", "1955-12-01", "1956-01-01", "1956-02-01", "1956-03-01", "1956-04-01", "1956-05-01", "1956-06-01", "1956-07-01", "1956-08-01", "1956-09-01", "1956-10-01", "1956-11-01", "1956-12-01", "1957-01-01", "1957-02-01", "1957-03-01", "1957-04-01", "1957-05-01", "1957-06-01", "1957-07-01", "1957-08-01", "1957-09-01", "1957-10-01", "1957-11-01", "1957-12-01", "1958-01-01", "1958-02-01", "1958-03-01", "1958-04-01", "1958-05-01", "1958-06-01", "1958-07-01", "1958-08-01", "1958-09-01", "1958-10-01", "1958-11-01", "1958-12-01", "1959-01-01", "1959-02-01", "1959-03-01", "1959-04-01", "1959-05-01", "1959-06-01", "1959-07-01", "1959-08-01", "1959-09-01", "1959-10-01", "1959-11-01", "1959-12-01", "1960-01-01", "1960-02-01", "1960-03-01", "1960-04-01", "1960-05-01", "1960-06-01", "1960-07-01", "1960-08-01", "1960-09-01", "1960-10-01", "1960-11-01", "1960-12-01"], "y": [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118, 115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140, 145, 150, 178, 163, 172, 178, 199, 199, 184, 162, 146, 166, 171, 180, 193, 181, 183, 218, 230, 242, 209, 191, 172, 194, 196, 196, 236, 235, 229, 243, 264, 272, 237, 211, 180, 201, 204, 188, 235, 227, 234, 264, 302, 293, 259, 229, 203, 229, 242, 233, 267, 269, 270, 315, 364, 347, 312, 274, 237, 278, 284, 277, 317, 313, 318, 374, 413, 405, 355, 306, 271, 306, 315, 301, 356, 348, 355, 422, 465, 467, 404, 347, 305, 336, 340, 318, 362, 348, 363, 435, 491, 505, 404, 359, 310, 337, 360, 342, 406, 396, 420, 472, 548, 559, 463, 407, 362, 405, 417, 391, 419, 461, 472, 535, 622, 606, 508, 461, 390, 432], "mode": "lines", "name": "Observed", "line": {"color": "#228be6"}}, {"x": ["1961-01-01", "1961-02-01", "1961-03-01", "1961-04-01", "1961-05-01", "1961-06-01", "1961-07-01", "1961-08-01", "1961-09-01", "1961-10-01", "1961-11-01", "1961-12-01", "1962-01-01", "1962-02-01", "1962-03-01", "1962-04-01", "1962-05-01", "1962-06-01", "1962-07-01", "1962-08-01", "1962-09-01", "1962-10-01", "1962-11-01", "1962-12-01", "1963-01-01", "1963-02-01", "1963-03-01", "1963-04-01", "1963-05-01", "1963-06-01", "1963-07-01", "1963-08-01", "1963-09-01", "1963-10-01", "1963-11-01", "1963-12-01"], "y": [448.0107, 421.0543, 459.0079, 496.8253, 506.9686, 573.1011, 665.2013, 651.1866, 553.9273, 498.9618, 424.4382, 469.9089, 486.4311, 459.4636, 497.4032, 535.2090, 545.3428, 611.4669, 703.5612, 689.5426, 592.2800, 537.3121, 462.7870, 508.2568, 524.7783, 497.8102, 535.7493, 573.5547, 583.6882, 649.8120, 741.9059, 727.8871, 630.6242, 575.6560, 501.1308, 546.5999], "mode": "lines", "name": "Forecast", "line": {"color": "#f03e3e", "dash": "dash"}}, {"x": ["1961-01-01", "1961-02-01", "1961-03-01", "1961-04-01", "1961-05-01", "1961-06-01", "1961-07-01", "1961-08-01", "1961-09-01", "1961-10-01", "1961-11-01", "1961-12-01", "1962-01-01", "1962-02-01", "1962-03-01", "1962-04-01", "1962-05-01", "1962-06-01", "1962-07-01", "1962-08-01", "1962-09-01", "1962-10-01", "1962-11-01", "1962-12-01", "1963-01-01", "1963-02-01", "1963-03-01", "1963-04-01", "1963-05-01", "1963-06-01", "1963-07-01", "1963-08-01", "1963-09-01", "1963-10-01", "1963-11-01", "1963-12-01", "1963-12-01", "1963-11-01", "1963-10-01", "1963-09-01", "1963-08-01", "1963-07-01", "1963-06-01", "1963-05-01", "1963-04-01", "1963-03-01", "1963-02-01", "1963-01-01", "1962-12-01", "1962-11-01", "1962-10-01", "1962-09-01", "1962-08-01", "1962-07-01", "1962-06-01", "1962-05-01", "1962-04-01", "1962-03-01", "1962-02-01", "1962-01-01", "1961-12-01", "1961-11-01", "1961-10-01", "1961-09-01", "1961-08-01", "1961-07-01", "1961-06-01", "1961-05-01", "1961-04-01", "1961-03-01", "1961-02-01", "1961-01-01"], "y": [484.9134, 463.1606, 506.1778, 549.0953, 563.3422, 633.7900, 730.4556, 720.0402, 626.1520, 574.5574, 503.3540, 551.9914, 571.0951, 545.8782, 585.2252, 624.2841, 635.5849, 702.8583, 796.1106, 783.2313, 687.0546, 633.1189, 559.5742, 605.9776, 623.0591, 596.6416, 635.0931, 673.3903, 684.0015, 750.5905, 843.1487, 829.5942, 732.7961, 678.2917, 604.1970, 650.0358, 443.1641, 398.0646, 472.8203, 528.0522, 626.1630, 646.0474, 578.4535, 552.1759, 513.4189, 526.8964, 507.0188, 466.4975, 410.5360, 369.5996, 441.4553, 460.9056, 605.3620, 616.1449, 549.1338, 499.7476, 455.1823, 461.5732, 478.7712, 452.6269, 410.1082, 376.1368, 423.0761, 418.3862, 399.9566, 431.8380, 444.7253, 449.5950, 512.4123, 599.9469, 582.3331, 481.7027, 423.3663, 345.5223, 387.8264], "fill": "toself", "fillcolor": "rgba(250, 82, 82, 0.2)", "line": {"color": "rgba(255, 255, 255, 0)"}, "name": "95% Confidence Interval", "hoverinfo": "skip"}]}
Observed airline passenger counts, SARIMA forecast for the next 36 months, and the corresponding 95% confidence interval.
The plot displays the original data along with the model's forecast and the uncertainty associated with it (confidence intervals typically widen further into the future). The forecast should visually continue the trend and seasonal patterns observed in the historical data.
In this practice section, we successfully built a SARIMA model for a seasonal time series. The key steps included:
SARIMAX
model using statsmodels
.Remember that finding the optimal SARIMA model often involves some iteration. You might try slightly different orders based on the diagnostics or information criteria (like AIC/BIC from the model summary) and compare their performance, which is the focus of the next chapter on model evaluation.
Ā© 2025 ApX Machine Learning