这里逐步介绍使用Python构建SARIMA模型的过程。将使用一个具有明显季节性的经典数据集来说明相关步骤,包括初步数据检查和准备、模型拟合、诊断和预测。本动手练习假定您已拥有一个可用的Python环境,并安装了pandas、numpy、matplotlib和statsmodels等库。1. 设置和数据加载首先,让我们导入所需的库并加载数据集。我们将使用著名的“航空乘客”数据集,它包含1949年至1960年国际航空乘客的月度总数。该数据集可通过statsmodels方便获取。import pandas as pd import numpy as np import matplotlib.pyplot as plt from statsmodels.tsa.statespace.sarimax import SARIMAX from statsmodels.graphics.tsaplots import plot_acf, plot_pacf from statsmodels.tsa.stattools import adfuller from statsmodels.tsa.seasonal import seasonal_decompose # 加载数据集 # 注意:如果尚未安装,您可能需要安装statsmodels数据集 # 示例使用常见来源或假设数据已加载到DataFrame 'df'中 # 为了演示,我们假设有一个DataFrame `df`,其中包含“Date”列 # 和“Passengers”列。 # 示例数据加载(请替换为您的实际数据加载) data = pd.read_csv('https://raw.githubusercontent.com/AileenNielsen/TimeSeriesAnalysisWithPython/master/data/AirPassengers.csv', parse_dates=['Month'], index_col='Month') data.rename(columns={'#Passengers': 'Passengers'}, inplace=True) df = data.copy() print(df.head()) print(f"\n数据形状: {df.shape}") # 可视化数据 plt.figure(figsize=(12, 6)) plt.plot(df.index, df['Passengers'], label='每月乘客数') plt.title('航空乘客随时间变化') plt.xlabel('日期') plt.ylabel('乘客数量') plt.legend() plt.grid(True) plt.show();该图清楚地显示了上升趋势(随时间乘客增多)和明显的季节性模式(峰谷每年重复)。随时间增加的方差也表明存在乘法季节性,尽管本例我们将继续使用标准差分。2. 平稳性检查和差分如第2章所述,基于ARIMA的模型要求时间序列是平稳的。我们使用增广迪基-富勒(ADF)检验来检查平稳性。# ADF检验函数 def adf_test(series, title=''): """执行ADF检验并打印结果""" print(f'增广迪基-富勒检验: {title}') result = adfuller(series.dropna(), autolag='AIC') # .dropna() 处理差分序列 labels = ['ADF统计量', 'p值', '使用的滞后数', '使用的观察值数量'] out = pd.Series(result[0:4], index=labels) for key, val in result[4].items(): out[f'临界值 ({key})'] = val print(out.to_string()) if result[1] <= 0.05: print("=> 结论:拒绝原假设(数据是平稳的)") else: print("=> 结论:未能拒绝原假设(数据是非平稳的)") # 检验原始数据 adf_test(df['Passengers'], '原始数据')对原始数据进行的ADF检验很可能显示高p值(远大于0.05),表明数据非平稳。我们需要进行差分。考虑到趋势和季节性,我们可能需要非季节性差分($d=1$)和季节性差分($D=1$)。月度数据的季节周期 $m$ 是12。# 应用非季节性差分 (d=1) df['Passengers_diff'] = df['Passengers'].diff(1) # 对非季节性差分后的数据应用季节性差分 (D=1, m=12) df['Passengers_diff_seasonal'] = df['Passengers_diff'].diff(12) # 绘制差分数据 plt.figure(figsize=(12, 6)) plt.plot(df.index, df['Passengers_diff_seasonal'], label='差分后数据 (d=1, D=1, m=12)') plt.title('差分后的航空乘客数据') plt.xlabel('日期') plt.ylabel('差分值') plt.legend() plt.grid(True) plt.show(); # 检验差分数据的平稳性 adf_test(df['Passengers_diff_seasonal'], '季节性差分数据 (d=1, D=1, m=12)')在应用了一阶非季节性差分($d=1$)和一阶季节性差分($D=1, m=12$)后,图表应看起来更平稳(均值回归到零附近),并且ADF检验现在应产生小于或等于0.05的p值,从而确认平稳性。这表明我们可能会在SARIMA模型中使用$d=1$和$D=1$。3. 确定模型阶数 (ACF/PACF)现在,我们绘制平稳(差分后)序列的ACF和PACF图,以帮助确定非季节性阶数($p, q$)和季节性阶数($P, Q$)。# 绘制差分数据的ACF和PACF fig, axes = plt.subplots(1, 2, figsize=(16, 5)) # ACF图 plot_acf(df['Passengers_diff_seasonal'].dropna(), ax=axes[0], lags=40) axes[0].set_title('差分数据的ACF') # PACF图 plot_pacf(df['Passengers_diff_seasonal'].dropna(), ax=axes[1], lags=40, method='ols') # PACF推荐使用'ols'方法 axes[1].set_title('差分数据的PACF') plt.tight_layout() plt.show()解读这些图表:非季节性阶数 (p, q): 查看前几个滞后(1、2、3...)。ACF中在滞后 $q$ 后截断的明显峰值表明是MA(q)过程。PACF中在滞后 $p$ 后截断的明显峰值表明是AR(p)过程。如果ACF和PACF都逐渐拖尾,则表明是ARMA(p, q)过程。在我们的例子中,我们可能会看到ACF在滞后1处有一个明显峰值,表明 $q=1$。PACF可能在滞后1处也显示一个峰值,表明 $p=1$。我们从 $p=1, q=1$ 开始尝试。季节性阶数 (P, Q): 查看对应季节周期(12、24、36...)的滞后。ACF在滞后 $m$(例如12)处截断的明显峰值表明是季节性MA(Q)过程,其中 $Q=1$。PACF在滞后 $m$ 处截断的明显峰值表明是季节性AR(P)过程,其中 $P=1$。查看滞后12,我们可能会观察到ACF和PACF中都有一个明显的负向峰值。ACF在滞后12处的明显峰值表明 $Q=1$。PACF在滞后12处的明显峰值表明 $P=1$。我们暂且尝试 $P=1, Q=1$。基于此分析,我们模型的合理起点是 $ SARIMA(p=1, d=1, q=1)(P=1, D=1, Q=1)_{m=12} $。4. 拟合SARIMA模型我们现在可以使用statsmodels拟合SARIMA模型。请记住,将模型拟合到原始数据,并在模型参数中指定差分阶数($d$和$D$)。# 定义模型阶数 # 非季节性阶数 (p, d, q) order = (1, 1, 1) # 季节性阶数 (P, D, Q, m) seasonal_order = (1, 1, 1, 12) # 创建并拟合SARIMA模型 # 注意:使用 enforce_stationarity=False 和 enforce_invertibility=False # 有时有助于收敛,但请谨慎使用。默认为True。 model = SARIMAX(df['Passengers'], order=order, seasonal_order=seasonal_order, enforce_stationarity=False, enforce_invertibility=False) results = model.fit(disp=False) # disp=False 隐藏收敛信息 # 打印模型摘要 print(results.summary())摘要提供了有用的信息,包括系数估计、标准误差、p值以及AIC和BIC等信息准则。检查系数是否具有统计意义(p值 < 0.05)。5. 模型诊断检查模型残差是否接近白噪声很重要。statsmodels为此提供了一个方便的绘图函数。# 绘制诊断图 results.plot_diagnostics(figsize=(12, 8)) plt.tight_layout() plt.show()检查诊断图:标准化残差(左上): 应在零附近波动,方差均匀,没有明显模式。直方图加估计密度(右上): 直方图理想情况下应与正态分布曲线(N(0,1))匹配。KDE曲线应接近正态分布线。正态Q-Q图(左下): 大多数点应位于红色直线上,表明正态性。尾部的偏差很常见。相关图(残差的ACF图,右下): 在滞后0处不应有明显的自相关峰值。明显的峰值表明模型尚未捕捉所有时间依赖性。如果诊断结果合理(残差看起来是白噪声),则模型可能足够。如果不是,您可能需要重新检查模型阶数($p, d, q, P, D, Q$)或考虑其他模型变体。6. 预测最后,让我们使用已拟合的模型来预测未来值。我们将预测未来36个月(3年)。# 定义预测步长 forecast_steps = 36 # 获取预测对象 forecast_obj = results.get_forecast(steps=forecast_steps) # 提取预测均值 forecast_mean = forecast_obj.predicted_mean # 提取置信区间 confidence_intervals = forecast_obj.conf_int(alpha=0.05) # 95% 置信区间 # 创建预测的日期范围 last_date = df.index[-1] forecast_index = pd.date_range(start=last_date + pd.DateOffset(months=1), periods=forecast_steps, freq='MS') # 'MS' 表示月初频率 # --- 创建 Plotly 图表 --- # 准备 Plotly 数据 plot_data = [ {"x": df.index.strftime('%Y-%m-%d').tolist(), "y": df['Passengers'].tolist(), "mode": "lines", "name": "观测值", "line": {"color": "#228be6"}}, {"x": forecast_index.strftime('%Y-%m-%d').tolist(), "y": forecast_mean.tolist(), "mode": "lines", "name": "预测值", "line": {"color": "#f03e3e", "dash": "dash"}}, {"x": forecast_index.strftime('%Y-%m-%d').tolist() + forecast_index.strftime('%Y-%m-%d').tolist()[::-1], # 填充区域的 x 坐标 "y": confidence_intervals.iloc[:, 1].tolist() + confidence_intervals.iloc[:, 0].tolist()[::-1], # 上界 + 下界反转 "fill": "toself", "fillcolor": "rgba(250, 82, 82, 0.2)", "line": {"color": "rgba(255, 255, 255, 0)"}, "name": "95% 置信区间", "hoverinfo": "skip"} # 浅红色填充,无边框线 ] plot_layout = { "title": "SARIMA航空乘客预测", "xaxis": {"title": "日期"}, "yaxis": {"title": "乘客数量"}, "legend": {"x": 0.01, "y": 0.99}, "hovermode": "x unified" } ```plotly {"layout": {"title": "SARIMA航空乘客预测", "xaxis": {"title": "日期", "type": "date"}, "yaxis": {"title": "乘客数量"}, "legend": {"x": 0.01, "y": 0.99}, "hovermode": "x unified"}, "data": [{"x": ["1949-01-01", "1949-02-01", "1949-03-01", "1949-04-01", "1949-05-01", "1949-06-01", "1949-07-01", "1949-08-01", "1949-09-01", "1949-10-01", "1949-11-01", "1949-12-01", "1950-01-01", "1950-02-01", "1950-03-01", "1950-04-01", "1950-05-01", "1950-06-01", "1950-07-01", "1950-08-01", "1950-09-01", "1950-10-01", "1950-11-01", "1950-12-01", "1951-01-01", "1951-02-01", "1951-03-01", "1951-04-01", "1951-05-01", "1951-06-01", "1951-07-01", "1951-08-01", "1951-09-01", "1951-10-01", "1951-11-01", "1951-12-01", "1952-01-01", "1952-02-01", "1952-03-01", "1952-04-01", "1952-05-01", "1952-06-01", "1952-07-01", "1952-08-01", "1952-09-01", "1952-10-01", "1952-11-01", "1952-12-01", "1953-01-01", "1953-02-01", "1953-03-01", "1953-04-01", "1953-05-01", "1953-06-01", "1953-07-01", "1953-08-01", "1953-09-01", "1953-10-01", "1953-11-01", "1953-12-01", "1954-01-01", "1954-02-01", "1954-03-01", "1954-04-01", "1954-05-01", "1954-06-01", "1954-07-01", "1954-08-01", "1954-09-01", "1954-10-01", "1954-11-01", "1954-12-01", "1955-01-01", "1955-02-01", "1955-03-01", "1955-04-01", "1955-05-01", "1955-06-01", "1955-07-01", "1955-08-01", "1955-09-01", "1955-10-01", "1955-11-01", "1955-12-01", "1956-01-01", "1956-02-01", "1956-03-01", "1956-04-01", "1956-05-01", "1956-06-01", "1956-07-01", "1956-08-01", "1956-09-01", "1956-10-01", "1956-11-01", "1956-12-01", "1957-01-01", "1957-02-01", "1957-03-01", "1957-04-01", "1957-05-01", "1957-06-01", "1957-07-01", "1957-08-01", "1957-09-01", "1957-10-01", "1957-11-01", "1957-12-01", "1958-01-01", "1958-02-01", "1958-03-01", "1958-04-01", "1958-05-01", "1958-06-01", "1958-07-01", "1958-08-01", "1958-09-01", "1958-10-01", "1958-11-01", "1958-12-01", "1959-01-01", "1959-02-01", "1959-03-01", "1959-04-01", "1959-05-01", "1959-06-01", "1959-07-01", "1959-08-01", "1959-09-01", "1959-10-01", "1959-11-01", "1959-12-01", "1960-01-01", "1960-02-01", "1960-03-01", "1960-04-01", "1960-05-01", "1960-06-01", "1960-07-01", "1960-08-01", "1960-09-01", "1960-10-01", "1960-11-01", "1960-12-01"], "y": [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118, 115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140, 145, 150, 178, 163, 172, 178, 199, 199, 184, 162, 146, 166, 171, 180, 193, 181, 183, 218, 230, 242, 209, 191, 172, 194, 196, 196, 236, 235, 229, 243, 264, 272, 237, 211, 180, 201, 204, 188, 235, 227, 234, 264, 302, 293, 259, 229, 203, 229, 242, 233, 267, 269, 270, 315, 364, 347, 312, 274, 237, 278, 284, 277, 317, 313, 318, 374, 413, 405, 355, 306, 271, 306, 315, 301, 356, 348, 355, 422, 465, 467, 404, 347, 305, 336, 340, 318, 362, 348, 363, 435, 491, 505, 404, 359, 310, 337, 360, 342, 406, 396, 420, 472, 548, 559, 463, 407, 362, 405, 417, 391, 419, 461, 472, 535, 622, 606, 508, 461, 390, 432], "mode": "lines", "name": "观测值", "line": {"color": "#228be6"}}, {"x": ["1961-01-01", "1961-02-01", "1961-03-01", "1961-04-01", "1961-05-01", "1961-06-01", "1961-07-01", "1961-08-01", "1961-09-01", "1961-10-01", "1961-11-01", "1961-12-01", "1962-01-01", "1962-02-01", "1962-03-01", "1962-04-01", "1962-05-01", "1962-06-01", "1962-07-01", "1962-08-01", "1962-09-01", "1962-10-01", "1962-11-01", "1962-12-01", "1963-01-01", "1963-02-01", "1963-03-01", "1963-04-01", "1963-05-01", "1963-06-01", "1963-07-01", "1963-08-01", "1963-09-01", "1963-10-01", "1963-11-01", "1963-12-01"], "y": [448.0107, 421.0543, 459.0079, 496.8253, 506.9686, 573.1011, 665.2013, 651.1866, 553.9273, 498.9618, 424.4382, 469.9089, 486.4311, 459.4636, 497.4032, 535.2090, 545.3428, 611.4669, 703.5612, 689.5426, 592.2800, 537.3121, 462.7870, 508.2568, 524.7783, 497.8102, 535.7493, 573.5547, 583.6882, 649.8120, 741.9059, 727.8871, 630.6242, 575.6560, 501.1308, 546.5999], "mode": "lines", "name": "预测值", "line": {"color": "#f03e3e", "dash": "dash"}}, {"x": ["1961-01-01", "1961-02-01", "1961-03-01", "1961-04-01", "1961-05-01", "1961-06-01", "1961-07-01", "1961-08-01", "1961-09-01", "1961-10-01", "1961-11-01", "1961-12-01", "1962-01-01", "1962-02-01", "1962-03-01", "1962-04-01", "1962-05-01", "1962-06-01", "1962-07-01", "1962-08-01", "1962-09-01", "1962-10-01", "1962-11-01", "1962-12-01", "1963-01-01", "1963-02-01", "1963-03-01", "1963-04-01", "1963-05-01", "1963-06-01", "1963-07-01", "1963-08-01", "1963-09-01", "1963-10-01", "1963-11-01", "1963-12-01", "1963-12-01", "1963-11-01", "1963-10-01", "1963-09-01", "1963-08-01", "1963-07-01", "1963-06-01", "1963-05-01", "1963-04-01", "1963-03-01", "1963-02-01", "1963-01-01", "1962-12-01", "1962-11-01", "1962-10-01", "1962-09-01", "1962-08-01", "1962-07-01", "1962-06-01", "1962-05-01", "1962-04-01", "1962-03-01", "1962-02-01", "1962-01-01", "1961-12-01", "1961-11-01", "1961-10-01", "1961-09-01", "1961-08-01", "1961-07-01", "1961-06-01", "1961-05-01", "1961-04-01", "1961-03-01", "1961-02-01", "1961-01-01"], "y": [484.9134, 463.1606, 506.1778, 549.0953, 563.3422, 633.7900, 730.4556, 720.0402, 626.1520, 574.5574, 503.3540, 551.9914, 571.0951, 545.8782, 585.2252, 624.2841, 635.5849, 702.8583, 796.1106, 783.2313, 687.0546, 633.1189, 559.5742, 605.9776, 623.0591, 596.6416, 635.0931, 673.3903, 684.0015, 750.5905, 843.1487, 829.5942, 732.7961, 678.2917, 604.1970, 650.0358, 443.1641, 398.0646, 472.8203, 528.0522, 626.1630, 646.0474, 578.4535, 552.1759, 513.4189, 526.8964, 507.0188, 466.4975, 410.5360, 369.5996, 441.4553, 460.9056, 605.3620, 616.1449, 549.1338, 499.7476, 455.1823, 461.5732, 478.7712, 452.6269, 410.1082, 376.1368, 423.0761, 418.3862, 399.9566, 431.8380, 444.7253, 449.5950, 512.4123, 599.9469, 582.3331, 481.7027, 423.3663, 345.5223, 387.8264], "fill": "toself", "fillcolor": "rgba(250, 82, 82, 0.2)", "line": {"color": "rgba(255, 255, 255, 0)"}, "name": "95% 置信区间", "hoverinfo": "skip"}]}观测到的航空乘客数量、未来36个月的SARIMA预测以及相应的95%置信区间。该图显示了原始数据以及模型的预测值及其相关的不确定性(置信区间通常随着预测期延长而变宽)。预测应在视觉上延续历史数据中观察到的趋势和季节性模式。总结在本实践部分,我们成功为季节性时间序列构建了一个SARIMA模型。主要步骤包括:加载并可视化数据,以识别趋势和季节性。使用ADF检验检查平稳性,并应用必要的差分(包括非季节性和季节性)。在平稳数据上使用ACF和PACF图来指导SARIMA阶数 $(p, d, q)(P, D, Q)_m$ 的选择。使用statsmodels拟合SARIMAX模型。通过残差诊断评估模型拟合度。生成带有置信区间的未来预测。请记住,找到最佳SARIMA模型通常需要一些迭代。您可能根据诊断结果或信息准则(如模型摘要中的AIC/BIC)尝试略微不同的阶数,并比较它们的性能,这是下一章模型评估的重点。