Having loaded our time series data into Pandas, we often need to manipulate it based on its temporal structure. Standard operations might not capture the time dependencies inherent in the data. This section covers essential Pandas techniques for shifting data points in time, calculating differences between them, and applying calculations over moving windows. These operations are fundamental for feature engineering and preparing data for time series modeling.
.shift()
One of the most common operations is shifting the entire series forward or backward in time. This is useful for comparing observations to their predecessors or successors. Pandas provides the .shift() method for this purpose.
Imagine you have a Series of daily stock prices. Shifting the data by 1 (.shift(1)) moves each data point one step forward in the index, effectively aligning $y_t$ with $y_{t-1}$. This is known as lagging the data. Conversely, shifting by -1 (.shift(-1)) moves data one step backward, aligning $y_t$ with $y_{t+1}$. This is sometimes called leading the data.
Let's see this in action. We'll create a simple time series first:
import pandas as pd
import numpy as np
# Create a sample time series
dates = pd.date_range(start='2023-01-01', periods=6, freq='M') # Month-end frequency; pandas 2.2+ prefers the alias 'ME'
data = pd.Series([10, 12, 15, 13, 16, 18], index=dates, name='Value')
print("Original Series:")
print(data)
# Shift forward by 1 period (Lag)
shifted_forward = data.shift(1)
print("\nShifted Forward (Lagged by 1):")
print(shifted_forward)
# Shift backward by 1 period (Lead)
shifted_backward = data.shift(-1)
print("\nShifted Backward (Lead by 1):")
print(shifted_backward)
Output:
Original Series:
2023-01-31 10
2023-02-28 12
2023-03-31 15
2023-04-30 13
2023-05-31 16
2023-06-30 18
Freq: M, Name: Value, dtype: int64
Shifted Forward (Lagged by 1):
2023-01-31 NaN
2023-02-28 10.0
2023-03-31 12.0
2023-04-30 15.0
2023-05-31 13.0
2023-06-30 16.0
Freq: M, Name: Value, dtype: float64
Shifted Backward (Lead by 1):
2023-01-31 12.0
2023-02-28 15.0
2023-03-31 13.0
2023-04-30 16.0
2023-05-31 18.0
2023-06-30 NaN
Freq: M, Name: Value, dtype: float64
Notice the NaN (Not a Number) values introduced at the beginning or end of the shifted series. This happens because there's no prior value for the first observation when shifting forward, and no subsequent value for the last observation when shifting backward. Because NaN is a floating-point value, the shifted series is also upcast from int64 to float64, as the dtype lines in the output show. How you handle these missing values depends on the analysis.
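As a minimal sketch of two common ways to deal with the gap left by a shift, the snippet below either drops the incomplete row or supplies a placeholder through the fill_value argument of .shift(). The placeholder value 0 is purely illustrative, and the series is rebuilt with pd.offsets.MonthEnd() so the snippet does not depend on a frequency alias string.

```python
import pandas as pd

# Rebuild the sample series used in this section
dates = pd.date_range(start='2023-01-31', periods=6, freq=pd.offsets.MonthEnd())
data = pd.Series([10, 12, 15, 13, 16, 18], index=dates, name='Value')

lagged = data.shift(1)

# Option 1: drop the row that has no predecessor
lagged_dropped = lagged.dropna()

# Option 2: supply a placeholder at shift time (0 is an arbitrary choice);
# no NaN is introduced, so the integer dtype is preserved
lagged_filled = data.shift(1, fill_value=0)

print(lagged_dropped)
print(lagged_filled)
```

Which option is appropriate depends on the downstream use: dropping rows shortens the series, while a placeholder can distort statistics computed from it.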
Lagging (positive shifts) is particularly important in forecasting. Many models try to predict the next value ($y_t$) based on previous values ($y_{t-1}$, $y_{t-2}$, etc.). Creating lagged versions of your target variable is a common way to generate features for these models.
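One way the lag-feature idea might look in practice is sketched below. The column names y, lag_1, and lag_2 are hypothetical and chosen only for illustration; the sample series is rebuilt so the snippet runs on its own.

```python
import pandas as pd

# Rebuild the sample series used in this section
dates = pd.date_range(start='2023-01-31', periods=6, freq=pd.offsets.MonthEnd())
data = pd.Series([10, 12, 15, 13, 16, 18], index=dates, name='Value')

# Assemble the target and two lagged copies side by side
# (column names are illustrative, not a required convention)
features = pd.DataFrame({
    'y': data,               # target: y_t
    'lag_1': data.shift(1),  # predictor: y_{t-1}
    'lag_2': data.shift(2),  # predictor: y_{t-2}
})

# Rows without a complete set of lags contain NaN and are dropped
features = features.dropna()
print(features)
```

With two lags, the first two rows are lost, so the six observations yield four usable rows.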
.diff()
Another frequent requirement is to compute the difference between consecutive observations. This operation, known as differencing, helps stabilize the mean of a time series by removing trends or changes in level. It highlights the period-to-period changes rather than the absolute values. We will see its importance for achieving stationarity in Chapter 2.
Pandas offers the .diff() method for this. By default, it calculates the difference between an element and the element immediately preceding it ($y_t - y_{t-1}$).
# Calculate the first difference
first_difference = data.diff()
print("\nFirst Difference (data - data.shift(1)):")
print(first_difference)
# Calculate the difference over 2 periods
second_period_difference = data.diff(periods=2)
print("\nDifference over 2 periods (data - data.shift(2)):")
print(second_period_difference)
Output:
First Difference (data - data.shift(1)):
2023-01-31 NaN
2023-02-28 2.0
2023-03-31 3.0
2023-04-30 -2.0
2023-05-31 3.0
2023-06-30 2.0
Freq: M, Name: Value, dtype: float64
Difference over 2 periods (data - data.shift(2)):
2023-01-31 NaN
2023-02-28 NaN
2023-03-31 5.0
2023-04-30 1.0
2023-05-31 1.0
2023-06-30 5.0
Freq: M, Name: Value, dtype: float64
Again, NaN values appear where the difference cannot be computed (at the start of the series). Note that data.diff(1) is conceptually equivalent to data - data.shift(1). You can specify the periods argument in .diff() to calculate differences over longer intervals.
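The equivalence between .diff() and a manual subtraction can be checked directly, as in the sketch below. It also contrasts .diff(periods=2) with applying .diff() twice, since the two are easy to confuse: the first computes $y_t - y_{t-2}$, while the second computes $(y_t - y_{t-1}) - (y_{t-1} - y_{t-2})$. The variable names are illustrative.

```python
import pandas as pd

# Rebuild the sample series used in this section
dates = pd.date_range(start='2023-01-31', periods=6, freq=pd.offsets.MonthEnd())
data = pd.Series([10, 12, 15, 13, 16, 18], index=dates, name='Value')

# .diff() and the manual subtraction give the same result
first_diff = data.diff()
manual = data - data.shift(1)
print(first_diff.equals(manual))

# Difference over two periods: y_t - y_{t-2}
two_period = data.diff(periods=2)

# Differencing applied twice: the change in the period-to-period change
second_order = data.diff().diff()

print(two_period)
print(second_order)
```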
.rolling()
Time shifting and differencing compare specific points in time. Often, however, we want to understand trends or behavior over a period of time. Rolling window calculations allow us to compute statistics (like mean, standard deviation, sum) over a sliding window of a defined size.
This is commonly used to smooth out short-term fluctuations so that the underlying trend is easier to see.
The .rolling() method in Pandas creates a rolling window object. You specify the window size (the number of observations included in each window). You then chain an aggregation method (like .mean(), .std(), .sum(), .min(), .max()) to compute the desired statistic for each window.
# Calculate the 3-period rolling mean
rolling_mean = data.rolling(window=3).mean()
print("\n3-Period Rolling Mean:")
print(rolling_mean)
# Calculate the 2-period rolling standard deviation
rolling_std = data.rolling(window=2).std()
print("\n2-Period Rolling Standard Deviation:")
print(rolling_std)
Output:
3-Period Rolling Mean:
2023-01-31 NaN
2023-02-28 NaN
2023-03-31 12.333333
2023-04-30 13.333333
2023-05-31 14.666667
2023-06-30 15.666667
Freq: M, Name: Value, dtype: float64
2-Period Rolling Standard Deviation:
2023-01-31 NaN
2023-02-28 1.414214
2023-03-31 2.121320
2023-04-30 1.414214
2023-05-31 2.121320
2023-06-30 1.414214
Freq: M, Name: Value, dtype: float64
By default, the result for each window is aligned with the end (right edge) of the window. For a window size of n, the first n-1 values of the result will be NaN because there aren't enough preceding data points to fill the window.
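Both the alignment and the leading NaN values can be adjusted through arguments of .rolling(). The sketch below, using illustrative variable names, compares the default behavior with min_periods=1 (which allows partial windows at the start) and center=True (which labels each result at the middle of its window).

```python
import pandas as pd

# Rebuild the sample series used in this section
dates = pd.date_range(start='2023-01-31', periods=6, freq=pd.offsets.MonthEnd())
data = pd.Series([10, 12, 15, 13, 16, 18], index=dates, name='Value')

# Default: right-aligned, NaN until the window is full
default_mean = data.rolling(window=3).mean()

# Allow partial windows: a result is produced from the first observation on
partial_mean = data.rolling(window=3, min_periods=1).mean()

# Center the window: NaN appears at both ends instead of only at the start
centered_mean = data.rolling(window=3, center=True).mean()

print(pd.DataFrame({
    'default': default_mean,
    'min_periods=1': partial_mean,
    'center=True': centered_mean,
}))
```

Note that a centered window uses observations that come after the labeled date, so it suits descriptive smoothing rather than features for forecasting.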
Visualizing the original series alongside its rolling mean often provides useful insights into underlying trends.
Figure: Original monthly data compared to its 3-period rolling mean. The rolling mean provides a smoother representation of the series' progression.
.expanding()
A related concept is the expanding window. Unlike a rolling window with a fixed size, an expanding window includes all data points from the start of the series up to the current point. It's useful for calculating cumulative statistics.
The .expanding() method works similarly to .rolling(), but instead of a fixed window size you specify only min_periods, the minimum number of observations required before a result is produced (it defaults to 1).
# Calculate the cumulative sum
expanding_sum = data.expanding(min_periods=1).sum()
print("\nExpanding Sum (Cumulative Sum):")
print(expanding_sum)
# Calculate the expanding mean (cumulative average)
expanding_mean = data.expanding(min_periods=1).mean()
print("\nExpanding Mean (Cumulative Average):")
print(expanding_mean)
Output:
Expanding Sum (Cumulative Sum):
2023-01-31 10.0
2023-02-28 22.0
2023-03-31 37.0
2023-04-30 50.0
2023-05-31 66.0
2023-06-30 84.0
Freq: M, Name: Value, dtype: float64
Expanding Mean (Cumulative Average):
2023-01-31 10.000000
2023-02-28 11.000000
2023-03-31 12.333333
2023-04-30 12.500000
2023-05-31 13.200000
2023-06-30 14.000000
Freq: M, Name: Value, dtype: float64
These Pandas methods, .shift(), .diff(), .rolling(), and .expanding(), provide a powerful toolkit for manipulating time series data based on its temporal dependencies. Mastering them is essential for cleaning data, creating informative features, and preparing your series for the visualization and modeling techniques we'll cover next.
© 2025 ApX Machine Learning