Time series data, sequences of observations recorded over time, are fundamental in many machine learning domains, including forecasting, anomaly detection, and signal processing. Stock prices, sensor readings, weather patterns, and user activity logs are all examples of time series. Pandas provides excellent tools specifically designed for handling date and time data efficiently. Building upon the data loading and manipulation techniques we've covered, this section focuses on how to work with time-based indexing, resampling, and other time series specific operations within Pandas.
At the heart of Pandas' time series capabilities are specialized data types and index structures. The Timestamp type is Pandas' counterpart to Python's datetime.datetime object but is more efficient and integrates well with NumPy and Pandas operations. When you set a column of datetime objects as the index of a DataFrame, it becomes a DatetimeIndex. This enables powerful time-based indexing and slicing. While Period and Timedelta are useful, our primary focus for most ML data preparation tasks will be on Timestamp and DatetimeIndex.
You often encounter time information as strings in data files. Pandas provides flexible ways to convert these into proper datetime objects and set up a DatetimeIndex.
The pd.to_datetime() function is the main tool for converting scalar values, lists, or Series containing date-like strings into Pandas Timestamp objects.
import pandas as pd
# Convert single strings
print(pd.to_datetime('2023-10-26'))
print(pd.to_datetime('27/10/2023', dayfirst=True)) # dayfirst=True reads this as 27 October, not month 27
# Convert a list or Series
date_strings = ['2023-11-01', '2023-11-05', '2023-11-10']
datetime_series = pd.to_datetime(date_strings)
print(datetime_series)
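When input strings are messy, two pd.to_datetime() parameters are worth knowing: an explicit format string avoids ambiguous inference, and errors='coerce' converts unparseable entries to NaT instead of raising. A short sketch (the sample strings here are made up for illustration):

```python
import pandas as pd

# Mixed-quality date strings: the last entry is malformed
raw = pd.Series(['2023-10-26', '2023-10-27', 'not-a-date'])

# An explicit format skips format inference and fails fast on surprises;
# errors='coerce' turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

print(parsed)
print(parsed.isna().sum())  # one value could not be parsed
```

Coercing to NaT is often preferable in ML pipelines, since you can then decide explicitly how to handle the missing timestamps.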
For creating sequences of dates, pd.date_range() is very useful. You can specify the start date, end date, and the frequency (e.g., daily, monthly, hourly).
# Daily frequency (default)
daily_index = pd.date_range(start='2023-01-01', end='2023-01-05')
print(daily_index)
# Monthly frequency
monthly_index = pd.date_range(start='2023-01-01', periods=4, freq='M') # 'M' is month-end frequency
print(monthly_index)
# Business day frequency
business_day_index = pd.date_range(start='2023-10-23', periods=5, freq='B')
print(business_day_index)
Common frequency strings include 'D' (daily), 'B' (business daily), 'W' (weekly), 'M' (month end), 'MS' (month start), 'Q' (quarter end), 'QS' (quarter start), 'A' (year end), 'AS' (year start), 'H' (hourly), 'T' or 'min' (minutely), and 'S' (secondly). Note that recent Pandas versions (2.2+) prefer the aliases 'ME', 'QE', 'YE', 'h', 'min', and 's'; the older spellings still work but emit deprecation warnings.
Often, your dataset will have one or more columns containing date information. You can instruct Pandas to parse these directly while loading the data using the parse_dates argument in functions like pd.read_csv().
# Assume 'data.csv' has columns 'date_str' and 'value'
# df = pd.read_csv('data.csv', parse_dates=['date_str'])
# Create a sample DataFrame for demonstration
data = {'date_str': ['2023-01-15', '2023-01-16', '2023-01-17'],
'value': [10, 12, 11]}
df = pd.DataFrame(data)
df['date_str'] = pd.to_datetime(df['date_str']) # Convert manually if not parsed on load
print(df.info()) # Note the date_str column is now datetime64[ns]
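To see parse_dates in action without a file on disk, an in-memory CSV via io.StringIO can stand in for data.csv (the inline CSV text below is invented for demonstration). Combining parse_dates with index_col delivers a DataFrame that arrives with a DatetimeIndex already in place:

```python
import io
import pandas as pd

# Inline CSV text stands in for a file on disk
csv_text = "date_str,value\n2023-01-15,10\n2023-01-16,12\n2023-01-17,11\n"

# parse_dates converts the column while loading; index_col makes it the
# index, so the result is ready for time-based slicing immediately
df = pd.read_csv(io.StringIO(csv_text),
                 parse_dates=['date_str'],
                 index_col='date_str')

print(df.index)  # a DatetimeIndex
```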
Once you have a DatetimeIndex, selecting data based on time becomes intuitive. First, you typically set the datetime column as the DataFrame's index.
import numpy as np
# Create sample time series data
dates = pd.date_range('20230101', periods=100, freq='D')
ts_data = pd.Series(np.random.randn(100), index=dates)
print(ts_data.head())
# Set the date column as index (if not already)
# Assuming 'df' has a datetime column named 'date_col'
# df.set_index('date_col', inplace=True)
# For our example 'ts_data', the index is already set
# Select a specific date
print("\nData for 2023-01-05:")
print(ts_data['2023-01-05'])
# Select data for a specific year
print("\nData for the year 2023:")
print(ts_data['2023'].head()) # Works because index is sorted
# Select data for a specific month
print("\nData for January 2023:")
print(ts_data['2023-01'].head())
# Slice a date range
print("\nData from 2023-02-10 to 2023-02-15:")
print(ts_data['2023-02-10':'2023-02-15'])
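Newer Pandas versions recommend spelling these partial-string selections with .loc rather than plain brackets on a Series. The DatetimeIndex also exposes calendar components (year, month, dayofweek, and so on) that support boolean selection, a pattern worth knowing for data preparation. A brief sketch:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20230101', periods=100, freq='D')
ts_data = pd.Series(np.random.randn(100), index=dates)

# .loc is the recommended spelling for label-based partial-string slicing
feb_slice = ts_data.loc['2023-02-10':'2023-02-15']
print(len(feb_slice))  # 6 days: both endpoints are included

# Calendar components enable boolean selection, e.g. weekdays only
weekdays_only = ts_data[ts_data.index.dayofweek < 5]  # Monday=0 ... Friday=4
print(len(weekdays_only))
```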
This powerful indexing relies on the DatetimeIndex being sorted. If you create or modify the index in a way that unsorts it, you might need to call df.sort_index() for these slicing methods to work correctly.
Resampling involves changing the frequency of your time series observations. This is a common operation for aggregating data to a lower frequency (downsampling) or increasing the frequency (upsampling), often requiring interpolation. The primary tool for this is the .resample() method.
When downsampling, you aggregate data from a higher frequency to a lower frequency (e.g., daily data to monthly data). You need to specify how to aggregate the data points that fall within each new, larger time bin (e.g., mean, sum, max, min).
# Sample daily data
dates = pd.date_range('2023-01-01', periods=35, freq='D')
daily_values = pd.Series(np.random.rand(35) * 100, index=dates)
# Resample to monthly frequency, taking the mean
monthly_mean = daily_values.resample('M').mean()
print("\nMonthly Mean:")
print(monthly_mean)
# Resample to weekly frequency, taking the sum
weekly_sum = daily_values.resample('W').sum()
print("\nWeekly Sum:")
print(weekly_sum)
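You are not limited to a single aggregation per resample. Chaining .agg() with a list of function names computes several statistics per bin in one pass, which is handy when building summary features. A small sketch reusing the same daily data:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2023-01-01', periods=35, freq='D')
daily_values = pd.Series(np.random.rand(35) * 100, index=dates)

# .agg applies several aggregations per time bin,
# producing one column per function
monthly_stats = daily_values.resample('M').agg(['mean', 'min', 'max'])
print(monthly_stats)
```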
Upsampling increases the frequency (e.g., monthly data to daily data). Since you don't have data for the newly created time points, you need to decide how to fill them. Common methods include forward-fill (ffill), back-fill (bfill), or interpolation.
# Sample monthly data
dates_monthly = pd.date_range('2023-01-01', periods=4, freq='MS') # Month Start
monthly_values = pd.Series([10, 15, 12, 18], index=dates_monthly)
# Resample to daily frequency, using forward fill
daily_ffill = monthly_values.resample('D').ffill()
print("\nDaily Data (Forward Fill):")
print(daily_ffill.head(10)) # Show first 10 days
# Resample to daily frequency, using interpolation
daily_interpolated = monthly_values.resample('D').interpolate(method='linear')
print("\nDaily Data (Linear Interpolation):")
print(daily_interpolated.head(10)) # Show first 10 days
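If you want to see exactly which points upsampling created before choosing a fill strategy, .asfreq() reindexes to the new frequency and leaves the new points as NaN:

```python
import pandas as pd

dates_monthly = pd.date_range('2023-01-01', periods=4, freq='MS')
monthly_values = pd.Series([10, 15, 12, 18], index=dates_monthly)

# .asfreq() changes the frequency without filling, so every
# newly created time point shows up explicitly as NaN
daily_asfreq = monthly_values.resample('D').asfreq()
print(daily_asfreq.head())
print(daily_asfreq.notna().sum())  # only the 4 original observations are non-NaN
```

Inspecting the gaps this way makes the choice between ffill, bfill, and interpolation a deliberate one rather than a default.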
Rolling window calculations are essential for analyzing trends and smoothing time series data. They operate on a sliding window of a defined size, applying a function (like mean, sum, standard deviation) to the data within the window as it moves through the series.
The .rolling() method is used for this. You specify the window size (number of periods).
# Sample data
dates = pd.date_range('2023-01-01', periods=20, freq='D')
data = pd.Series(np.random.randn(20).cumsum() + 50, index=dates) # Random walk like data
# Calculate the 5-day rolling mean
rolling_mean_5d = data.rolling(window=5).mean()
print("\nOriginal Data:")
print(data.head(7))
print("\n5-Day Rolling Mean:")
print(rolling_mean_5d.head(7)) # Note initial NaN values
The first window - 1 values of the rolling calculation (here, the first 4) will be NaN because there aren't enough preceding data points to fill the window.
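If those leading NaN values are undesirable, the min_periods parameter lets the window emit a result as soon as it contains a minimum number of observations, at the cost of partial-window averages at the start:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2023-01-01', periods=20, freq='D')
data = pd.Series(np.random.randn(20).cumsum() + 50, index=dates)

# min_periods=1 produces a value once a single observation is available,
# trading leading NaNs for partial-window averages
rolling_partial = data.rolling(window=5, min_periods=1).mean()
print(rolling_partial.isna().sum())  # 0: no leading NaNs
```

Whether partial-window values are acceptable depends on the model; for strict lookback features, keeping the NaNs and dropping those rows is often safer.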
Original time series data compared against its 5-day rolling mean, illustrating the smoothing effect.
Other common rolling functions include .rolling(...).sum(), .rolling(...).std(), .rolling(...).max(), etc. Rolling windows are frequently used in feature engineering for time series models, capturing recent trends or volatility.
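Combining several rolling statistics into one feature table is a typical preparation step. A minimal sketch (the column names roll_mean_5 and roll_std_5 are illustrative, not a convention):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2023-01-01', periods=20, freq='D')
price = pd.Series(np.random.randn(20).cumsum() + 50, index=dates)

# Each rolling statistic becomes a candidate feature column
features = pd.DataFrame({
    'value': price,
    'roll_mean_5': price.rolling(window=5).mean(),  # recent level
    'roll_std_5': price.rolling(window=5).std(),    # recent volatility
})
# Drop the warm-up rows where the window is not yet full
features = features.dropna()
print(features.head())
```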
Sometimes you need to compare time series values with values from previous periods. This is done using the .shift() method, which pushes the data forward or backward by a specified number of periods. This is extremely useful for creating lagged features in forecasting models or calculating period-over-period changes.
# Create sample data
dates = pd.date_range('2023-01-01', periods=5, freq='D')
ts = pd.Series([10, 12, 11, 13, 14], index=dates)
# Shift data forward by 1 period (lag)
ts_lag1 = ts.shift(1)
print("\nOriginal Series:")
print(ts)
print("\nLagged Series (shift=1):")
print(ts_lag1)
# Calculate percentage change from the previous period
pct_change = (ts - ts.shift(1)) / ts.shift(1) * 100
# Alternatively: ts.pct_change() * 100
print("\nPercentage Change:")
print(pct_change)
# Shift data backward by 1 period (lead)
ts_lead1 = ts.shift(-1)
print("\nLead Series (shift=-1):")
print(ts_lead1)
Notice that shifting introduces NaN values at the beginning (for positive shifts) or end (for negative shifts) of the series.
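Stacking several shifted copies of a series alongside the unshifted values is the standard way to turn a forecasting problem into a supervised learning table. A small sketch (the column names lag_1, lag_2, and target are illustrative):

```python
import pandas as pd

dates = pd.date_range('2023-01-01', periods=5, freq='D')
ts = pd.Series([10, 12, 11, 13, 14], index=dates)

# Lagged copies become predictor columns; the unshifted series is the target
supervised = pd.DataFrame({
    'lag_2': ts.shift(2),
    'lag_1': ts.shift(1),
    'target': ts,
}).dropna()  # drop rows where a lag is undefined
print(supervised)
```

Each remaining row pairs a target value with the two observations that preceded it, exactly the shape most regression models expect.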
Real-world time series data often includes time zone information, or requires it for correct interpretation. Pandas provides robust support for time zone localization and conversion.
tz_localize(): Assigns a time zone to naive datetime objects (those without time zone info).
tz_convert(): Converts datetime objects from one time zone to another.
# Create a naive DatetimeIndex
naive_dates = pd.date_range('2023-10-26 09:00:00', periods=3, freq='H')
ts_naive = pd.Series([1, 2, 3], index=naive_dates)
print("\nNaive Time Series:")
print(ts_naive)
# Localize to a specific time zone (e.g., US/Eastern)
ts_eastern = ts_naive.tz_localize('US/Eastern')
print("\nLocalized to US/Eastern:")
print(ts_eastern)
# Convert to another time zone (e.g., Europe/Berlin)
ts_berlin = ts_eastern.tz_convert('Europe/Berlin')
print("\nConverted to Europe/Berlin:")
print(ts_berlin)
Working correctly with time zones is important when dealing with data spanning different geographical regions or when daylight saving time changes are relevant.
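A common convention when mixing regions is to store all timestamps in UTC and convert to a local zone only for display or zone-specific feature extraction. A brief sketch of that workflow:

```python
import pandas as pd

# Store everything in UTC, convert only when a local view is needed
utc_index = pd.date_range('2023-06-01 12:00', periods=3, freq='H', tz='UTC')
ts_utc = pd.Series([1, 2, 3], index=utc_index)

# Conversion changes the wall-clock labels, not the underlying instants
ts_eastern = ts_utc.tz_convert('US/Eastern')
print(ts_eastern.index[0])  # 12:00 UTC is 08:00 EDT in June (UTC-4)

# The instants are identical, so the indexes compare equal
print((ts_utc.index == ts_eastern.index).all())
```

Keeping the canonical data in UTC sidesteps most daylight saving time ambiguities until the final, presentation-level conversion.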
Mastering these Pandas time series tools is essential for preparing temporal data for machine learning. You can now index data based on time, change its frequency through resampling, analyze trends with rolling windows, and create lagged features using shifting, all fundamental steps in building effective time-dependent models.
© 2025 ApX Machine Learning