As mentioned in the chapter introduction, directly applying Q-learning updates using consecutive samples $(s_t, a_t, r_{t+1}, s_{t+1}), (s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}), \dots$ to train a deep neural network approximator $Q(s, a; \theta)$ presents significant challenges. Neural networks often assume that the training data points are independent and identically distributed (IID). However, in reinforcement learning:

- Consecutive transitions are highly correlated, since $s_{t+1}$ follows directly from $s_t$ and $a_t$. Training on such sequential, redundant samples violates the IID assumption and produces unstable, high-variance gradient updates.
- The data distribution is non-stationary: as the agent's policy improves, the states and actions it encounters change, so the network can overfit to recent behavior and forget what it learned earlier.
To address these issues, Deep Q-Networks employ a technique called Experience Replay.
The core idea is simple yet effective: instead of immediately training on the most recent experience, the agent stores its experiences in a large memory buffer, often called a replay buffer or replay memory. An "experience" or "transition" is typically stored as a tuple: $(s_t, a_t, r_{t+1}, s_{t+1})$.
The replay buffer usually has a fixed capacity (e.g., storing the last 1 million transitions). As new experiences arrive, they are added to the buffer, potentially overwriting the oldest ones if the buffer is full.
During the learning phase, instead of using only the latest transition, the algorithm samples a minibatch of transitions uniformly at random from the replay buffer. These randomly sampled transitions are then used to perform a gradient descent update on the Q-network's parameters $\theta$.
Figure: Flow showing agent interaction, storing transitions in the replay buffer, and the separate learning process sampling from the buffer to update the DQN.
A Python `collections.deque` with a `maxlen` is often used for an efficient implementation, since it allows easy addition of new experiences and automatic removal of the oldest ones. Here's a conceptual Python snippet illustrating the storage and sampling:
```python
import random
from collections import deque, namedtuple

# Define the structure for a transition
Transition = namedtuple('Transition',
                        ('state', 'action', 'next_state', 'reward'))

class ReplayMemory:
    def __init__(self, capacity):
        # Use deque as a fixed-size circular buffer
        self.memory = deque([], maxlen=capacity)

    def push(self, *args):
        """Save a transition"""
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        """Sample a random batch of transitions"""
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

# --- Usage ---
# Initialize buffer
memory = ReplayMemory(10000)  # Capacity of 10,000

# During interaction loop:
# state, action, next_state, reward = get_experience_from_env(...)
# memory.push(state, action, next_state, reward)

# During learning step (if buffer has enough samples):
# if len(memory) > BATCH_SIZE:
#     transitions = memory.sample(BATCH_SIZE)
#     # Unpack the batch:
#     # batch = Transition(*zip(*transitions))
#     # Perform gradient update using this batch...
```
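The "perform gradient update" step elided above is where the sampled minibatch feeds the Q-learning loss. Below is a minimal sketch of what that step might look like in PyTorch, reusing the `Transition` and `ReplayMemory` defined above; the names `q_net`, `optimizer`, `batch_size`, and `gamma` are illustrative assumptions rather than part of the original snippet, and for simplicity the bootstrap target uses the online network itself rather than a separate target network.

```python
import torch
import torch.nn.functional as F

def dqn_update(memory, q_net, optimizer, batch_size=32, gamma=0.99):
    """One gradient step on a random minibatch from the replay buffer.

    Sketch assumptions: states are stored as 1-D float tensors, actions as ints,
    rewards as floats, terminal transitions store next_state=None, and each
    sampled batch contains at least one non-terminal transition.
    """
    if len(memory) < batch_size:
        return  # Not enough experience collected yet

    transitions = memory.sample(batch_size)
    batch = Transition(*zip(*transitions))  # List of Transitions -> Transition of tuples

    state_batch = torch.stack(batch.state)                           # (B, state_dim)
    action_batch = torch.tensor(batch.action).unsqueeze(1)           # (B, 1)
    reward_batch = torch.tensor(batch.reward, dtype=torch.float32)   # (B,)

    # Q(s_t, a_t) for the actions actually taken
    q_values = q_net(state_batch).gather(1, action_batch).squeeze(1)

    # Bootstrap target: r + gamma * max_a' Q(s_{t+1}, a'), with 0 for terminal states
    non_final_mask = torch.tensor([s is not None for s in batch.next_state])
    non_final_next = torch.stack([s for s in batch.next_state if s is not None])
    next_q = torch.zeros(batch_size)
    with torch.no_grad():
        next_q[non_final_mask] = q_net(non_final_next).max(1).values
    targets = reward_batch + gamma * next_q

    # Minimize the squared TD error and update the network parameters
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice, the $\max_{a'} Q(s_{t+1}, a')$ term is usually computed with a separate, periodically synchronized target network rather than the online network, which is exactly the Fixed Q-Targets technique discussed in the next section.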
Experience replay is a foundational technique that significantly contributed to the success of DQNs, allowing them to learn effectively from high-dimensional inputs like pixels. It elegantly addresses the core instabilities arising from correlated data in RL training pipelines. The next section discusses another important technique, Fixed Q-Targets, which tackles the problem of non-stationary target values.