Effective training of deep neural networks typically relies on the assumption that the input data points are independent and identically distributed (i.i.d.). However, when an RL agent interacts with an environment sequentially, the generated experiences $(s_t, a_t, r_t, s_{t+1})$ are anything but i.i.d. Consecutive states are highly correlated, and the distribution of experiences shifts as the agent's policy evolves. Training a deep Q-network directly on these sequential, correlated experiences often leads to instability and poor convergence: the network can easily overfit to recent, correlated trajectories, forgetting past, potentially valuable experiences.
Experience replay is a biologically inspired mechanism introduced with DQN to address these challenges. Instead of performing an update using only the most recent transition, the agent stores its experiences in a large data structure called a replay buffer (or replay memory), often denoted as D. This buffer typically has a fixed capacity and operates like a FIFO (First-In, First-Out) queue: when the buffer is full and a new experience arrives, the oldest one is removed.
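As a concrete illustration, here is a minimal sketch of such a buffer in Python, using a `deque` with a fixed `maxlen` to obtain the FIFO eviction behavior described above. The class and method names (`ReplayBuffer`, `push`, `sample`) are illustrative choices, not part of any specific library.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity):
        # A deque with maxlen automatically discards the oldest transition
        # once the buffer reaches its capacity.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Store one transition generated by interacting with the environment.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniformly sample a mini-batch of stored transitions without replacement.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```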
The core process works as follows: as the agent interacts with the environment, each new transition is stored in the replay buffer, and Q-network updates are performed on mini-batches sampled at random from that buffer rather than on the most recent transition alone.
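To make this flow concrete, the following sketch shows how storing and sampling might sit inside a simple interaction loop. It assumes the `ReplayBuffer` sketched above and a Gymnasium-style environment; the random action selection and the fixed step count are placeholders to keep the example self-contained, and the Q-network update itself corresponds to the loss described below.

```python
import gymnasium as gym  # any Gym-style environment would work here

env = gym.make("CartPole-v1")
buffer = ReplayBuffer(capacity=100_000)  # ReplayBuffer from the sketch above
batch_size = 32

obs, info = env.reset()
for step in range(10_000):
    # A real agent would act epsilon-greedily with respect to its Q-network;
    # a random action keeps this sketch self-contained.
    action = env.action_space.sample()
    next_obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

    # Store the latest transition, then advance (resetting at episode end).
    buffer.push(obs, action, reward, next_obs, done)
    if done:
        obs, info = env.reset()
    else:
        obs = next_obs

    # Once enough experience has accumulated, sample a decorrelated
    # mini-batch for the Q-network update (the loss is given below).
    if len(buffer) >= batch_size:
        batch = buffer.sample(batch_size)
        # ... compute targets y_j and take a gradient step on L(theta)
```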
This random sampling has two primary benefits:

- It breaks the temporal correlations between consecutive experiences, making the training data closer to the i.i.d. setting that gradient-based training of neural networks assumes.
- It improves data efficiency, since each stored transition can be reused for many updates rather than discarded after one, and valuable past experiences are not immediately forgotten.
For each sampled transition $(s_j, a_j, r_j, s_{j+1})$ in a mini-batch $B$, the target value $y_j$ is computed, often using a separate target network (discussed next) with parameters $\theta^-$ to further stabilize training:

$$y_j = r_j + \gamma \max_{a'} Q(s_{j+1}, a'; \theta^-)$$

If $s_{j+1}$ is a terminal state, then $y_j = r_j$. The loss, typically mean squared error (MSE), is then computed over the mini-batch:

$$L(\theta) = \frac{1}{|B|} \sum_{j \in B} \left( y_j - Q(s_j, a_j; \theta) \right)^2$$

The parameters $\theta$ of the main Q-network are then updated using gradient descent on this loss $L(\theta)$.
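As one possible concrete implementation of this update, the sketch below computes the targets and the MSE loss for a sampled mini-batch in PyTorch. The `q_net`, `target_net`, and `optimizer` objects are assumed to be created elsewhere (for example, two small MLPs with identical architectures and an Adam optimizer); the function name and signature are illustrative, not from any library.

```python
import numpy as np
import torch
import torch.nn.functional as F

def dqn_update(batch, q_net, target_net, optimizer, gamma=0.99):
    """One gradient step on a uniformly sampled mini-batch of transitions."""
    states, actions, rewards, next_states, dones = zip(*batch)
    states      = torch.as_tensor(np.stack(states), dtype=torch.float32)
    actions     = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
    rewards     = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(np.stack(next_states), dtype=torch.float32)
    dones       = torch.as_tensor(dones, dtype=torch.float32)

    # Q(s_j, a_j; theta) for the actions actually taken.
    q_values = q_net(states).gather(1, actions).squeeze(1)

    # Targets y_j = r_j + gamma * max_a' Q(s_{j+1}, a'; theta^-),
    # computed with the target network and no gradient flow.
    with torch.no_grad():
        next_q_max = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q_max

    # Mean squared error over the mini-batch, then a gradient step on theta.
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The `(1.0 - dones)` factor implements the terminal-state case above, zeroing out the bootstrap term when $s_{j+1}$ ends the episode.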
The size of the replay buffer is an important hyperparameter. A larger buffer stores a more diverse set of experiences, reducing correlations but potentially including outdated information from older policies. A smaller buffer adapts faster to recent policy changes but might suffer from stronger correlations. Typical buffer sizes range from tens of thousands to millions of transitions, depending on the complexity of the task and memory constraints.
Experience replay was a significant component that enabled the success of DQN, transforming Q-learning with non-linear function approximators from an often unstable technique into a powerful and widely applicable deep reinforcement learning algorithm. While simple uniform sampling is standard, subsequent enhancements like Prioritized Experience Replay (PER), which we will discuss later in this chapter, further refine the sampling strategy for improved efficiency.