Having explored the motivations for using deep neural networks with Q-learning and the techniques of Experience Replay and Target Networks to improve stability, we can now assemble these pieces into the standard Deep Q-Network (DQN) algorithm. This algorithm provides a framework for training an agent to learn optimal action-value functions in complex environments.
The core idea remains rooted in Q-learning: we want to learn an action-value function $Q(s, a)$ that approximates the expected return after taking action $a$ in state $s$ and following the optimal policy thereafter. However, instead of a table, we use a deep neural network, parameterized by weights $\theta$, to represent this function: $Q(s, a; \theta)$. To handle the instability issues discussed previously, we incorporate two main modifications:
- Experience Replay: We store the agent's experiences (tuples of state, action, reward, and next state) in a replay memory buffer. During training, we sample mini-batches of these experiences randomly from the buffer to update the network. This breaks the temporal correlations in the sequence of observations and smooths out changes in the data distribution. (A minimal buffer sketch follows this list.)
- Target Network: We use a separate neural network, called the target network, with weights $\theta^-$, to calculate the target Q-values used in the update rule. The weights $\theta^-$ are periodically copied from the primary Q-network's weights $\theta$, but are kept fixed for a number of steps between copies. This provides a more stable target for the Q-learning update, preventing the target values from chasing the constantly changing predictions of the primary network.
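As a concrete illustration of the replay memory, here is a minimal Python sketch built on `collections.deque`. The `Transition` fields, the default capacity, and plain uniform sampling are illustrative assumptions, not requirements of the algorithm.

```python
# A minimal replay memory sketch built on collections.deque. The capacity,
# the Transition fields, and uniform sampling are illustrative choices.
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ("state", "action", "reward", "next_state", "done"))

class ReplayMemory:
    def __init__(self, capacity=100_000):
        # The deque drops the oldest experiences once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Store one transition (s_t, a_t, r_{t+1}, s_{t+1}, terminal flag).
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks temporal correlations between updates.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

In a training loop, the agent would call `push` once per environment step and `sample` once enough transitions have accumulated.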
The DQN Algorithm Flow
Here's a breakdown of the typical DQN training loop (illustrative code sketches for the main steps follow it):
- Initialization:
  - Initialize the replay memory buffer $D$ with a fixed capacity.
  - Initialize the primary Q-network $Q$ with random weights $\theta$.
  - Initialize the target Q-network $\hat{Q}$ with weights $\theta^- = \theta$.
- Episode Loop: For each episode:
  - Reset the environment and get the initial state $s_1$.
  - Preprocess the initial state if necessary (e.g., stacking frames) to create the input representation for the network.
- Step Loop: For each step $t = 1, 2, \dots, T$ within the episode:
  - Action Selection: Choose an action $a_t$ based on the current state $s_t$ using an $\epsilon$-greedy strategy derived from the primary Q-network:
    - With probability $\epsilon$, select a random action.
    - With probability $1 - \epsilon$, select $a_t = \arg\max_a Q(s_t, a; \theta)$.
    (Often, $\epsilon$ is annealed, starting high and decreasing over time to shift from exploration to exploitation.)
  - Execute Action: Take action $a_t$ in the environment. Observe the reward $r_{t+1}$ and the next state $s_{t+1}$. Preprocess $s_{t+1}$.
  - Store Experience: Store the transition tuple $(s_t, a_t, r_{t+1}, s_{t+1})$ in the replay memory $D$.
  - Sample Mini-batch: If the replay memory contains enough experiences, sample a random mini-batch of $N$ transitions $(s_j, a_j, r_{j+1}, s_{j+1})$ from $D$.
  - Calculate Targets: For each transition $(s_j, a_j, r_{j+1}, s_{j+1})$ in the mini-batch, calculate the target value $y_j$:
    - If $s_{j+1}$ is a terminal state, the target is simply the reward: $y_j = r_{j+1}$.
    - Otherwise, the target is calculated using the target network $\hat{Q}$:

      $$y_j = r_{j+1} + \gamma \max_{a'} \hat{Q}(s_{j+1}, a'; \theta^-)$$

      where $\gamma$ is the discount factor. Notice that the max operation is performed over the outputs of the target network for the next state $s_{j+1}$.
  - Gradient Descent Update: Perform a gradient descent step on the primary Q-network weights $\theta$. The loss function is typically the Mean Squared Error (MSE) between the predicted Q-values (from the primary network, for the actions taken $a_j$) and the calculated target values $y_j$:

    $$L(\theta) = \frac{1}{N} \sum_{j=1}^{N} \bigl( y_j - Q(s_j, a_j; \theta) \bigr)^2$$

    Update $\theta$ using an optimizer like Adam or RMSprop: $\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta)$.
  - Update Target Network: Every $C$ steps (where $C$ is a hyperparameter), update the target network weights by copying the primary network weights: $\theta^- \leftarrow \theta$.
  - Advance State: Set $s_t \leftarrow s_{t+1}$.
  - Check Termination: If $s_t$ is a terminal state, end the current episode.
High-level interaction flow in the DQN algorithm. The agent uses its Q-Network for action selection, stores experiences in Replay Memory, samples batches to compute targets using the Target Network, and updates the Q-Network via gradient descent. The Target Network is periodically updated from the Q-Network.
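To make the action-selection step concrete, here is a minimal PyTorch sketch of a small fully connected Q-network together with an $\epsilon$-greedy helper. It assumes a vector-valued state and a discrete action space; the layer sizes and the names `QNetwork` and `select_action` are illustrative choices rather than part of the DQN specification.

```python
# A sketch of a small fully connected Q-network and epsilon-greedy action selection.
# Assumes the state is a 1-D float tensor and the action space is discrete.
import random
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim, num_actions, hidden_size=128):
        super().__init__()
        # Maps a state to one Q-value per action in a single forward pass.
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, num_actions),
        )

    def forward(self, state):
        return self.net(state)

def select_action(q_net, state, epsilon, num_actions):
    # With probability epsilon explore; otherwise act greedily w.r.t. Q(s, a; theta).
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0))   # shape: (1, num_actions)
    return int(q_values.argmax(dim=1).item())
```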
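The target calculation, loss, and gradient step can likewise be sketched as one update function. This sketch reuses the `ReplayMemory` and `QNetwork` classes from the sketches above; the batch size, discount factor, and optimizer handling are common but arbitrary defaults.

```python
# One DQN gradient step, assuming the ReplayMemory and QNetwork sketches above.
# q_net holds the primary weights (theta); target_net holds the frozen copy (theta^-).
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, memory, optimizer, batch_size=32, gamma=0.99):
    if len(memory) < batch_size:
        return None  # not enough experience collected yet

    batch = memory.sample(batch_size)
    states = torch.stack([t.state for t in batch])
    actions = torch.tensor([t.action for t in batch])            # int64 action indices
    rewards = torch.tensor([t.reward for t in batch], dtype=torch.float32)
    next_states = torch.stack([t.next_state for t in batch])
    dones = torch.tensor([t.done for t in batch], dtype=torch.float32)

    # Predicted Q(s_j, a_j; theta) for the actions actually taken.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Targets y_j come from the target network; terminal transitions keep only the reward.
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * max_next_q

    # MSE loss between predictions and targets, followed by one optimizer step on theta.
    loss = F.mse_loss(q_pred, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every C steps in the outer training loop:
#     target_net.load_state_dict(q_net.state_dict())
```

The final commented line corresponds to the "Update Target Network" step: every $C$ environment steps, the primary weights $\theta$ are copied into $\theta^-$.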
This structured approach, integrating experience replay and fixed Q-targets, allows DQN to successfully train deep neural networks for Q-learning, enabling agents to learn complex behaviors directly from high-dimensional sensory inputs like pixels, which was a significant advancement in reinforcement learning. The next section explores some of the architectural choices for the neural networks used within DQNs.