As discussed in Chapter 1, standard Reinforcement Learning algorithms often struggle when state or action spaces become very large. Representing the value function or policy explicitly for every state (or state-action pair) becomes computationally infeasible. Function approximation offers a solution, but simple linear approximators lack the capacity to capture the complex relationships present in many challenging problems, such as learning from raw pixel data in video games or robotic control from high-dimensional sensor inputs.
Deep Q-Networks (DQN) marked a significant step forward by integrating deep neural networks with the Q-learning algorithm. Instead of a table or a linear function, DQN uses a neural network to approximate the optimal action-value function, Q∗(s,a).
The core idea is straightforward: we define a neural network, often called the Q-network, parameterized by weights θ. This network takes a representation of the state s as input and outputs a vector of Q-values, one for each possible action a in that state. We denote the network's output for a given state s and action a as Q(s,a;θ).
For example, when learning to play Atari games from screen pixels, a common architecture passes a stack of recent grayscale frames through several convolutional layers (which extract spatial features) followed by one or more fully connected layers, ending in an output layer with one unit per action.
A typical convolutional neural network architecture used in DQN for processing image-based inputs.
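To make this concrete, the following PyTorch sketch defines such a Q-network. The layer sizes (an 84×84 input with 4 stacked frames, three convolutional layers, and a 512-unit hidden layer) follow a commonly cited configuration for Atari and are meant as one reasonable choice, not a prescription.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Convolutional Q-network in the spirit of the DQN Atari setup.

    Layer sizes are illustrative: 4 stacked 84x84 grayscale frames in,
    one Q-value per action out.
    """

    def __init__(self, num_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 4 stacked frames -> 32 feature maps
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512),                  # 7x7 spatial map for 84x84 inputs
            nn.ReLU(),
            nn.Linear(512, num_actions),                 # one Q-value per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: batch of stacked frames, shape (B, 4, 84, 84), values scaled to [0, 1]
        return self.head(self.features(state))
```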
How do we train the weights θ of this Q-network? We adapt the standard Q-learning update rule. Recall the Bellman equation for the optimal action-value function Q∗(s,a):
$$Q^*(s,a) = \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\left[ r + \gamma \max_{a'} Q^*(s', a') \right]$$

Q-learning iteratively approaches this optimum using updates based on experienced transitions (st, at, rt, st+1). In the context of function approximation, we want our network Q(s,a;θ) to satisfy this equation. We can define a target value yt based on the reward received and the estimated value of the next state:
$$y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta)$$

We then train the network by minimizing the difference between the target yt and the network's current prediction Q(st, at; θ). A common loss function is the Mean Squared Error (MSE), often referred to as the Mean Squared Bellman Error (MSBE):
$$L(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1})}\left[ \left( y_t - Q(s_t, a_t; \theta) \right)^2 \right]$$

The expectation is typically approximated by averaging the squared error over a mini-batch of transitions sampled from the agent's experience. The network weights θ are updated using gradient descent methods (like RMSprop or Adam) to minimize this loss.
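Written out for a single mini-batch, one such gradient step might look like the following PyTorch sketch; the names q_net, optimizer, and the batch tensors are assumptions for illustration, not part of any fixed API.

```python
import torch
import torch.nn.functional as F

# Assumed to exist: q_net (a Q-network such as the sketch above), an optimizer over
# q_net.parameters(), and mini-batch tensors: states, next_states (float32),
# actions (int64, shape (B,)), rewards and dones (float32, shape (B,)).
gamma = 0.99

# Q(s_t, a_t; theta) for the actions that were actually taken
q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

# y_t = r_t + gamma * max_a' Q(s_{t+1}, a'; theta); the bootstrap term is zeroed for
# terminal transitions. Note that the same weights theta appear inside the target.
with torch.no_grad():
    targets = rewards + gamma * (1.0 - dones) * q_net(next_states).max(dim=1).values

loss = F.mse_loss(q_values, targets)   # Mean Squared Bellman Error over the mini-batch
optimizer.zero_grad()
loss.backward()
optimizer.step()
```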
Applying this update rule directly with a non-linear function approximator like a deep neural network proved to be notoriously unstable in early attempts. This instability arises primarily from two sources, related to the "Deadly Triad" mentioned in Chapter 1 (function approximation, bootstrapping, and off-policy learning):
Correlated samples: Consecutive transitions collected by the agent are strongly correlated in time, so training on them in order violates the independence assumptions behind stochastic gradient descent and biases updates toward the most recent experience.
Moving targets: The target yt is computed with the very weights θ that each gradient step modifies, so the "label" the network chases shifts after every update, which can cause oscillation or divergence.
The original DQN paper introduced two fundamental techniques to mitigate these instabilities and enable stable training of deep Q-networks:
Experience Replay: Instead of training on samples immediately as they are collected, transitions (st,at,rt,st+1) are stored in a large buffer called the replay memory or replay buffer (D). During training, mini-batches of transitions are sampled randomly from this buffer. This breaks the temporal correlations between samples within a batch, leading to more stable and efficient learning. It also allows the agent to reuse past experiences multiple times. (This will be detailed in the section "Experience Replay Mechanism").
Target Network: To address the moving target issue, DQN uses a separate, periodically updated target network, denoted Q̂, with parameters θ−. This target network has the same architecture as the online Q-network (Q with parameters θ) but its weights are kept frozen for a number of steps. The target values yt are computed using this fixed target network:
$$y_t = r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a'; \theta^-)$$

The online network Q(s,a;θ) is then trained using gradient descent to minimize the MSE loss $L(\theta) = \mathbb{E}_{(s,a,r,s') \sim D}\left[ (y_t - Q(s,a;\theta))^2 \right]$. Periodically (e.g., every C training steps), the weights of the target network are updated by copying the weights from the online network: θ− ← θ. This use of a fixed target network significantly stabilizes the learning process. (This will be detailed in the section "Target Networks for Training Stability"). Minimal code sketches of both stabilization techniques follow below.
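First, a minimal sketch of an experience replay buffer in Python; the class name, capacity, and use of a deque are illustrative choices rather than a prescribed implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions (s, a, r, s', is_terminal)."""

    def __init__(self, capacity: int = 100_000):
        # Oldest transitions are discarded automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, is_terminal):
        self.buffer.append((state, action, reward, next_state, is_terminal))

    def sample(self, batch_size: int):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive transitions within a mini-batch.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, terminals = zip(*batch)
        return states, actions, rewards, next_states, terminals

    def __len__(self):
        return len(self.buffer)
```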
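Second, a minimal sketch of the target-network idea, reusing the hypothetical q_net from the earlier snippets: a frozen copy Q̂ provides the bootstrap values, and its weights are overwritten with the online weights only every C steps.

```python
import copy
import torch

# Assumed to exist: q_net, the online Q-network from the sketches above.
target_net = copy.deepcopy(q_net)    # Q-hat with parameters theta^-, initially equal to theta

def compute_targets(rewards, next_states, terminals, gamma=0.99):
    # y_t = r_t + gamma * max_a' Q_hat(s_{t+1}, a'; theta^-), with the bootstrap
    # term dropped for terminal transitions.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - terminals) * next_q

def maybe_sync_target(step, C=10_000):
    # theta^- <- theta every C training steps; between syncs the targets stay fixed.
    if step % C == 0:
        target_net.load_state_dict(q_net.state_dict())
```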
Combining these ideas leads to the following algorithm:
Initialize the replay buffer D, the online Q-network with random weights θ, and the target network with weights θ− ← θ.
For episode = 1 to M:
  a. Observe initial state s1.
  b. For t = 1 to T (or until episode termination):
     i. Select action at using an ϵ-greedy strategy based on Q(st,⋅;θ) (i.e., with probability ϵ select a random action, otherwise select at = argmaxa Q(st,a;θ)).
     ii. Execute action at in the environment.
     iii. Observe reward rt and next state st+1. Determine whether st+1 is a terminal state.
     iv. Store the transition (st, at, rt, st+1, is_terminal) in D.
     v. Sample a random mini-batch of K transitions (sj, aj, rj, sj+1, is_terminalj) from D.
     vi. For each transition j in the mini-batch, calculate the target yj:
        If is_terminalj is true: yj = rj
        Else: yj = rj + γ maxa′ Q̂(sj+1, a′; θ−)
     vii. Perform a gradient descent step on the loss function with respect to the online network parameters θ:
        L = (1/K) ∑j=1..K (yj − Q(sj, aj; θ))², then θ ← θ − α ∇θ L
     viii. Every C steps, update the target network weights: θ− ← θ.
  c. Optionally decay ϵ.

A schematic overview of the Deep Q-Network (DQN) algorithm, illustrating the interaction between the agent (containing the online and target networks), the environment, the replay buffer, and the training update mechanism.
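To connect the pseudocode to working code, here is a simplified training loop in Python. It reuses the hypothetical QNetwork and ReplayBuffer classes sketched earlier, assumes a Gymnasium-style environment env whose observations match the network's expected input, and uses placeholder hyperparameter values (learning rate, C, ϵ schedule) rather than recommended ones.

```python
import copy
import random
import numpy as np
import torch
import torch.nn.functional as F

# Assumed to exist: a Gymnasium-style `env`, plus the QNetwork and ReplayBuffer
# classes sketched above. All hyperparameter values below are placeholders.
num_actions = env.action_space.n
q_net = QNetwork(num_actions)
target_net = copy.deepcopy(q_net)                 # target weights theta^- start equal to theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
buffer = ReplayBuffer(capacity=100_000)

gamma, eps, eps_min, eps_decay = 0.99, 1.0, 0.05, 0.995
batch_size, C, step = 32, 10_000, 0

for episode in range(500):                        # for episode = 1 to M
    state, _ = env.reset()
    done = False
    while not done:                               # for t = 1 to T (or until termination)
        # i. epsilon-greedy action selection based on Q(s_t, .; theta)
        if random.random() < eps:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                q = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
            action = int(q.argmax(dim=1).item())

        # ii.-iv. act, observe, and store the transition in D
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        buffer.push(state, action, reward, next_state, float(terminated))
        state = next_state
        step += 1

        if len(buffer) >= batch_size:
            # v. sample a random mini-batch of K transitions
            s, a, r, s2, term = (
                torch.as_tensor(np.asarray(x), dtype=torch.float32)
                for x in buffer.sample(batch_size)
            )
            # vi. targets from the frozen target network (bootstrap dropped at terminals)
            with torch.no_grad():
                y = r + gamma * (1.0 - term) * target_net(s2).max(dim=1).values
            # vii. gradient descent step on the online network
            q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
            loss = F.mse_loss(q_sa, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # viii. every C steps, copy online weights into the target network
            if step % C == 0:
                target_net.load_state_dict(q_net.state_dict())

    eps = max(eps_min, eps * eps_decay)           # c. optionally decay epsilon
```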
The introduction of DQN was a landmark achievement, demonstrating that deep neural networks could be trained effectively for reinforcement learning tasks, enabling agents to learn complex policies directly from high-dimensional sensory inputs like pixels. It formed the foundation upon which many subsequent advancements in deep reinforcement learning have been built, several of which we will explore in the following sections of this chapter.