Standard Reinforcement Learning algorithms often struggle when state or action spaces become very large. Representing the value function or policy explicitly for every state (or state-action pair) becomes computationally infeasible. While function approximation offers a solution, simple linear approximators lack the capacity to capture the complex relationships present in many challenging problems, such as learning from raw pixel data in video games or robotic control from high-dimensional sensor inputs.
Deep Q-Networks (DQN) marked a significant step forward by integrating deep neural networks with the Q-learning algorithm. Instead of a table or a linear function, DQN uses a neural network to approximate the optimal action-value function, $Q^*(s, a)$.
The core idea is straightforward: we define a neural network, often called the Q-network, parameterized by weights $\theta$. This network takes a representation of the state as input and outputs a vector of Q-values, one for each possible action in that state. We denote the network's output for a given state and action as $Q(s, a; \theta)$.
For example, when learning to play Atari games from screen pixels, a common architecture passes a stack of recent frames through several convolutional layers that extract spatial features, followed by fully connected layers that output one Q-value per action:
A typical convolutional neural network architecture used in DQN for processing image-based inputs.
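As a concrete sketch, the network below follows this pattern in PyTorch (PyTorch is assumed here for all code examples; the layer sizes correspond to the commonly used setup of four stacked 84x84 grayscale frames, and the class name is illustrative):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """DQN-style convolutional Q-network (sketch; layer sizes assume 4 stacked 84x84 frames)."""
    def __init__(self, num_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),  # one Q-value per action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumes raw pixel values in [0, 255]; scale them into [0, 1] before the conv stack
        return self.head(self.features(x / 255.0))
```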
How do we train the weights $\theta$ of this Q-network? We adapt the standard Q-learning update rule. Recall the Bellman equation for the optimal action-value function $Q^*$:

$$Q^*(s, a) = \mathbb{E}_{s'}\!\left[ r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \right]$$
Q-learning iteratively approaches this optimum using updates based on experienced transitions $(s_t, a_t, r_t, s_{t+1})$. In the context of function approximation, we want our network to satisfy this equation. We can define a target value $y_t$ based on the reward received and the estimated value of the next state:

$$y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta)$$
We then train the network by minimizing the difference between the target $y_t$ and the network's current prediction $Q(s_t, a_t; \theta)$. A common loss function is the Mean Squared Error (MSE), often referred to as the Mean Squared Bellman Error (MSBE):

$$L(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1})}\!\left[ \big( y_t - Q(s_t, a_t; \theta) \big)^2 \right]$$
The expectation is typically approximated by averaging the squared error over a mini-batch of transitions sampled from the agent's experience. The network weights $\theta$ are updated using gradient descent methods (like RMSprop or Adam) to minimize this loss.
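A minimal sketch of this training step, assuming a mini-batch of transitions already converted to tensors (the function and variable names are illustrative, and this naive form bootstraps from the same network it is training, exactly as described above):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, states, actions, rewards, next_states, dones, gamma=0.99):
    """Mean squared Bellman error over a mini-batch (naive form: no target network yet)."""
    # Q(s_t, a_t; theta) for the actions that were actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Target y_t = r_t + gamma * max_a' Q(s_{t+1}, a'; theta); no bootstrap at terminal states
    with torch.no_grad():
        next_q = q_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q
    return F.mse_loss(q_values, targets)

# One optimization step (Adam assumed as the optimizer):
# optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
# loss = dqn_loss(q_net, states, actions, rewards, next_states, dones)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```

The gather call selects the predicted Q-value of the action actually taken in each transition, and the (1.0 - dones) factor removes the bootstrap term when the next state is terminal. Note, however, that the targets here shift after every gradient step, which is exactly the instability discussed next.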
Applying this update rule directly with a non-linear function approximator like a deep neural network proved to be notoriously unstable in early attempts. This instability arises primarily from two sources, related to the "Deadly Triad" mentioned in Chapter 1 (function approximation, bootstrapping, and off-policy learning):
Correlated samples: Consecutive transitions collected by the agent are highly correlated, so gradient updates computed from them violate the independence assumptions behind stochastic gradient descent and can cause the network to overfit to recent experience.
Moving targets: The target $y_t$ is computed with the same weights $\theta$ that the gradient step updates, so every update also shifts the target, and the network effectively chases a non-stationary objective.
The original DQN paper introduced two fundamental techniques to mitigate these instabilities and enable stable training of deep Q-networks:
Experience Replay: Instead of training on samples immediately as they are collected, transitions are stored in a large buffer called the replay memory or replay buffer ($\mathcal{D}$). During training, mini-batches of transitions are sampled randomly from this buffer. This breaks the temporal correlations between samples within a batch, leading to more stable and efficient learning. It also allows the agent to reuse past experiences multiple times. A short code sketch of this mechanism, together with the target network, follows the next item. (This will be detailed in the section "Experience Replay Mechanism").
Target Network: To address the moving target issue, DQN uses a separate, periodically updated target network, denoted $\hat{Q}$, with parameters $\theta^-$. This target network has the same architecture as the online Q-network $Q$ (with parameters $\theta$), but its weights are kept frozen for a number of steps. The target values are computed using this fixed target network:

$$y_t = r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a'; \theta^-)$$
The online network $Q$ is then trained using gradient descent to minimize the MSE loss $L(\theta) = \mathbb{E}\big[ ( y_t - Q(s_t, a_t; \theta) )^2 \big]$. Periodically (e.g., every $C$ training steps), the weights of the target network are updated by copying the weights from the online network: $\theta^- \leftarrow \theta$. This use of a fixed target network significantly stabilizes the learning process. (This will be detailed in the section "Target Networks for Training Stability").
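Here is a minimal sketch of these two mechanisms, reusing the hypothetical QNetwork class from earlier; the buffer capacity is an arbitrary illustrative value, and the class and variable names are our own rather than part of any fixed API.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer that stores transitions and samples them uniformly at random."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)  # random draws break temporal correlations
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

# Target network: a frozen copy of the online network, synced every C steps.
# q_net = QNetwork(num_actions)                    # online network, parameters theta
# target_net = QNetwork(num_actions)               # target network, parameters theta_minus
# target_net.load_state_dict(q_net.state_dict())   # theta_minus <- theta
# for p in target_net.parameters():
#     p.requires_grad_(False)                      # targets never receive gradients
```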
Combining these ideas leads to the following algorithm:
Initialize the replay buffer $\mathcal{D}$, the online network weights $\theta$ (randomly), and the target network weights $\theta^- \leftarrow \theta$.
For episode = 1 to M:
a. Observe the initial state $s_1$.
b. For t = 1 to T (or until episode termination):
i. Select action $a_t$ using an $\epsilon$-greedy strategy based on $Q(s_t, \cdot; \theta)$ (i.e., with probability $\epsilon$ select a random action, otherwise select $a_t = \arg\max_a Q(s_t, a; \theta)$).
ii. Execute action $a_t$ in the environment.
iii. Observe reward $r_t$ and next state $s_{t+1}$. Determine whether $s_{t+1}$ is a terminal state.
iv. Store the transition $(s_t, a_t, r_t, s_{t+1}, \text{is\_terminal})$ in $\mathcal{D}$.
v. Sample a random mini-batch of transitions $(s_j, a_j, r_j, s_{j+1})$ from $\mathcal{D}$.
vi. For each transition in the mini-batch, calculate the target $y_j$:
If $\text{is\_terminal}$ is true: $y_j = r_j$
Else: $y_j = r_j + \gamma \max_{a'} \hat{Q}(s_{j+1}, a'; \theta^-)$
vii. Perform a gradient descent step on the loss $\big( y_j - Q(s_j, a_j; \theta) \big)^2$, averaged over the mini-batch, with respect to the online network parameters $\theta$.
viii. Every $C$ steps, update the target network weights: $\theta^- \leftarrow \theta$.
c. Optionally decay $\epsilon$.
A schematic overview of the Deep Q-Network (DQN) algorithm, illustrating the interaction between the agent (containing the online and target networks), the environment, the replay buffer, and the training update mechanism.
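Putting these steps together, one way the full loop might look in code is sketched below, reusing the QNetwork and ReplayBuffer from the earlier sketches. Observations are assumed to arrive already preprocessed into the format the network expects, the environment is assumed to follow the Gymnasium reset/step API, and every hyperparameter value is illustrative rather than prescriptive.

```python
import numpy as np
import torch
import torch.nn.functional as F

def train_dqn(env, num_episodes=500, batch_size=32, gamma=0.99,
              sync_every=1_000, epsilon=1.0, epsilon_min=0.05, epsilon_decay=0.995):
    num_actions = env.action_space.n
    q_net = QNetwork(num_actions)                    # online network, parameters theta
    target_net = QNetwork(num_actions)               # target network, parameters theta_minus
    target_net.load_state_dict(q_net.state_dict())   # theta_minus <- theta
    buffer = ReplayBuffer()
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
    step = 0

    for episode in range(num_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # i. epsilon-greedy action selection using the online network
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    q = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
                action = int(q.argmax(dim=1).item())

            # ii-iv. act, observe, and store the transition in the replay buffer
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            buffer.push(state, action, reward, next_state, float(terminated))
            state = next_state
            step += 1

            # v-vii. sample a mini-batch, build targets with the target network, take a gradient step
            if len(buffer) >= batch_size:
                s, a, r, s2, d = buffer.sample(batch_size)
                s = torch.as_tensor(np.asarray(s), dtype=torch.float32)
                a = torch.as_tensor(a, dtype=torch.int64)
                r = torch.as_tensor(r, dtype=torch.float32)
                s2 = torch.as_tensor(np.asarray(s2), dtype=torch.float32)
                d = torch.as_tensor(d, dtype=torch.float32)

                q_values = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
                with torch.no_grad():
                    targets = r + gamma * (1.0 - d) * target_net(s2).max(dim=1).values
                loss = F.mse_loss(q_values, targets)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            # viii. every C (= sync_every) steps, copy theta into theta_minus
            if step % sync_every == 0:
                target_net.load_state_dict(q_net.state_dict())

        # c. decay epsilon toward its minimum after each episode
        epsilon = max(epsilon_min, epsilon * epsilon_decay)

    return q_net
```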
The introduction of DQN was a landmark achievement, demonstrating that deep neural networks could be trained effectively for reinforcement learning tasks, enabling agents to learn complex policies directly from high-dimensional sensory inputs like pixels. It formed the foundation upon which many subsequent advancements in deep reinforcement learning have been built, several of which we will explore in the following sections of this chapter.