As discussed in the previous section, standard Q-learning and its deep learning extension, DQN, suffer from a tendency to overestimate the value of actions. This occurs because the update rule involves a maximization step over potentially noisy or inaccurate Q-value estimates. Let's look at the standard DQN target calculation again:
$$y_t^{\text{DQN}} = r_t + \gamma \max_{a'} Q_{\theta'}(s_{t+1}, a')$$

Here, the target network $\theta'$ is used for two purposes: first, to select the action $a'$ that is believed to yield the highest Q-value in the next state $s_{t+1}$, and second, to evaluate the Q-value of that selected action. If the target network happens to overestimate the value for any action $a'$, the max operator will likely pick that overestimated value, leading to a positively biased target $y_t^{\text{DQN}}$. This consistent overestimation can propagate through the learning process, potentially resulting in suboptimal policy convergence and instability.
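To make this concrete, here is a minimal PyTorch-style sketch of how this target is typically computed from a sampled replay batch. The names (`target_net`, `rewards`, `next_states`, `dones`) are illustrative assumptions, not tied to any particular library:

```python
import torch

def dqn_target(rewards, next_states, dones, target_net, gamma=0.99):
    """Standard DQN target: the target network both selects and evaluates."""
    with torch.no_grad():
        # Q_theta'(s_{t+1}, .) for every action: shape [batch, num_actions]
        next_q_values = target_net(next_states)
        # max_a' Q_theta'(s_{t+1}, a'): shape [batch]
        max_next_q = next_q_values.max(dim=1).values
        # Zero out the bootstrap term for terminal transitions.
        return rewards + gamma * max_next_q * (1.0 - dones)
```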
Double DQN (DDQN), proposed by Hado van Hasselt, Arthur Guez, and David Silver in 2015, offers an elegant solution to mitigate this overestimation bias. The core idea is to decouple the selection of the best next action from the evaluation of that action's value. Instead of using the same network (the target network) for both tasks, DDQN uses two different networks.
Specifically, DDQN modifies the target calculation as follows:

$$y_t^{\text{DDQN}} = r_t + \gamma\, Q_{\theta'}\!\left(s_{t+1}, \operatorname*{argmax}_{a'} Q_{\theta}(s_{t+1}, a')\right)$$
Compare this carefully to the standard DQN target. In DQN, the $\operatorname{argmax}$ and the value estimation both rely solely on the target network $Q_{\theta'}$. In DDQN, the $\operatorname{argmax}$ uses the online network $Q_{\theta}$, while the value estimation uses the target network $Q_{\theta'}$.
The online network $\theta$ and the target network $\theta'$ represent different sets of parameters (recall that $\theta'$ is typically a delayed copy of $\theta$). While both networks might produce estimation errors, it is less likely that both will overestimate the value of the same action simultaneously.
By using the online network to select the action ($a^* = \operatorname{argmax}_{a'} Q_{\theta}(s_{t+1}, a')$), we are still picking what the current policy believes is best. However, by then using the target network to evaluate that specific action's value ($Q_{\theta'}(s_{t+1}, a^*)$), we get a potentially less biased estimate. If the online network mistakenly picks an action whose value is overestimated by $Q_{\theta}$ but not significantly overestimated by $Q_{\theta'}$, the resulting target value $y_t^{\text{DDQN}}$ will be less inflated than $y_t^{\text{DQN}}$ would have been. This helps break the positive feedback loop of overestimation.
The diagram below illustrates the difference in how the target value is computed in DQN versus DDQN.
Comparison of target value computation flows. DQN uses only the target network for both selecting the maximum-value action and evaluating its value. DDQN uses the online network to select the action and the target network to evaluate that chosen action.
Implementing DDQN requires only a small modification to a standard DQN implementation. Instead of calculating `max(target_q_values)` for the next state, you need to:

1. Use the online network to compute the $\operatorname{argmax}$ action for each next state.
2. Use the target network to evaluate the Q-value of that selected action (see the sketch below).

The rest of the DQN machinery, including experience replay and the periodic updating of the target network, remains unchanged. The computational overhead added by DDQN is typically negligible, as it mainly involves an extra forward pass through the online network during the target calculation step, which is usually much less expensive than the backward pass for gradient updates.
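As a rough illustration, here is how those two steps might look in PyTorch, mirroring the DQN sketch above. Again, `online_net` and `target_net` are placeholder names for whatever networks your implementation uses:

```python
import torch

def ddqn_target(rewards, next_states, dones, online_net, target_net, gamma=0.99):
    """Double DQN target: online network selects, target network evaluates."""
    with torch.no_grad():
        # Step 1: a* = argmax_a' Q_theta(s_{t+1}, a') using the online network.
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # [batch, 1]
        # Step 2: Q_theta'(s_{t+1}, a*) using the target network.
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # [batch]
        # Zero out the bootstrap term for terminal transitions.
        return rewards + gamma * next_q * (1.0 - dones)
```

The only difference from the DQN version is the extra forward pass through `online_net` to pick `best_actions`; the loss computation and the rest of the training loop stay the same.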
By reducing the overestimation bias, DDQN often leads to more stable training and can converge to better policies compared to the original DQN algorithm, making it a valuable and widely used improvement in deep reinforcement learning.