In the previous section, we discussed how experience replay helps break the correlation between consecutive training samples. However, another significant source of instability arises when training Q-networks: the constantly changing target value used in the Temporal Difference (TD) update.
Recall the standard Q-learning update rule applied to function approximation. We aim to minimize the difference between our current Q-value estimate $Q(S_t, A_t; \theta)$ and a target value. This target is typically calculated using the reward $R_{t+1}$ and the estimated value of the next state, $S_{t+1}$. In Q-learning, this involves finding the maximum Q-value for the next state using the current network parameters $\theta_t$:
$$\text{Target}_t = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a'; \theta_t)$$

The loss function, often the Mean Squared Error (MSE), would then be something like:

$$L(\theta_t) = \mathbb{E}\left[\left(R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a'; \theta_t) - Q(S_t, A_t; \theta_t)\right)^2\right]$$

Notice that the parameters $\theta_t$ appear in both the target calculation and the value we are trying to adjust, $Q(S_t, A_t; \theta_t)$. When we perform gradient descent to update $\theta_t$, we are essentially chasing a moving target. As the network's weights $\theta_t$ change at each step, the target value itself shifts. This interdependence can lead to oscillations or even divergence during training, making learning unstable. It's like trying to hit a target that moves every time you adjust your aim.
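To make the issue concrete, here is a minimal PyTorch-style sketch of this naive loss. The network architecture, state dimension, and batch format are illustrative assumptions, not part of the original DQN setup; the point is that the same parameters produce both the prediction and the target.

```python
import torch
import torch.nn as nn

# A small Q-network; the architecture (4-dimensional states, 2 discrete
# actions) is an illustrative assumption.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

def naive_td_loss(states, actions, rewards, next_states, gamma=0.99):
    # Prediction Q(S_t, A_t; θ_t) from the network being trained.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Target computed with the SAME parameters θ_t. Even though gradients
        # are stopped here, the target's value still shifts after every
        # optimizer step, because θ_t itself keeps changing.
        q_next = q_net(next_states).max(dim=1).values
    return nn.functional.mse_loss(q_pred, rewards + gamma * q_next)
```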
To address this "moving target" problem, the DQN algorithm introduces a second neural network: the target network. This target network, whose parameters we denote by $\theta^-$, is essentially a clone of the online Q-network (the one we are actively training, with parameters $\theta$).
Here's how it works:

1. At the start of training, the target network is initialized as an exact copy of the online network: $\theta^- \leftarrow \theta$.
2. For each training update, the TD target is computed using the target network's parameters: $y_t = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a'; \theta^-)$.
3. The online network's parameters $\theta$ are updated by gradient descent on the loss between $Q(S_t, A_t; \theta)$ and $y_t$; the target network receives no gradient updates.
4. Every $C$ steps, the target network is refreshed by copying the online network's weights: $\theta^- \leftarrow \theta$.
This mechanism provides stability because the target value $y_t$ remains fixed for $C$ consecutive updates of the online network $\theta$. The online network is now learning to approximate a stationary target, which significantly simplifies the learning dynamics and reduces the likelihood of oscillations and divergence. The update interval $C$ is a hyperparameter that needs to be chosen; typical values range from hundreds to thousands of steps, depending on the specific problem.
Interaction flow showing how the online network $Q(S_t, A; \theta)$ and the target network $Q(S_{t+1}, a'; \theta^-)$ are used during a DQN update step. The online network parameters $\theta$ are updated frequently via gradient descent, while the target network parameters $\theta^-$ are updated only periodically by copying from $\theta$.
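The following sketch shows one way this can be implemented in PyTorch. The small network definition is again an illustrative assumption rather than the architecture used in the original DQN; the essential parts are the frozen clone and the periodic hard copy.

```python
import copy

import torch
import torch.nn as nn

# Online network (parameters θ) and its clone, the target network (parameters θ⁻).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)
target_net.requires_grad_(False)  # the target network is never trained directly

def td_loss(states, actions, rewards, next_states, gamma=0.99):
    # Prediction Q(S_t, A_t; θ) from the online network.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Target y_t = R_{t+1} + γ max_a' Q(S_{t+1}, a'; θ⁻) from the frozen
        # clone, so it stays constant between syncs no matter how often θ changes.
        q_next = target_net(next_states).max(dim=1).values
    return nn.functional.mse_loss(q_pred, rewards + gamma * q_next)

def sync_target_net():
    # Hard update every C gradient steps: θ⁻ ← θ.
    target_net.load_state_dict(q_net.state_dict())
```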
By combining fixed Q-targets with experience replay, DQN addresses two major sources of instability inherent in applying Q-learning with complex function approximators like deep neural networks. Experience replay decorrelates the data samples, and target networks provide stable targets for the learning updates. Together, these techniques were fundamental to the success of early DQNs in learning to play Atari games from raw pixel inputs.
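As a rough illustration of how the two pieces fit together, here is a skeleton training loop. It reuses `q_net`, `td_loss`, and `sync_target_net` from the sketch above; `env` (with a simplified `reset()`/`step()` interface) and `select_action` are hypothetical placeholders, as are the buffer size, batch size, and sync interval.

```python
import random
from collections import deque

import torch

# Experience replay buffer plus periodic target-network sync.
replay_buffer = deque(maxlen=100_000)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
BATCH_SIZE, SYNC_EVERY = 32, 1_000   # SYNC_EVERY plays the role of C

state = env.reset()                                   # hypothetical environment
for step in range(1, 200_001):
    action = select_action(state)                     # e.g. epsilon-greedy on q_net
    next_state, reward, done = env.step(action)
    replay_buffer.append((state, action, reward, next_state))
    state = env.reset() if done else next_state

    if len(replay_buffer) >= BATCH_SIZE:
        # Uniform random sampling decorrelates consecutive transitions.
        batch = random.sample(replay_buffer, BATCH_SIZE)
        states, actions, rewards, next_states = map(list, zip(*batch))
        loss = td_loss(
            torch.as_tensor(states, dtype=torch.float32),
            torch.as_tensor(actions, dtype=torch.int64),
            torch.as_tensor(rewards, dtype=torch.float32),
            torch.as_tensor(next_states, dtype=torch.float32),
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    if step % SYNC_EVERY == 0:
        sync_target_net()                             # θ⁻ ← θ every C steps
```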