When we train a neural network to approximate Q-values, we face a potential instability issue inherent in the update process. Recall the basic idea behind the updates in Q-learning, which inspires the DQN loss function. We want our Q-network, parameterized by weights θ, to predict a value Q(St,At;θ) that gets closer to the target value, often calculated as:
$$y_t \approx R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a'; \theta)$$

Notice something important here: the weights θ appear on both sides of this expression. We are adjusting θ to make Q(St,At;θ) match a target that also depends on θ.
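A minimal PyTorch sketch makes this concrete. The network architecture, state and action sizes, and discount factor here are illustrative assumptions, not part of DQN itself:

```python
import torch
import torch.nn as nn

# Illustrative Q-network for a small problem (4-dim state, 2 actions).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
gamma = 0.99

def naive_td_target(reward, next_state):
    # The target is computed from the same weights theta we are updating,
    # so every gradient step on q_net also shifts this target.
    with torch.no_grad():
        return reward + gamma * q_net(next_state).max(dim=1).values
```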
Imagine trying to learn the distance to a specific point, but every time you take a measurement and adjust your estimate, the point itself moves slightly based on your adjustment. This makes convergence difficult. Similarly, in standard Q-learning with a neural network, every gradient update step changes the weights θ. This, in turn, changes the Q-values for the next state St+1, effectively shifting the target yt before the network has converged towards the previous target.
This "moving target" problem can lead to oscillations and divergence during training. The network might struggle to settle on accurate Q-values because the values it's trying to predict are constantly changing as a direct result of the learning process itself. This instability was a significant hurdle in early attempts to combine deep learning with reinforcement learning.
To address this instability, the DQN algorithm introduced a clever technique: using a separate Target Network. The core idea is to calculate the target value yt using a fixed set of parameters that are not updated immediately.
Here's how it works:
We maintain two neural networks: an Online Network with weights θ, which is updated at every training step and produces the prediction Q(St,At;θ), and a Target Network with weights θ−, a periodically refreshed copy of the online network that is used only to compute the TD targets.
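In code, the target network is typically just a frozen copy of the online network. A minimal sketch, with the architecture again an illustrative assumption:

```python
import copy
import torch.nn as nn

# Online network: trained at every step.
online_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

# Target network: starts as an exact copy and is never trained directly;
# its weights change only when we copy them over from the online network.
target_net = copy.deepcopy(online_net)
for param in target_net.parameters():
    param.requires_grad_(False)
```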
When calculating the TD target for our loss function, we use the Target Network:
$$y_t = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a'; \theta^{-})$$

Crucially, the weights θ− used here are held constant for a period of time.
The loss function is then calculated using the online network's prediction and this stable target:
$$L(\theta) = \mathbb{E}_{(S_t, A_t, R_{t+1}, S_{t+1}) \sim \mathcal{D}}\left[ \left( y_t - Q(S_t, A_t; \theta) \right)^2 \right]$$

$$L(\theta) = \mathbb{E}_{(S_t, A_t, R_{t+1}, S_{t+1}) \sim \mathcal{D}}\left[ \left( R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a'; \theta^{-}) - Q(S_t, A_t; \theta) \right)^2 \right]$$

Only the online network's weights θ are updated based on the gradient of this loss L(θ). The target network weights θ− are not modified by this gradient step.
This decouples the target calculation from the weights being actively updated, providing a more stable learning signal.
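Putting the loss together, a sketch in PyTorch might look like the following. The batch layout, the mean-squared-error loss, and the (1 − done) term for terminal states are standard choices assumed here rather than prescribed by the formula above:

```python
import torch
import torch.nn.functional as F

gamma = 0.99

def dqn_loss(batch, online_net, target_net):
    """DQN loss for a batch of transitions (S_t, A_t, R_{t+1}, S_{t+1}, done)."""
    states, actions, rewards, next_states, dones = batch

    # Q(S_t, A_t; theta): the online network's prediction for the actions taken.
    q_pred = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # y_t is built from the frozen weights theta^-; no gradient flows through it.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        # (1 - dones) zeroes the bootstrap term at episode ends.
        y = rewards + gamma * (1.0 - dones) * next_q

    return F.mse_loss(q_pred, y)
```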
Diagram illustrating the DQN training process with Online and Target Networks. The Target Network (θ−) provides stable targets for the loss calculation, while the Online Network (θ) is updated via optimization. Weights are periodically copied from the Online to the Target Network.
If the target network weights θ− never changed, the online network θ would be chasing an increasingly outdated target. Therefore, we need to update θ− periodically to reflect the progress made by the online network.
A common and effective strategy is to perform a hard update: Every C training steps (where C is a hyperparameter, often hundreds or thousands of steps), copy the weights from the online network to the target network:
$$\theta^{-} \leftarrow \theta$$

Between these updates, θ− remains fixed. This periodic update ensures the target network slowly tracks the learned policy, providing stability without being completely static. The choice of C involves a trade-off: a smaller C makes the target track the online network more quickly (potentially reintroducing some instability), while a larger C provides more stability but might slow down learning if the target becomes too outdated.
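As a sketch, the hard update is a single weight copy guarded by a step counter; the value of C below is an illustrative assumption:

```python
TARGET_UPDATE_EVERY = 1_000  # the hyperparameter C; illustrative value

def maybe_hard_update(step, online_net, target_net):
    # Every C training steps, copy theta into theta^-.
    # Between copies, theta^- stays frozen.
    if step % TARGET_UPDATE_EVERY == 0:
        target_net.load_state_dict(online_net.state_dict())
```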
Note: Another approach is a "soft update" or Polyak averaging, where the target weights are updated slightly at every step: θ−←τθ+(1−τ)θ−, with τ≪1. This provides smoother updates but adds another hyperparameter τ. For simplicity, we often start with hard updates.
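For comparison, a soft update applies the Polyak average after every training step; τ = 0.005 below is just an illustrative choice:

```python
import torch

def soft_update(online_net, target_net, tau=0.005):
    # theta^- <- tau * theta + (1 - tau) * theta^-
    with torch.no_grad():
        for p_online, p_target in zip(online_net.parameters(),
                                      target_net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p_online)
```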
Using a target network provides several benefits: the online network learns against a stable, slowly changing target rather than one that shifts with every gradient step, the oscillations and divergence caused by the moving-target problem are reduced, and the target value yt is decoupled from the weights θ being optimized.
Target networks work effectively alongside Experience Replay. While experience replay breaks the temporal correlation between consecutive samples drawn from the environment, the target network breaks the correlation between the network's current estimate Q(St,At;θ) and the target value yt. Together, these two techniques (Experience Replay and Target Networks) were fundamental to the success of the original DQN algorithm and remain standard practice in many deep reinforcement learning implementations. They provide the stability needed for deep neural networks to learn effectively from reinforcement signals.
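The two techniques combine into a training step such as the sketch below, which reuses the helpers defined earlier; the buffer size, batch size, learning rate, and the assumption that each stored transition holds tensors are all illustrative:

```python
import random
from collections import deque

import torch

BATCH_SIZE = 32
replay_buffer = deque(maxlen=100_000)  # stores (s, a, r, s', done) tensor tuples
optimizer = torch.optim.Adam(online_net.parameters(), lr=1e-4)

def train_step(step):
    if len(replay_buffer) < BATCH_SIZE:
        return

    # Experience replay: sample a batch of decorrelated transitions.
    transitions = random.sample(replay_buffer, BATCH_SIZE)
    states, actions, rewards, next_states, dones = (
        torch.stack(column) for column in zip(*transitions)
    )

    # Target network: dqn_loss computes y_t with the frozen weights theta^-.
    loss = dqn_loss((states, actions, rewards, next_states, dones),
                    online_net, target_net)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Periodically refresh theta^- from theta.
    maybe_hard_update(step, online_net, target_net)
```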