As we discussed, directly applying Q-learning updates with non-linear function approximators like deep neural networks can lead to unstable training. The core issue arises from the fact that the target value used in the Temporal Difference (TD) error calculation depends on the same network parameters that are currently being updated. Let's look at the standard Q-learning update target for a transition $(s_t, a_t, r_t, s_{t+1})$:
$$y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta)$$

The loss function, often Mean Squared Error (MSE), would then be calculated using this target $y_t$ and the current Q-value estimate $Q(s_t, a_t; \theta)$:
$$L(\theta) = \mathbb{E}\left[\left(y_t - Q(s_t, a_t; \theta)\right)^2\right] = \mathbb{E}\left[\left(r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta) - Q(s_t, a_t; \theta)\right)^2\right]$$

Notice that the parameters $\theta$ appear in both the target value calculation (through $\max_{a'} Q(s_{t+1}, a'; \theta)$) and the value being updated ($Q(s_t, a_t; \theta)$). When we compute the gradient and update $\theta$, the target $y_t$ also shifts. This "moving target" problem can cause oscillations or even divergence during training, as the network tries to chase a constantly changing objective.
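To make the coupling concrete, here is a minimal PyTorch sketch of the single-network setup; the network shape, the name `naive_td_loss`, and the batch tensors are illustrative assumptions rather than code from any particular library. Even though the target is treated as a constant during backpropagation, it is recomputed from the same parameters $\theta$ after every update, so it keeps moving.

```python
import torch
import torch.nn as nn

# Illustrative online network; the 4-dim state and 2 actions are placeholders.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

def naive_td_loss(states, actions, rewards, next_states, gamma=0.99):
    # Q(s_t, a_t; theta) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # The target is computed from the SAME parameters theta that the next
    # gradient step will change, so y_t shifts after every update.
    with torch.no_grad():
        next_q = q_net(next_states).max(dim=1).values
        y = rewards + gamma * next_q
    return nn.functional.mse_loss(q_sa, y)
```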
To mitigate this instability, the DQN algorithm introduced the concept of a target network. This is a separate neural network whose architecture is identical to the main Q-network (often called the online network or policy network) but whose parameters are kept frozen for a period.
Let's denote the parameters of the online network as $\theta$ and the parameters of the target network as $\theta^-$. The target network is used specifically to calculate the TD target value. The modified TD target becomes:
$$y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-)$$

Now, the loss function is calculated as:
$$L(\theta) = \mathbb{E}\left[\left(y_t - Q(s_t, a_t; \theta)\right)^2\right] = \mathbb{E}\left[\left(r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-) - Q(s_t, a_t; \theta)\right)^2\right]$$

Crucially, the parameters $\theta^-$ in the target calculation are held fixed during the gradient computation and update step for the online network parameters $\theta$. The online network $Q(s, a; \theta)$ is updated based on the stable targets provided by the target network $Q(s, a; \theta^-)$.
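A minimal sketch of this loss in PyTorch could look as follows, using the same illustrative network shape as above; terminal-state masking is omitted here because the formulas above omit it as well.

```python
import copy
import torch
import torch.nn as nn

# Online network (theta) and a frozen copy serving as the target network (theta^-).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)
target_net.requires_grad_(False)  # theta^- is never updated by gradient descent

def td_loss(states, actions, rewards, next_states, gamma=0.99):
    # Q(s_t, a_t; theta): gradients flow through this term only.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # y_t = r_t + gamma * max_a' Q(s_{t+1}, a'; theta^-), constant w.r.t. theta.
        next_q = target_net(next_states).max(dim=1).values
        y = rewards + gamma * next_q
    return nn.functional.mse_loss(q_sa, y)
```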
The target network's parameters $\theta^-$ cannot remain fixed indefinitely; otherwise, the target values would become outdated and prevent the agent from learning improved Q-values. The standard approach in DQN is to periodically copy the parameters from the online network to the target network.
This update happens every $C$ training steps (where $C$ is a hyperparameter):

$$\theta^- \leftarrow \theta \quad \text{every } C \text{ steps}$$

Between these updates, $\theta^-$ remains constant, providing a stable target for the online network's updates over $C$ steps.
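Continuing the sketch above, the hard update is simply a parameter copy gated on the step counter; the value of `C` below is only an example, not a recommendation.

```python
C = 10_000  # example update period; suitable values vary widely by environment

def maybe_hard_update(step, q_net, target_net):
    # Every C gradient steps, copy theta into theta^-; otherwise leave theta^- frozen.
    if step % C == 0:
        target_net.load_state_dict(q_net.state_dict())
```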
Diagram: the online network $Q(s, a; \theta)$ is updated by gradients of the loss, while the target network $Q(s', a'; \theta^-)$ provides stable targets $y_t$ for the loss calculation. The target network's parameters $\theta^-$ are periodically updated by copying the online network's parameters $\theta$.
Using a target network significantly improves the stability of training deep Q-networks. By decoupling the target calculation from the immediately updated parameters, it prevents the destructive feedback loops that can arise from the "moving target" problem.
The target network update frequency $C$ is an important hyperparameter. Typical values in practice range from hundreds to tens of thousands of update steps, depending on the environment and the specific algorithm configuration. Finding a suitable value often requires empirical tuning.
An alternative to the periodic "hard" copy is the "soft" update, also known as Polyak averaging. This method updates the target network parameters slowly towards the online network parameters at every training step:
$$\theta^- \leftarrow \tau \theta + (1 - \tau) \theta^-$$

Here, $\tau$ is a small constant (e.g., $\tau = 0.001$ or $\tau = 0.005$). This approach results in a target network that changes more smoothly over time compared to the periodic hard updates. While originally popularized in algorithms like Deep Deterministic Policy Gradient (DDPG) for continuous control, soft updates can also be used with DQN variants.
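A soft update can be implemented as a parameter-wise in-place interpolation. The sketch below is a generic version that takes any pair of matching PyTorch modules, with $\tau = 0.005$ used purely as an example; calling it once per training step replaces the periodic hard copy.

```python
import torch

@torch.no_grad()
def soft_update(online_net, target_net, tau=0.005):
    # theta^-  <-  tau * theta + (1 - tau) * theta^-, applied parameter by parameter
    for p, p_target in zip(online_net.parameters(), target_net.parameters()):
        p_target.lerp_(p, tau)  # in-place Polyak average
```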
In summary, target networks are a foundational technique for stabilizing deep Q-learning. By using a periodically updated copy of the main network to generate TD targets, they break the correlation that leads to instability, allowing deep neural networks to be trained effectively for complex reinforcement learning tasks.