Now that we understand the architecture of a Deep Q-Network, including the use of Experience Replay and Target Networks, the next logical step is to define how we actually train the main Q-network. How do we adjust its parameters, represented by θ, so that its output Q(s,a;θ) gets closer to the true optimal action-value function Q∗(s,a)?
Recall that in standard Q-learning, we iteratively update our estimate of Q(s,a) based on the Bellman equation. We move our current estimate towards a target value derived from the observed reward and the estimated value of the next state. DQN borrows this core idea but adapts it for function approximation using neural networks.
We need a way to quantify the error between our network's current prediction for a given state-action pair (s,a) and a target value. This error measure is defined by a loss function. The goal of training is to minimize this loss.
For a transition (s,a,r,s′) sampled from the experience replay buffer, our Q-network predicts the value Q(s,a;θ). What should the target value be? Just like in Q-learning, the target aims to incorporate the immediate reward r and the discounted value of the best action possible from the next state s′. However, to improve stability (as discussed in the previous section on Target Networks), we use the target network with its parameters θ− to estimate the value of the next state.
The target value, often denoted as y, is calculated as:
$$y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$$

Here, γ is the discount factor, and max_{a′} Q(s′, a′; θ−) represents the maximum Q-value predicted by the target network for the next state s′ across all possible next actions a′. If s′ is a terminal state, the target is simply y = r.
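To make this concrete, here is a minimal PyTorch-style sketch of the target computation for a sampled mini-batch. The function name compute_targets and the batched tensors (rewards, next_states, dones) are illustrative assumptions, not a fixed API:

```python
import torch

def compute_targets(target_net, rewards, next_states, dones, gamma=0.99):
    """Compute y = r + gamma * max_a' Q(s', a'; theta^-) for a batch of transitions.

    dones is a float tensor holding 1.0 where s' is terminal and 0.0 otherwise,
    so the bootstrapped term is dropped and y = r at episode boundaries.
    """
    with torch.no_grad():  # the target network provides constants, not gradients
        next_q = target_net(next_states)       # shape (N, num_actions)
        max_next_q = next_q.max(dim=1).values  # max over a' of Q(s', a'; theta^-)
    return rewards + gamma * (1.0 - dones) * max_next_q
```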
With the predicted value Q(s,a;θ) and the target value y, we can now define the loss. Since we want our prediction to be as close as possible to the target, a natural choice is the Mean Squared Error (MSE) loss, commonly used in regression problems. For a single transition, the squared error is (y−Q(s,a;θ))2.
In practice, we compute the loss over a mini-batch of transitions sampled from the experience replay buffer D. The loss function L(θ) that we aim to minimize is the expectation of this squared error over sampled transitions:
$$L(\theta) = \mathbb{E}_{(s,a,r,s') \sim D}\left[\left(y - Q(s, a; \theta)\right)^2\right]$$

Or, more practically, for a mini-batch of N transitions (sᵢ, aᵢ, rᵢ, s′ᵢ), i = 1, …, N:
$$L(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( r_i + \gamma \max_{a'} Q(s'_i, a'; \theta^-) - Q(s_i, a_i; \theta) \right)^2$$

This loss function measures the average squared difference between the target values (calculated using the target network) and the Q-values predicted by our main network.
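As a sketch of how this mini-batch loss looks in code (again assuming PyTorch, a hypothetical q_net module, integer action indices, and targets produced as in the earlier sketch):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, states, actions, targets):
    """Mean squared error between Q(s_i, a_i; theta) and the fixed targets y_i."""
    q_all = q_net(states)                                      # shape (N, num_actions)
    q_pred = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)  # select Q(s_i, a_i; theta)
    return F.mse_loss(q_pred, targets)                         # (1/N) * sum_i (y_i - q_i)^2
```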
The training process involves iteratively performing the following steps (a code sketch of one full iteration follows this list):

1. Sample a mini-batch of N transitions (sᵢ, aᵢ, rᵢ, s′ᵢ) from the experience replay buffer D.
2. Compute the target value yᵢ for each transition using the target network, with yᵢ = rᵢ when s′ᵢ is terminal.
3. Compute the loss L(θ) between the targets yᵢ and the main network's predictions Q(sᵢ, aᵢ; θ).
4. Perform a gradient descent step on L(θ) with respect to the main network's parameters:

$$\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta)$$

where α is the learning rate.
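Putting the pieces together, one training iteration might look like the following sketch, which reuses the hypothetical compute_targets and dqn_loss helpers above and assumes a replay buffer whose sample method returns batched tensors:

```python
import torch

def train_step(q_net, target_net, optimizer, replay_buffer, batch_size=32, gamma=0.99):
    """Perform one gradient update of the main network's parameters theta."""
    states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)

    targets = compute_targets(target_net, rewards, next_states, dones, gamma)
    loss = dqn_loss(q_net, states, actions, targets)

    optimizer.zero_grad()
    loss.backward()   # gradients flow only through Q(s, a; theta)
    optimizer.step()  # e.g. theta <- theta - alpha * grad L(theta) for plain SGD
    return loss.item()
```

Here the learning rate α lives inside the optimizer, for example torch.optim.SGD(q_net.parameters(), lr=alpha) or an adaptive optimizer such as Adam.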
It's important to note that during the calculation of the gradient ∇θL(θ), the target values y are treated as fixed constants. The gradients only flow through the main network parameters θ, not the target network parameters θ−. This decoupling, achieved by using the separate target network and experience replay, is fundamental to stabilizing the training process for DQNs. Without these mechanisms, the constantly shifting targets and correlated data samples would make convergence difficult.
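In autograd terms, this is what the torch.no_grad() block in the earlier target sketch accomplishes. Detaching the bootstrapped term is an equivalent one-line alternative, shown here with the same hypothetical batch tensors:

```python
# Equivalent to the no_grad() block: detach the target network's output so
# autograd treats y as a constant and only Q(s, a; theta) contributes gradients.
y = rewards + gamma * (1.0 - dones) * target_net(next_states).max(dim=1).values.detach()
```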
Periodically, after a set number of training steps, the weights of the target network θ− are updated to match the weights of the main network θ. This ensures the target values gradually adapt to the improving policy represented by the main network, while still providing stability over shorter timescales.
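A common way to implement this hard update, assuming q_net and target_net are PyTorch nn.Module instances and target_update_every is a hyperparameter you choose:

```python
# Every target_update_every training steps, copy theta into theta^-.
if step % target_update_every == 0:
    target_net.load_state_dict(q_net.state_dict())
```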
Understanding this loss function and the training procedure is central to implementing and tuning DQN agents. In the next section, we'll walk through a practical implementation.