As we established, tabular Q-learning hits a wall when dealing with environments that have a vast number of states. Imagine trying to store a Q-value for every possible configuration of pixels on a screen. It's simply not feasible. Deep Q-Networks (DQN) address this by replacing the Q-table with a deep neural network, often called the Q-network. This network acts as a function approximator, learning to estimate the Q-values.
Instead of looking up a value in a table, we feed the agent's current state representation, s, into the neural network. The network's job is to output an estimate of the action-value function, Q(s,a), for all possible actions a available in that state s.
Let θ represent the parameters (weights and biases) of this neural network. We can denote the network's output for state s and action a as Q(s,a;θ).
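To make the idea concrete, here is a minimal sketch of such a Q-network in PyTorch. The fully connected architecture, hidden size, and class name are illustrative assumptions rather than a prescribed design; the essential property is that the network maps a state vector to one Q-value per available action.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """A small fully connected Q-network: maps a state vector to one Q-value per action.
    (Illustrative sketch; layer sizes are arbitrary choices, not part of the DQN definition.)"""

    def __init__(self, state_dim: int, num_actions: int, hidden_dim: int = 128):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),  # one output per action: Q(s, a; theta)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch_size, state_dim) -> Q-values: (batch_size, num_actions)
        return self.layers(state)
```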
Diagram: the basic flow within the DQN agent at decision time. The current state $s_t$ is fed into the Q-network, which outputs estimated Q-values for all actions. An action selection strategy (such as epsilon-greedy) then uses these Q-values to choose the action $a_t$ to take in the environment.
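At decision time, the agent simply queries the network and selects an action from its outputs. The sketch below shows epsilon-greedy selection under the assumption that the state is a single unbatched tensor; the function name is illustrative.

```python
import random
import torch

def select_action(q_network: torch.nn.Module, state: torch.Tensor,
                  epsilon: float, num_actions: int) -> int:
    """Epsilon-greedy action selection using the Q-network's current estimates."""
    if random.random() < epsilon:
        # Explore: choose a random action with probability epsilon
        return random.randrange(num_actions)
    with torch.no_grad():
        # Exploit: choose the action with the highest estimated Q-value
        q_values = q_network(state.unsqueeze(0))  # add a batch dimension
        return int(q_values.argmax(dim=1).item())
```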
How do we train this Q-network? We adapt the core idea from the Q-learning update rule. Recall the Bellman optimality equation, which expresses the relationship between the Q-value of the current state-action pair and the Q-values of the next state:
$$Q^*(s,a) = \mathbb{E}\left[\, R_{t+1} + \gamma \max_{a'} Q^*(S_{t+1}, a') \;\middle|\; S_t = s,\, A_t = a \right]$$

In standard Q-learning, we iteratively update our Q-table entries towards this target value. With DQN, we frame this as a supervised learning problem: we want our network's prediction, $Q(S_t, A_t; \theta)$, to approximate a target value derived from the Bellman equation.
For a given transition $(S_t, A_t, R_{t+1}, S_{t+1})$, the target value, $y_t$, is calculated using the reward received and the maximum estimated Q-value for the next state, $S_{t+1}$:
$$y_t = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a'; \theta)$$

Notice something important here: in this basic formulation, we are using the same Q-network (with parameters $\theta$) both to predict the Q-value for the current state-action pair, $Q(S_t, A_t; \theta)$, and to estimate the maximum Q-value for the next state, $\max_{a'} Q(S_{t+1}, a'; \theta)$, needed to compute the target $y_t$. As we will see later, this coupling can lead to instabilities during training.
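As a sketch, computing this target for a single transition might look like the following. The function name is an assumption, and the handling of terminal states (where no bootstrapped term is added) is a practical detail not shown in the equation above.

```python
import torch

def td_target(q_network: torch.nn.Module, reward: float,
              next_state: torch.Tensor, done: bool, gamma: float = 0.99) -> float:
    """y_t = R_{t+1} + gamma * max_a' Q(S_{t+1}, a'; theta) for one transition.

    Note that the same network (the same parameters theta) that produces the
    predictions also produces this target; this is the coupling discussed above.
    """
    if done:
        # No future value beyond a terminal state
        return reward
    with torch.no_grad():
        next_q = q_network(next_state.unsqueeze(0))   # shape: (1, num_actions)
        return reward + gamma * next_q.max().item()
```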
The network learns by minimizing the difference between its prediction $Q(S_t, A_t; \theta)$ and the target value $y_t$. A common choice for the loss function is the Mean Squared Error (MSE), averaged over a batch of transitions:
$$L(\theta) = \mathbb{E}\left[\left(y_t - Q(S_t, A_t; \theta)\right)^2\right]$$

Substituting the definition of $y_t$:

$$L(\theta) = \mathbb{E}\left[\left(R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a'; \theta) - Q(S_t, A_t; \theta)\right)^2\right]$$

We then use standard gradient-based optimizers (such as Adam or RMSprop) to compute the gradient of this loss with respect to the network parameters $\theta$, $\nabla_\theta L(\theta)$, and update the weights to reduce the error:

$$\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta)$$

where $\alpha$ is the learning rate.
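Putting the pieces together, one gradient update on a batch of transitions could look like the sketch below. The function signature, the use of 0/1 done flags, and the assumption that `actions` is a tensor of integer indices are illustrative choices; the core steps are the prediction, the target, the MSE loss, and the parameter update performed by the optimizer.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_network: torch.nn.Module, optimizer: torch.optim.Optimizer,
               states, actions, rewards, next_states, dones,
               gamma: float = 0.99) -> float:
    """One gradient step minimizing the MSE between Q(S_t, A_t; theta) and y_t.

    Assumed batch layout: states/next_states are float tensors of shape
    (batch, state_dim), actions is a long tensor of action indices, and
    rewards/dones are float tensors of shape (batch,).
    """
    # Predictions: Q-values of the actions actually taken
    q_values = q_network(states)                                   # (batch, num_actions)
    predictions = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Targets: R_{t+1} + gamma * max_a' Q(S_{t+1}, a'; theta),
    # computed with the same network and without gradient flow
    with torch.no_grad():
        max_next_q = q_network(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * max_next_q

    # theta <- theta - alpha * grad_theta L(theta), handled by the optimizer
    loss = F.mse_loss(predictions, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```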
This architecture allows the agent to learn Q-values even in high-dimensional state spaces by leveraging the representational power of deep neural networks. However, training this setup directly often encounters significant challenges related to correlated samples and moving target values. The next sections will introduce two fundamental techniques, Experience Replay and Target Networks, designed specifically to stabilize the training process of Deep Q-Networks.