In the previous chapter, we explored how function approximation helps Reinforcement Learning agents generalize knowledge across states, moving beyond the limitations of tabular methods for large state spaces. While linear function approximation offers improvements, many real-world problems involve highly complex relationships between states and values that linear models struggle to capture. Consider tasks like playing video games directly from screen pixels or controlling a robot based on high-dimensional sensor inputs. The state spaces here are enormous, and the optimal policy might depend on intricate, non-linear patterns in the input.
This is where deep neural networks come into play. Neural networks excel at learning complex, hierarchical features and non-linear mappings from high-dimensional inputs. By using a neural network to approximate the action-value function Q(s,a), we can potentially learn effective policies even in environments with visually rich or highly complex state representations.
We call this approach Deep Q-Learning, and the neural network used is often referred to as a Deep Q-Network (DQN). Instead of storing Q-values in a table or using a simple linear function, we use a neural network with parameters (weights and biases) denoted collectively by θ. This network takes a state representation s as input and outputs an estimate of the Q-value for each possible action a in that state. A common architecture outputs a vector of Q-values, one entry for each action:
Q(s,⋅;θ)≈Q∗(s,⋅)
Here, Q(s,⋅;θ) represents the network's output for state s given parameters θ, which aims to approximate the true optimal action-value function Q∗(s,⋅). For example, if an agent is processing an image from a game screen (the state s), the DQN might output the expected cumulative future reward for pressing left, pressing right, jumping, or firing.
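As a rough sketch, a network of this kind could be written in PyTorch as shown below. The QNetwork class, the layer sizes, and the CartPole-like example (a 4-dimensional state and 2 actions) are illustrative assumptions, not something prescribed above; for pixel inputs such as game screens, convolutional layers would typically replace the fully connected ones.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """A small fully connected Q-network: state in, one Q-value per action out."""

    def __init__(self, state_dim: int, n_actions: int, hidden_dim: int = 128):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),  # one output per action: Q(s, ·; θ)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch_size, state_dim) -> Q-values: (batch_size, n_actions)
        return self.layers(state)

# Illustrative usage: a 4-dimensional state and 2 actions (a CartPole-like task)
q_net = QNetwork(state_dim=4, n_actions=2)
state = torch.randn(1, 4)               # a single state, batched
q_values = q_net(state)                 # shape (1, 2): Q(s, ·; θ)
greedy_action = q_values.argmax(dim=1)  # action with the highest estimated Q-value
```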
The objective, then, is to train the network: we need to find parameters θ that make its output Q(s,a;θ) a good approximation of the optimal Q∗(s,a). How do we do this? We can leverage the principles of Q-learning and the Bellman optimality equation.
Recall the standard Q-learning update relies on the TD target: y = R + γ max_{a′} Q(S′, a′). This target represents an estimate of the optimal Q-value based on the reward received (R) and the maximum estimated future value from the next state (S′). In Deep Q-Learning, we treat training the network as a supervised learning problem. For a given transition (S, A, R, S′), the network predicts the current Q-value Q(S, A; θ). We compute the target value y using the reward R and the network's own estimate of the maximum Q-value for the next state S′.
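The snippet below sketches how this target might be computed for a batch of transitions, reusing the q_net defined in the previous snippet. The compute_targets name, the batch layout, and the dones flag (which zeroes out the future-value term when S′ is terminal, a standard convention not spelled out in the text above) are illustrative assumptions.

```python
import torch

def compute_targets(q_net, rewards, next_states, dones, gamma=0.99):
    """Compute TD targets y = R + γ max_a' Q(S', a'; θ) for a batch of transitions.

    rewards: (batch,), next_states: (batch, state_dim), dones: (batch,) of 0./1.
    """
    with torch.no_grad():                      # targets are treated as fixed labels
        next_q = q_net(next_states)            # (batch, n_actions)
        max_next_q = next_q.max(dim=1).values  # max_a' Q(S', a'; θ)
    # When S' is terminal (done == 1), there is no future value to bootstrap from.
    return rewards + gamma * (1.0 - dones) * max_next_q
```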
Ideally, we want our network's prediction Q(S,A;θ) to match this target y. We can define a loss function that measures the discrepancy between the prediction and the target. A common choice is the Mean Squared Error (MSE):
L(θ) = (y − Q(S, A; θ))²

Substituting the target, this becomes

L(θ) = ( R + γ max_{a′} Q(S′, a′; θ) − Q(S, A; θ) )²

where the first term inside the parentheses is the target Q-value y and the second is the predicted Q-value. Our goal is to minimize this loss function with respect to the network parameters θ. We can achieve this using gradient descent algorithms, such as Stochastic Gradient Descent (SGD) or its variants like Adam. By repeatedly sampling transitions (S, A, R, S′) and adjusting the network weights θ to reduce the loss, the network gradually learns to approximate the optimal action-value function.
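Putting these pieces together, a single update might look like the sketch below, which reuses the q_net and compute_targets helpers from the earlier snippets. The train_step structure and the Adam learning rate are illustrative choices; a practical agent would also apply the stabilization techniques discussed next.

```python
import torch
import torch.nn.functional as F

# Illustrative optimizer setup; the learning rate is an assumption, not a prescription.
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def train_step(q_net, optimizer, states, actions, rewards, next_states, dones, gamma=0.99):
    # Predicted Q-values for the actions actually taken: Q(S, A; θ)
    predicted_q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target values y = R + γ max_a' Q(S', a'; θ)
    targets = compute_targets(q_net, rewards, next_states, dones, gamma)

    # Mean squared error between predictions and targets: L(θ)
    loss = F.mse_loss(predicted_q, targets)

    # Adjust θ to reduce the loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```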
This combination holds immense promise. It allows us to apply Q-learning to problems that were previously intractable due to the sheer size and complexity of their state spaces. However, applying neural networks directly within the standard RL loop introduces its own set of challenges. The sequential, correlated nature of data generated by an agent interacting with an environment violates the independence assumptions often made in supervised learning. Furthermore, the fact that the target value y itself depends on the network parameters θ being updated can lead to instability during training. The following sections will discuss these challenges and introduce the core techniques developed for DQN, experience replay and target networks, which are designed to stabilize the learning process.