Having reviewed the fundamentals of MDPs, value functions, TD learning, and the necessity of function approximation, we now confront a significant challenge that arises when three of these elements are combined: the "Deadly Triad". The term refers to the potential for instability and divergence when simultaneously using:
- Function Approximation: Using parameterized functions (like linear combinations of features or deep neural networks) to represent value functions ($V(s;\theta)$ or $Q(s,a;\theta)$) instead of tabular methods. This is essential for large or continuous state/action spaces.
- Bootstrapping: Updating estimates based on subsequent estimates. TD methods like Q-learning and SARSA perform bootstrapping, as the update target includes the estimated value of the next state (e.g., $r + \gamma \max_{a'} Q(s', a'; \theta)$ in Q-learning). Dynamic programming methods also bootstrap.
- Off-Policy Learning: Learning the value function or policy for a target policy $\pi$ using data generated from a different behavior policy $b$. Q-learning is inherently off-policy because it learns the value of the greedy policy ($\max_{a'} Q(s', a')$) regardless of which action was actually taken by the behavior policy that generated the transition $(s, a, r, s')$. The sketch after this list shows where each of these ingredients enters a concrete update loop.
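To make the combination concrete, here is a minimal sketch in Python/NumPy of the first and third ingredients: a linear approximator $\hat{q}(s,a;\theta) = \theta^\top \phi(s,a)$ and off-policy data collection. All names (`phi`, `q_hat`, `behavior_action`, and so on) are illustrative, and the random feature table merely stands in for a real featurizer; the second ingredient, bootstrapping, shows up in the update rule discussed below.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_features = 20, 4, 8

# (1) Function approximation: q_hat(s, a; theta) = theta . phi(s, a).
# A random feature table stands in for a real featurizer; the key point is
# that all state-action pairs share the single parameter vector theta.
phi = rng.normal(size=(n_states, n_actions, n_features))
theta = np.zeros(n_features)

def q_hat(s, a, theta):
    """Estimated action value; updating theta for one (s, a) shifts all others."""
    return phi[s, a] @ theta

# (3) Off-policy learning: transitions are generated by a uniform-random
# behavior policy, while the learning target is the greedy target policy.
def behavior_action(s):
    return int(rng.integers(n_actions))

def greedy_value(s, theta):
    """max over a' of q_hat(s, a'; theta), the quantity the bootstrapped target uses."""
    return max(q_hat(s, a, theta) for a in range(n_actions))
```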
Each of these components offers significant advantages: function approximation enables generalization, bootstrapping improves computational and data efficiency compared to Monte Carlo methods, and off-policy learning provides the flexibility to learn from logged data or exploratory behavior policies. Yet their combination can lead to unpredictable and unstable learning dynamics: the value function estimates can oscillate wildly or diverge towards infinity.
Why the Instability?
The core issue lies in the interaction between these three elements. Let's break down how they contribute to potential divergence, particularly in the context of Q-learning with a function approximator $\hat{q}(s, a; \theta)$:
- Function Approximation Errors: Function approximators generalize but introduce errors. An update to the value estimate for a specific state-action pair $(s, a)$ using parameters $\theta$ will implicitly change the estimated values for other, potentially unrelated, state-action pairs due to shared parameters.
- Off-Policy Data Distribution: The behavior policy $b$ might explore parts of the state-action space differently than the target policy $\pi$. When we use data $(s, a, r, s')$ collected under $b$ to update $\hat{q}(s, a; \theta)$ towards a target derived from $\pi$ (like $r + \gamma \max_{a'} \hat{q}(s', a'; \theta)$), we are evaluating the target policy on a distribution of states and actions it would rarely encounter if it were actually being executed. The function approximator may then be forced to extrapolate values for these "off-policy" actions, leading to potentially large errors in the TD target.
- Bootstrapping Amplifies Errors: The TD update relies on the current estimate $\hat{q}(s', a'; \theta)$ to form the target. If this estimate is inaccurate (due to function approximation error or extrapolation from off-policy data), the error is "bootstrapped" into the update for $\hat{q}(s, a; \theta)$ via the semi-gradient update rule:
$$\theta \leftarrow \theta + \alpha \Big[ r + \gamma \max_{a'} \hat{q}(s', a'; \theta) - \hat{q}(s, a; \theta) \Big] \nabla_\theta \hat{q}(s, a; \theta)$$
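Continuing the illustrative sketch from above (it reuses the hypothetical `q_hat`, `greedy_value`, `phi`, and `behavior_action` definitions), the update translates almost line for line into code; for a linear approximator, the gradient $\nabla_\theta \hat{q}(s, a; \theta)$ is simply the feature vector $\phi(s, a)$:

```python
def semi_gradient_q_update(theta, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # (2) Bootstrapping: the target reuses the current estimate via
    # max_a' q_hat(s', a'; theta), approximation errors included.
    target = r + gamma * greedy_value(s_next, theta)
    td_error = target - q_hat(s, a, theta)
    # Semi-gradient step: for linear q_hat, grad_theta q_hat(s, a; theta) = phi[s, a].
    return theta + alpha * td_error * phi[s, a]

# One off-policy transition: the action comes from the behavior policy, the
# bootstrapped target from the greedy policy (reward and next state are placeholders).
s = 0
a = behavior_action(s)
r, s_next = 1.0, int(rng.integers(n_states))
theta = semi_gradient_q_update(theta, s, a, r, s_next)
```

Note that the gradient is taken only through $\hat{q}(s, a; \theta)$ and not through the bootstrapped target, which is why this is called a semi-gradient method rather than a true gradient method.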
The max operator in Q-learning can be particularly problematic, as it actively seeks out the largest (and therefore potentially overestimated) Q-value for the next state. If the function approximator erroneously assigns a high value to an action $a'$ rarely or never taken by the behavior policy at state $s'$, this overestimated value will be used in the target, potentially increasing the value estimate for $\hat{q}(s, a; \theta)$. This creates a feedback loop where approximation errors and off-policy extrapolation are amplified by bootstrapping, potentially leading to divergence.
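A quick, self-contained numerical illustration of this optimism (the action count and noise scale here are arbitrary): even if every individual estimate is unbiased, the max over noisy estimates is systematically too high, so the bootstrapped target inherits an upward bias.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical next state s' with four actions whose true values are all 0,
# but whose estimates carry zero-mean noise (generalization/extrapolation error).
noisy_estimates = rng.normal(loc=0.0, scale=1.0, size=(100_000, 4))

print(noisy_estimates.mean())              # ~0.00: each estimate is unbiased on average
print(noisy_estimates.max(axis=1).mean())  # ~1.03: the max over estimates is biased upward
```

In the tabular case this optimism washes out over many visits, but under the triad the inflated targets feed back into shared parameters, which is exactly the loop described next.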
Consider a simple visualization of this feedback loop:
The interaction between function approximation, bootstrapping, and off-policy data can create a cycle: approximation errors are incorporated into bootstrapped targets, the flawed targets produce flawed updates, and those updates can further increase the errors, potentially causing value estimates to diverge.
This instability was a significant obstacle in early attempts to combine reinforcement learning with powerful non-linear function approximators like neural networks. It highlighted that simply replacing the table in tabular Q-learning with a neural network often fails dramatically. Understanding the Deadly Triad is fundamental because many advanced techniques discussed in subsequent chapters, particularly Deep Q-Networks (DQN) and its variants, were specifically developed to introduce mechanisms that stabilize learning in the face of this challenge. Techniques like experience replay and target networks, which we will cover next, are direct responses to the problems posed by the Deadly Triad.