While Dynamic Programming methods such as Value Iteration and Policy Iteration provide a formal way to find optimal policies in Markov Decision Processes (MDPs), they rely on a complete model of the environment: every transition probability $P(s' \mid s, a)$ and reward $R(s, a, s')$ must be known. In many practical scenarios, such a model is unavailable. This is where Temporal Difference (TD) learning methods become indispensable. TD learning allows agents to learn directly from raw experience, interacting with the environment (or using logged interactions) without explicit knowledge of its dynamics.
TD methods blend ideas from both Monte Carlo (MC) methods and Dynamic Programming (DP). Like MC, TD methods learn from samples of experience. However, like DP, TD methods bootstrap: they update estimates based on other learned estimates rather than waiting for the final outcome of an episode as MC methods do. This typically leads to faster learning, especially in long or continuing tasks.
The simplest TD method, often called TD(0) or one-step TD, focuses on the prediction problem: estimating the state-value function $V^\pi(s)$ for a given policy $\pi$. After taking action $A_t$ in state $S_t$ and observing the reward $R_{t+1}$ and the next state $S_{t+1}$, TD(0) updates the value estimate for state $S_t$ using the following rule:
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$

Let's break down this update:

- $\alpha$ is the step-size (learning rate), controlling how far the estimate moves toward the target.
- $\gamma$ is the discount factor weighting future value.
- $R_{t+1} + \gamma V(S_{t+1})$ is the TD target, a one-step, bootstrapped estimate of the return from $S_t$.
- The bracketed quantity, $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$, is the TD error: the difference between the target and the current estimate.
The TD(0) update adjusts the current value estimate $V(S_t)$ in the direction suggested by the TD error. Because it uses $V(S_{t+1})$ (an estimate) in its target, it introduces some bias compared to MC methods (which use the actual full return $G_t$). However, TD updates typically have significantly lower variance than MC updates, as the target depends only on the next reward and the next state's value estimate, not the entire sequence of future rewards in an episode.
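This update is straightforward to implement. Below is a minimal sketch of tabular TD(0) prediction for a fixed policy; the `env` interface (a Gym-style `reset()`/`step()` pair) and the `policy` function are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def td0_prediction(env, policy, num_states, num_episodes=500, alpha=0.1, gamma=0.99):
    """Tabular TD(0) prediction: estimate V^pi for a fixed policy.

    Assumes a Gym-style environment where reset() returns a state index and
    step(action) returns (next_state, reward, done, info).
    """
    V = np.zeros(num_states)  # one value estimate per state

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)                       # sample A_t from pi(.|S_t)
            next_state, reward, done, _ = env.step(action)

            # TD target: one-step bootstrapped estimate of the return.
            # Terminal states contribute no future value.
            target = reward + gamma * V[next_state] * (not done)
            td_error = target - V[state]                 # delta_t
            V[state] += alpha * td_error                 # move V(S_t) toward the target

            state = next_state
    return V
```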
To move from prediction to control (finding an optimal policy), we need to estimate action-value functions, $Q(s, a)$. SARSA is an on-policy TD control algorithm that learns $Q(s, a)$ directly from experience. The name "SARSA" stands for the tuple of information involved in the update: $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$.
The agent starts in state $S_t$, chooses action $A_t$ (according to its current policy, often $\epsilon$-greedy), observes reward $R_{t+1}$ and next state $S_{t+1}$, and then chooses the next action $A_{t+1}$ (again, according to its policy). The SARSA update rule is then applied to the Q-value of the initial state-action pair $(S_t, A_t)$:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$

Notice the similarity to the TD(0) update. The TD target is now $R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$, using the Q-value of the next state and the next action actually taken. Because the update depends on the action $A_{t+1}$ chosen by the current policy, SARSA is an on-policy algorithm. It learns the value of the policy it is currently following, including the effects of its exploration strategy (such as $\epsilon$-greedy choices).
The SARSA update process relies on the sequence of state, action, reward, next state, and, importantly, the next action selected by the policy, as the sketch below illustrates.
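The following is a minimal sketch of SARSA with tabular Q-values and an $\epsilon$-greedy behavior policy. The environment interface and helper names mirror the assumptions of the TD(0) sketch above rather than any specific library.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def sarsa_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1, rng=None):
    """Run one episode of SARSA, updating Q (shape: num_states x num_actions) in place."""
    if rng is None:
        rng = np.random.default_rng()

    state = env.reset()
    action = epsilon_greedy(Q, state, epsilon, rng)      # choose A_t before the loop
    done = False
    while not done:
        next_state, reward, done, _ = env.step(action)
        # Choose A_{t+1} with the *same* behavior policy; using this action in the
        # target is exactly what makes SARSA on-policy.
        next_action = epsilon_greedy(Q, next_state, epsilon, rng)

        target = reward + gamma * Q[next_state, next_action] * (not done)
        Q[state, action] += alpha * (target - Q[state, action])

        state, action = next_state, next_action
```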
Q-Learning is another fundamental TD control algorithm, but with a significant difference: it is off-policy. This means it learns the optimal action-value function $Q^*(s, a)$ directly, regardless of the policy being followed to generate the experience (the behavior policy), as long as that policy explores sufficiently.
The Q-Learning update rule is:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right]$$

The crucial difference lies in the TD target: $R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')$. Instead of using the Q-value of the next action actually taken ($A_{t+1}$), Q-Learning uses the Q-value of the action $a'$ that maximizes the current Q-function estimate in the next state $S_{t+1}$. This implicitly assumes that the optimal (greedy) action will be taken from state $S_{t+1}$ onwards, effectively learning about the optimal policy $\pi^*$ while potentially following a different, more exploratory behavior policy (such as $\epsilon$-greedy).
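A corresponding sketch of the Q-Learning update is shown below. It differs from the SARSA sketch only in the target, which bootstraps from the greedy action in $S_{t+1}$ rather than the action the behavior policy actually takes next; the environment interface is again an assumption for illustration.

```python
import numpy as np

def q_learning_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1, rng=None):
    """Run one episode of tabular Q-Learning, updating Q (num_states x num_actions) in place."""
    if rng is None:
        rng = np.random.default_rng()

    state = env.reset()
    done = False
    while not done:
        # Behavior policy: epsilon-greedy exploration.
        if rng.random() < epsilon:
            action = int(rng.integers(Q.shape[1]))
        else:
            action = int(np.argmax(Q[state]))

        next_state, reward, done, _ = env.step(action)

        # Off-policy target: bootstrap from the best action in S_{t+1},
        # regardless of which action the behavior policy takes next.
        target = reward + gamma * np.max(Q[next_state]) * (not done)
        Q[state, action] += alpha * (target - Q[state, action])

        state = next_state
```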
This off-policy nature makes Q-Learning powerful. It can learn from past experience generated by different policies, or even from observing human demonstrators. However, combining off-policy learning, bootstrapping (inherent in TD methods), and function approximation (necessary for large state spaces) can lead to instability, a phenomenon known as the 'Deadly Triad', which we will discuss shortly.
TD methods (TD(0), SARSA, Q-Learning) provide efficient, model-free ways to learn value functions directly from experience. They update estimates based on other estimates (bootstrapping), allowing learning before the end of an episode. SARSA is on-policy, learning the value of the behavior policy, while Q-Learning is off-policy, directly estimating the optimal value function.
These tabular methods are foundational. However, their reliance on lookup tables for value functions restricts them to problems with relatively small, discrete state and action spaces. To handle the vast state spaces encountered in problems such as robotic control from camera images or playing complex games, we need to generalize these value functions using function approximators, such as linear functions or deep neural networks. The next section introduces this concept, setting the stage for the deep reinforcement learning techniques that form the core of this course and highlighting the challenges that arise when combining TD learning with powerful function approximators.