In the previous chapter on Monte Carlo (MC) methods, we learned how an agent could estimate value functions by averaging the returns observed after completing entire episodes. Think of it like judging the quality of a multi-day hike only after you've reached the final destination and tallied up all the experiences. This approach works, but it has a significant limitation: you only learn after the episode is finished. For very long or even continuous tasks (tasks that never end), waiting for a "final outcome" is impractical or impossible.
Temporal-Difference (TD) learning provides a different approach. Instead of waiting for the end of an episode to know the final return $G_t$, TD methods update the value estimate for a state $S_t$ based on the immediate reward $R_{t+1}$ received and the current estimated value of the next state $S_{t+1}$.
Consider being at state $S_t$, taking action $A_t$, receiving reward $R_{t+1}$, and landing in state $S_{t+1}$. The simplest TD method, TD(0), then nudges $V(S_t)$ toward the target $R_{t+1} + \gamma V(S_{t+1})$:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$

Here $\alpha$ is the step size, $\gamma$ is the discount factor, and the bracketed term, the difference between the target and the current estimate, is called the TD error.
This process of updating an estimate based partially on another learned estimate is called bootstrapping. TD learning bootstraps because the update to $V(S_t)$ relies on the existing estimate $V(S_{t+1})$. It's like adjusting your estimated travel time from City A to City Z based on reaching City B and using your current estimate of the travel time from City B to City Z, rather than waiting until you actually arrive at City Z.
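To make the mechanics concrete, here is a minimal sketch of a single TD(0) update in Python. The names (`td0_update`, `alpha`, `gamma`, `V`) and the tabular `defaultdict` representation are illustrative choices, not a specific library's API.

```python
import collections

# Illustrative constants; reasonable defaults, not prescribed values.
alpha = 0.1   # step size: how far each update moves the estimate
gamma = 0.99  # discount factor for future rewards

# Tabular value estimates; unseen states default to 0.0.
V = collections.defaultdict(float)

def td0_update(V, state, reward, next_state, done):
    """Apply one TD(0) update: nudge V(state) toward R + gamma * V(next_state)."""
    # Bootstrapping: the target uses the current estimate V(next_state),
    # unless next_state is terminal, in which case its value is 0 by definition.
    target = reward + (0.0 if done else gamma * V[next_state])
    td_error = target - V[state]   # the "temporal difference"
    V[state] += alpha * td_error
    return td_error
```

Note that the update touches only the current transition; no record of earlier states or rewards is needed.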
Comparison of update timing: MC methods wait until the episode ends (the terminal state $S_T$) to calculate the actual return $G_t$ and update the values of states visited in that episode. TD methods update the value of a state (e.g., $V(S_0)$) at the very next step ($t+1$), using the observed reward ($R_1$) and the current estimate of the next state's value ($V(S_1)$).
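The timing difference can be sketched as two episode loops. This assumes a hypothetical Gym-style environment whose `step(action)` returns `(next_state, reward, done)`, a `policy` callable, and the `td0_update` function and constants from the previous sketch; the MC version shown is an every-visit, constant-$\alpha$ variant.

```python
def run_episode_mc(env, V, policy):
    """Every-visit, constant-alpha MC: updates happen only after the episode ends."""
    trajectory = []
    state, done = env.reset(), False
    while not done:
        next_state, reward, done = env.step(policy(state))
        trajectory.append((state, reward))
        state = next_state
    # Only now, at the terminal state S_T, can the returns G_t be formed.
    G = 0.0
    for state, reward in reversed(trajectory):
        G = reward + gamma * G                 # accumulate the return backwards
        V[state] += alpha * (G - V[state])     # update toward the actual return

def run_episode_td(env, V, policy):
    """TD(0): each state's value is updated at the very next time step."""
    state, done = env.reset(), False
    while not done:
        next_state, reward, done = env.step(policy(state))
        td0_update(V, state, reward, next_state, done)  # update immediately
        state = next_state
```

The MC loop must store the whole trajectory and defer all learning to the end; the TD loop learns as it goes and keeps nothing but the value table.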
This ability to learn from each step, rather than waiting for the end of an episode, gives TD methods several advantages:

- They learn online, updating estimates after every step rather than only once an episode finishes.
- They apply naturally to continuing tasks that never terminate, where waiting for a final return is impossible.
- They need less memory and computation per update, since there is no need to store complete episode trajectories.
- In practice, they often converge faster than constant-step-size MC methods on stochastic tasks, because each update uses a lower-variance (though biased) target.
TD learning combines ideas from both Monte Carlo methods and Dynamic Programming (DP). Like MC, it learns directly from raw experience without requiring a model of the environment's dynamics (the transition probabilities $p(s', r \mid s, a)$). Like DP, it uses bootstrapping to update estimates based on other estimates. This blend makes TD methods central to modern reinforcement learning.
In the following sections, we will formalize this idea, starting with the TD(0) algorithm for estimating state values, and then move on to control algorithms like SARSA and Q-learning that learn action values.