As introduced, Temporal-Difference (TD) learning allows updates to happen at each step, rather than waiting for an episode to end. The simplest TD method is called TD(0), or one-step TD, and it's used for the prediction problem: estimating the state-value function $V_\pi$ for a given policy $\pi$.
Recall that Monte Carlo methods update the value estimate $V(S_t)$ for a state $S_t$ based on the entire observed return $G_t$ starting from that state:
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]$$

where $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$ is the actual, observed cumulative discounted reward until the episode terminates. Calculating $G_t$ requires waiting until the end of the episode.
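For contrast with the TD update that follows, here is a minimal Python sketch of this Monte Carlo update; the `episode` list of `(state, reward)` pairs and the value table `V` are illustrative placeholders, not objects defined in the text:

```python
# Monte Carlo prediction: V can only be updated once the episode has ended,
# because the return G_t needs every reward up to termination.
# `episode` is a list of (S_t, R_{t+1}) pairs; `V` maps states to estimates.
def mc_update(V, episode, alpha=0.1, gamma=0.99):
    G = 0.0
    # Iterate backwards so G_t can be accumulated as R_{t+1} + gamma * G_{t+1}.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])   # every-visit MC update toward G_t
    return V
```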
TD(0), on the other hand, performs an update immediately after observing the transition from state $S_t$ to state $S_{t+1}$ upon taking an action $A_t$ and receiving reward $R_{t+1}$. Instead of using the full return $G_t$, TD(0) uses an estimated return. This estimate is formed by combining the immediate reward $R_{t+1}$ with the current estimate of the value of the next state, $V(S_{t+1})$. This estimated return, $R_{t+1} + \gamma V(S_{t+1})$, is called the TD target.
The TD(0) update rule for $V(S_t)$ is:
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$

Let's break this down:

- $\alpha$ is the step-size (learning rate) parameter, controlling how far the estimate moves toward the target.
- $R_{t+1}$ is the reward received on the transition, and $\gamma$ is the discount factor.
- $R_{t+1} + \gamma V(S_{t+1})$ is the TD target introduced above: an estimate of the return built from the current value estimate of the next state.
- The difference between the TD target and the current estimate, $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$, is called the TD error.

The update rule can then be written more compactly using the TD error:
$$V(S_t) \leftarrow V(S_t) + \alpha \delta_t$$

This means we adjust the current value $V(S_t)$ in the direction suggested by the TD error. If the TD target is higher than $V(S_t)$, meaning the transition led to a better-than-expected outcome (based on $R_{t+1}$ and $V(S_{t+1})$), we increase $V(S_t)$. If it's lower, we decrease $V(S_t)$.
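As a concrete illustration, a single TD(0) update could be sketched as follows; the dictionary `V` and the function name are illustrative, and terminal states are assumed to have value zero:

```python
# One TD(0) update, applied immediately after observing the transition
# (s, r, s_next); no waiting for the episode to finish.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99, terminal=False):
    target = r if terminal else r + gamma * V[s_next]   # TD target: R + gamma * V(S')
    delta = target - V[s]                                # TD error delta_t
    V[s] += alpha * delta                                # move V(s) toward the target
    return delta
```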
The core idea that distinguishes TD(0) from MC methods is bootstrapping. TD(0) updates its estimate $V(S_t)$ based partly on another learned estimate, $V(S_{t+1})$. It doesn't wait for the final outcome $G_t$ but instead uses the currently available estimate of the future as a stand-in. This allows TD(0) to learn online (updating after each step) and from incomplete sequences, making it applicable to continuing tasks where episodes might never end.
Here's a summary of the algorithm for estimating $V \approx V_\pi$:

1. Input: the policy $\pi$ to be evaluated; choose a step size $\alpha \in (0, 1]$ and discount factor $\gamma$.
2. Initialize $V(s)$ arbitrarily for all states, with $V(\text{terminal}) = 0$.
3. For each episode, initialize the starting state $S$, then repeat for each step until $S$ is terminal:
   - Select action $A$ according to $\pi$ for state $S$.
   - Take action $A$; observe reward $R$ and next state $S'$.
   - Update $V(S) \leftarrow V(S) + \alpha \left[ R + \gamma V(S') - V(S) \right]$.
   - Set $S \leftarrow S'$.
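A rough Python version of this loop is sketched below. The `env` object with `reset()` and `step(action)` methods (in the style of Gymnasium) and the `policy(state)` function are assumed interfaces for the sake of illustration, not something defined in this section:

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0) prediction: estimate V_pi for a fixed policy."""
    V = defaultdict(float)   # V(s) starts at 0 for every state, including terminals
    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = policy(state)                                  # A_t from pi
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # The TD target bootstraps from the current estimate V(S_{t+1});
            # terminal states contribute no future value.
            target = reward if terminated else reward + gamma * V[next_state]
            V[state] += alpha * (target - V[state])                 # TD(0) update
            state = next_state
    return V
```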
This process iteratively adjusts the value estimates based on the observed transitions and rewards, eventually converging towards the true value function $V_\pi$ under standard conditions: every state must continue to be visited, and the learning rate $\alpha$ must decay appropriately (for example, step sizes satisfying $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$). TD(0) forms the foundation for more complex TD control methods like SARSA and Q-learning, which we will explore next.
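As a closing note on the step-size condition: one simple schedule that satisfies it is a per-state step size of $1/N(s)$, where $N(s)$ counts visits to state $s$. A small sketch of such a decaying-step variant of the update (again with illustrative names):

```python
from collections import defaultdict

def td0_update_decaying(V, visit_counts, s, r, s_next, gamma=0.99, terminal=False):
    # alpha_n = 1/n for the n-th visit to s: these step sizes sum to infinity
    # while their squares sum to a finite value, matching the decay condition above.
    visit_counts[s] += 1
    alpha = 1.0 / visit_counts[s]
    target = r if terminal else r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
```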