As introduced, Temporal-Difference (TD) learning allows updates to happen at each step, rather than waiting for an episode to end. The simplest TD method is called TD(0), or one-step TD, and it's used for the prediction problem: estimating the state-value function Vπ for a given policy π.
Recall that Monte Carlo methods update the value estimate V(St) for a state St based on the entire observed return Gt starting from that state:
V(St)←V(St)+α[Gt−V(St)]
where Gt=Rt+1+γRt+2+γ2Rt+3+… is the actual, observed cumulative discounted reward until the episode terminates. Calculating Gt requires waiting until the end of the episode.
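For reference, here is a minimal sketch of how such a Monte Carlo return could be computed once an episode has finished. The function name and the example reward list are illustrative, not from the text:

```python
def monte_carlo_return(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
    given the full list of rewards observed after time t.
    This is only possible after the episode has terminated."""
    g = 0.0
    # Accumulate backwards so each reward is discounted the correct number of times.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: rewards observed after S_t until termination.
print(monte_carlo_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```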
TD(0), on the other hand, performs an update immediately after a single transition: in state St the agent takes action At, receives reward Rt+1, and moves to state St+1. Instead of using the full return Gt, TD(0) uses an estimated return, formed by combining the immediate reward Rt+1 with the current estimate of the value of the next state, V(St+1). This estimated return, Rt+1+γV(St+1), is called the TD target.
The TD(0) update rule for V(St) is:
V(St)←V(St)+α[Rt+1+γV(St+1)−V(St)]
Let's break this down:
- V(St): The current estimate of the value of the state visited at time t.
- α: The learning rate, a small positive number determining the step size of the update.
- Rt+1: The reward received immediately after transitioning from St.
- γ: The discount factor, weighting future rewards.
- V(St+1): The current estimate of the value of the next state, St+1.
- Rt+1+γV(St+1): The TD target. This is the crucial part. It's an estimate of the return from state St. It uses the actual immediate reward Rt+1 and then relies on the existing estimate V(St+1) as a stand-in for all rewards from St+1 onward.
- [Rt+1+γV(St+1)−V(St)]: This term is known as the TD error, often denoted δt.
δt=Rt+1+γV(St+1)−V(St)
The TD error measures the difference between the TD target (the new, one-step estimate of the value of St) and the current value estimate V(St). It represents how "surprising" the outcome of the transition was compared to what the current estimate predicted.
The update rule can then be written more compactly using the TD error:
V(St)←V(St)+αδt
This means we adjust the current value V(St) in the direction suggested by the TD error. If the TD target is higher than V(St), meaning the transition led to a better-than-expected outcome (based on Rt+1 and V(St+1)), we increase V(St). If it's lower, we decrease V(St).
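To make the update concrete, here is a minimal sketch of a single TD(0) update written in terms of the TD error. It assumes V is a simple dictionary from states to value estimates; the function and variable names are illustrative:

```python
def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """Apply one TD(0) update to V[s] after observing (s, r, s_next).

    The TD target bootstraps from V[s_next]; if s_next is terminal,
    its value is taken to be 0.
    """
    target = r + (0.0 if terminal else gamma * V[s_next])  # TD target: R_{t+1} + gamma*V(S_{t+1})
    delta = target - V[s]                                   # TD error delta_t
    V[s] += alpha * delta                                   # move V(s) toward the target
    return delta

# Example: a better-than-expected transition raises V(s).
V = {"s": 0.0, "s_next": 1.0}
delta = td0_update(V, "s", r=0.5, s_next="s_next", alpha=0.1, gamma=0.9)
print(delta, V["s"])  # delta = 0.5 + 0.9*1.0 - 0.0 = 1.4, so V["s"] becomes 0.14
```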
The Bootstrapping Property
The core idea that distinguishes TD(0) from MC methods is bootstrapping. TD(0) updates its estimate V(St) based partly on another learned estimate, V(St+1). It doesn't wait for the final outcome (Gt) but instead uses the currently available estimate of the future as a stand-in. This allows TD(0) to learn online (updating after each step) and from incomplete sequences, making it applicable to continuing tasks where episodes might never end.
TD(0) Prediction Algorithm
Here's a summary of the algorithm for estimating Vπ≈V; a code sketch follows the outline:
- Initialize V(s) arbitrarily for all states s (e.g., V(s)=0).
- Repeat for each episode:
a. Initialize S (first state of the episode).
b. Repeat for each step of the episode:
i. Choose action A according to the policy π for state S.
ii. Take action A, observe reward R and next state S′.
iii. Calculate the TD error: δ←R+γV(S′)−V(S).
iv. Update the value function: V(S)←V(S)+αδ.
v. Update the state: S←S′.
c. Until S is a terminal state.
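Translated into code, the outline above might look like the following sketch. The environment and policy interfaces (env.reset(), env.step(), policy(state)) are assumptions made for illustration, loosely following a Gym-style API, and are not defined in this text:

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=0.99):
    """Estimate V_pi with TD(0).

    Assumed interfaces (illustrative only):
      - env.reset() -> initial state
      - env.step(action) -> (next_state, reward, done)
      - policy(state) -> action sampled from the policy pi being evaluated
    """
    V = defaultdict(float)  # V(s) initialized to 0 for all states

    for _ in range(num_episodes):
        s = env.reset()                      # a. initialize S
        done = False
        while not done:                      # b. loop over steps of the episode
            a = policy(s)                    # i.   choose A according to pi for state S
            s_next, r, done = env.step(a)    # ii.  take A, observe R and S'
            target = r if done else r + gamma * V[s_next]
            delta = target - V[s]            # iii. TD error
            V[s] += alpha * delta            # iv.  update V(S)
            s = s_next                       # v.   S <- S'
    return V
```

Using a defaultdict mirrors step 1 of the outline (initializing V(s)=0 for every state) without having to enumerate the state space up front; treating the value of a terminal state as 0 in the target matches the convention used throughout this section.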
This process iteratively adjusts the value estimates based on the observed transitions and rewards, eventually converging towards the true value function Vπ under certain conditions (like appropriate decay of the learning rate α). TD(0) forms the foundation for more complex TD control methods like SARSA and Q-learning, which we will explore next.