Temporal-Difference (TD) learning allows updates to happen at each step, rather than waiting for an episode to end. The simplest TD method is called TD(0), or one-step TD, and it is used for the prediction problem: estimating the state-value function $V_\pi$ for a given policy $\pi$.

Recall that Monte Carlo methods update the value estimate $V(S_t)$ for a state $S_t$ based on the entire observed return $G_t$ starting from that state:

$$ V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)] $$

where $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$ is the actual, observed cumulative discounted reward until the episode terminates. Calculating $G_t$ requires waiting until the end of the episode.

TD(0), on the other hand, performs an update immediately after taking action $A_t$ in state $S_t$, receiving reward $R_{t+1}$, and observing the next state $S_{t+1}$. Instead of using the full return $G_t$, TD(0) uses an estimated return, formed by combining the immediate reward $R_{t+1}$ with the current estimate of the value of the next state, $V(S_{t+1})$. This estimated return, $R_{t+1} + \gamma V(S_{t+1})$, is called the TD target.

The TD(0) update rule for $V(S_t)$ is:

$$ V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)] $$

Let's break this down:

- $V(S_t)$: The current estimate of the value of the state visited at time $t$.
- $\alpha$: The learning rate, a small positive number determining the step size of the update.
- $R_{t+1}$: The reward received immediately after transitioning from $S_t$.
- $\gamma$: The discount factor, weighting future rewards.
- $V(S_{t+1})$: The current estimate of the value of the next state, $S_{t+1}$.
- $R_{t+1} + \gamma V(S_{t+1})$: The TD target, the key quantity in the update. It is an estimate of the return from state $S_t$ that uses the actual immediate reward $R_{t+1}$ and then relies on the existing value estimate $V(S_{t+1})$ to stand in for all subsequent rewards.
- $[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$: This term is known as the TD error, often denoted $\delta_t$:

$$ \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) $$

The TD error measures the difference between the TD target (a fresh estimate of the value of $S_t$) and the current estimate $V(S_t)$. It represents how "surprising" the outcome of the transition was compared to the current estimate.

The update rule can then be written more compactly using the TD error:

$$ V(S_t) \leftarrow V(S_t) + \alpha \delta_t $$

This means we adjust the current value $V(S_t)$ in the direction suggested by the TD error. If the TD target is higher than $V(S_t)$, meaning the transition led to a better-than-expected outcome (based on $R_{t+1}$ and $V(S_{t+1})$), we increase $V(S_t)$. If it is lower, we decrease $V(S_t)$.

### The Bootstrapping Property

The core idea that distinguishes TD(0) from MC methods is bootstrapping. TD(0) updates its estimate $V(S_t)$ based partly on another learned estimate, $V(S_{t+1})$. It doesn't wait for the final outcome $G_t$ but instead uses the currently available estimate of the future as a stand-in. This allows TD(0) to learn online (updating after each step) and from incomplete sequences, making it applicable to continuing tasks where episodes might never end.
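To make the update concrete, here is a minimal sketch of a single TD(0) update step in Python. The names (`td0_update`, the value table `V`, `alpha`, `gamma`, the `terminal` flag) are illustrative choices, not part of any particular library.

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one TD(0) update to the value table V after observing (S_t, R_{t+1}, S_{t+1})."""
    # By definition, a terminal state has value 0, so the bootstrap term vanishes there.
    v_next = 0.0 if terminal else V[s_next]
    td_target = r + gamma * v_next   # R_{t+1} + gamma * V(S_{t+1})
    td_error = td_target - V[s]      # delta_t
    V[s] += alpha * td_error         # V(S_t) <- V(S_t) + alpha * delta_t
    return td_error

# Example: a single transition in some hypothetical task.
V = defaultdict(float)               # V(s) = 0 for unseen states
delta = td0_update(V, s="A", r=1.0, s_next="B")
print(V["A"], delta)                 # 0.1 and 1.0: better than expected, so V("A") rises
```

Note that dropping the bootstrap term on the final transition of an episode is just the convention that terminal states have value zero.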
### TD(0) Prediction Algorithm

Here's a summary of the algorithm for estimating $V_\pi \approx V$:

- Initialize $V(s)$ arbitrarily for all states $s$ (e.g., $V(s) = 0$).
- Repeat for each episode:
  a. Initialize $S$ (first state of the episode).
  b. Repeat for each step of the episode:
     i. Choose action $A$ according to the policy $\pi$ for state $S$.
     ii. Take action $A$, observe reward $R$ and next state $S'$.
     iii. Calculate the TD error: $\delta \leftarrow R + \gamma V(S') - V(S)$.
     iv. Update the value function: $V(S) \leftarrow V(S) + \alpha \delta$.
     v. Update the state: $S \leftarrow S'$.
  c. Until $S$ is a terminal state.

This process iteratively adjusts the value estimates based on the observed transitions and rewards, eventually converging towards the true value function $V_\pi$ under certain conditions (like appropriate decay of the learning rate $\alpha$). TD(0) forms the foundation for more complex TD control methods like SARSA and Q-learning, which we will explore next.
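Before moving on, here is a minimal sketch of the full prediction loop above in Python. The environment interface (`reset()` returning a state, `step(action)` returning `(next_state, reward, done)`) and the `policy(state)` function are assumptions made for illustration, not any particular library's API.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Estimate V_pi with TD(0) by following `policy` in `env`.

    Assumes a hypothetical environment with `reset() -> state` and
    `step(action) -> (next_state, reward, done)`.
    """
    V = defaultdict(float)               # Initialize V(s) = 0 for all states.
    for _ in range(num_episodes):
        s = env.reset()                  # a. Initialize S.
        done = False
        while not done:                  # b. Loop over the steps of the episode.
            a = policy(s)                # i.   Choose A from pi(S).
            s_next, r, done = env.step(a)  # ii. Take A, observe R and S'.
            v_next = 0.0 if done else V[s_next]
            delta = r + gamma * v_next - V[s]  # iii. TD error.
            V[s] += alpha * delta              # iv.  Update V(S).
            s = s_next                         # v.   S <- S'.
    return V
```

Using a `defaultdict(float)` implements the "initialize $V(s)$ arbitrarily" step by treating every unseen state as having value 0; with an appropriately decaying `alpha`, the estimates converge towards $V_\pi$ as discussed above.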