In the previous chapter, we examined Monte Carlo (MC) methods, which update value estimates only after an entire episode completes. This means no learning occurs for an episode until its final outcome is known. Temporal-Difference (TD) learning offers a different approach, enabling updates after each step.
TD methods learn directly from experience, just as MC methods do. Unlike MC, however, they update value estimates partly on the basis of other current estimates, without waiting for the episode's final outcome. This technique is known as bootstrapping. It allows TD methods to learn from incomplete episodes and often leads to faster convergence than MC methods in practice.
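To make the bootstrapping idea concrete, the sketch below applies a single tabular TD(0) update after one step of experience, assuming state values are stored in a Python dictionary. The function name, state labels, and parameter values are illustrative choices for this sketch, not part of any specific library.

```python
# Minimal sketch of one TD(0) update on a tabular value function.
# Names, states, and parameter values are illustrative only.

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """Shift V[state] toward the bootstrapped target r + gamma * V[next_state]."""
    td_target = reward + gamma * V[next_state]  # uses the current estimate of the next state
    td_error = td_target - V[state]             # how far the current estimate is off
    V[state] += alpha * td_error                # move a small step toward the target
    return V

# The update happens immediately after one transition,
# without waiting for the episode to finish (unlike Monte Carlo).
V = {"A": 0.0, "B": 0.5}
V = td0_update(V, state="A", reward=1.0, next_state="B")
print(V["A"])  # 0.1 * (1.0 + 0.99 * 0.5 - 0.0) = 0.1495
```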
This chapter introduces the fundamentals of TD learning, covering the topics listed below. By the end of the chapter, you will understand how TD methods operate and be able to implement core TD algorithms for prediction and control problems.
5.1 Learning from Incomplete Episodes
5.2 TD(0) Prediction: Estimating Vπ
5.3 Advantages of TD Learning over MC
5.4 SARSA: On-Policy TD Control
5.5 Q-Learning: Off-Policy TD Control
5.6 Comparing SARSA and Q-Learning
5.7 Expected SARSA
5.8 Hands-on Practical: Implementing Q-Learning