As we discussed in the chapter introduction, Dynamic Programming methods rely heavily on having a complete model of the environment. We needed to know the state transition probabilities p(s′,r∣s,a) to compute expected values and find optimal policies. However, in many practical problems, such a model isn't available or is too complex to define accurately. How can an agent learn to make good decisions without knowing the rules of the game beforehand?
Monte Carlo (MC) methods provide an answer by learning directly from experience. Instead of relying on a model, MC methods learn value functions and policies by interacting with the environment and observing the outcomes. The fundamental unit of experience for basic MC methods is the episode.
An episode is a sequence of interactions starting from an initial state and ending at a terminal state. Think of it as one complete playthrough of a task, such as a single game of chess from the opening move to checkmate, or one run through a maze from the start position to the exit.
Each episode consists of a sequence of states, actions, and rewards: S0,A0,R1,S1,A1,R2,…,ST, where T is the final time step and ST is the terminal state.
The core idea behind Monte Carlo methods is straightforward: they estimate value functions based on the average return observed after visiting a state (or state-action pair) over many episodes. They operate by waiting until an entire episode is completed. Only then can the actual return following each state visited during that episode be calculated.
Recall the definition of the return Gt, which is the total discounted reward from time step t onwards:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T$$
Here, γ is the discount factor (0 ≤ γ ≤ 1), and T is the terminal time step of the episode.
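Because the return is just a discounted sum of the rewards that follow step t, it can be computed for every step with a single backward pass once the episode has finished. The sketch below is illustrative only; the reward list and the discount factor value are assumed for the example.

```python
def compute_returns(rewards, gamma=0.9):
    """Compute the return G_t for every time step of a finished episode.

    rewards[t] holds R_{t+1}, the reward received after the action at step t.
    Working backwards lets each G_t be built from G_{t+1} in one pass:
    G_t = R_{t+1} + gamma * G_{t+1}.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Example: a three-step episode with rewards 0, 0, 1 and gamma = 0.9
print(compute_returns([0.0, 0.0, 1.0]))  # [0.81, 0.9, 1.0]
```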
Because MC methods require the full sequence of rewards until the episode terminates to calculate Gt, they can only be applied directly to episodic tasks, that is, tasks that are guaranteed to eventually terminate.
Once an episode finishes, we have a sample sequence of (St,At,Rt+1) tuples. We can then go back and calculate the observed return Gt for every time step t within that episode. If we want to estimate the state-value function Vπ(s), we look at all the times state s was visited across many episodes. For each visit, we calculate the return Gt that followed that visit. The estimate Vπ(s) is simply the average of these observed returns. Similarly, to estimate the action-value function Qπ(s,a), we average the returns observed after taking action a in state s.
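To make this averaging concrete, here is a minimal sketch in Python. The episode format, a list of (state, action, reward) tuples, and the discount factor are assumptions made for illustration; they are not fixed by the discussion above.

```python
from collections import defaultdict

def mc_state_value_estimate(episodes, gamma=0.9):
    """Estimate V_pi(s) by averaging the returns observed after each visit to s.

    `episodes` is a list of finished episodes; each episode is a list of
    (state, action, reward) tuples, where the reward at step t is R_{t+1}.
    """
    returns_sum = defaultdict(float)   # total return accumulated per state
    returns_count = defaultdict(int)   # number of visits recorded per state

    for episode in episodes:
        # Work backwards through the episode to get G_t at every step.
        g = 0.0
        step_returns = []
        for (_, _, reward) in reversed(episode):
            g = reward + gamma * g
            step_returns.append(g)
        step_returns.reverse()

        # Every visit to a state contributes its following return to the average.
        for (state, _, _), g_t in zip(episode, step_returns):
            returns_sum[state] += g_t
            returns_count[state] += 1

    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

An estimate of Qπ(s,a) follows the same pattern, with the sums and counts keyed on (state, action) pairs instead of states alone.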
This process relies on the law of large numbers: as we gather more and more episodes (more samples), the average of the sample returns converges to the true expected return, which is the definition of the value function.
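In practice, this average does not have to be recomputed from all stored returns. A common implementation choice, assumed here rather than required by the method, is to fold each newly sampled return into a running mean:

```python
def update_running_mean(current_estimate, new_return, visit_count):
    """Fold one more sampled return into an existing average.

    After the k-th observed return, the estimate equals the plain average of
    all k returns seen so far, so it converges to the expected return as more
    episodes are gathered.
    """
    return current_estimate + (new_return - current_estimate) / visit_count

# Averaging the returns 2.0, 4.0, 6.0 one at a time gives their mean, 4.0.
v = 0.0
for k, g in enumerate([2.0, 4.0, 6.0], start=1):
    v = update_running_mean(v, g, k)
print(v)  # 4.0
```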
This approach contrasts sharply with Dynamic Programming. DP methods use the environment's model (p(s′,r∣s,a)) to bootstrap, updating value estimates based on other value estimates one step ahead. MC methods, on the other hand, do not bootstrap. They use the actual, complete return Gt observed from experience. This independence from a model is a significant advantage, but it also means we must wait until the end of an episode to make any updates.
In the following sections, we will examine how this principle of learning from complete episodes is used for both predicting values (estimating Vπ or Qπ for a given policy π) and for finding optimal policies (control).