Dynamic Programming methods rely heavily on having a complete model of the environment. These methods require knowledge of state transition probabilities to compute expected values and find optimal policies. However, in many practical problems, such a model isn't available or is too complex to define accurately. How can an agent learn to make good decisions without knowing the rules of the game beforehand?
Monte Carlo (MC) methods provide an answer by learning directly from experience. Instead of relying on a model, MC methods learn value functions and policies by interacting with the environment and observing the outcomes. The fundamental unit of experience for basic MC methods is the episode.
An episode is a sequence of interactions starting from an initial state and ending at a terminal state. Think of it as one complete playthrough of a task: a single game of chess from the opening move until checkmate or a draw, or one run through a maze from the start square to the exit.
Each episode consists of a sequence of states, actions, and rewards: $S_0, A_0, R_1, S_1, A_1, R_2, \dots, S_{T-1}, A_{T-1}, R_T, S_T$, where $T$ is the final time step and $S_T$ is the terminal state.
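To make this concrete, here is a minimal sketch of how one such episode might be stored in code. The state and action labels are hypothetical, chosen purely for illustration:

```python
# One complete episode, stored as the reward received after each (state, action).
# Index t holds (S_t, A_t, R_{t+1}); the terminal state itself is not stored,
# since no action is taken from it.
episode = [
    ("start",    "right",   -1.0),
    ("corridor", "right",   -1.0),
    ("doorway",  "forward", 10.0),  # reaching the goal ends the episode
]
```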
The core idea behind Monte Carlo methods is straightforward: they estimate value functions based on the average return observed after visiting a state (or state-action pair) over many episodes. They operate by waiting until an entire episode is completed. Only then can the actual return following each state visited during that episode be calculated.
Recall the definition of the return $G_t$, which is the total discounted reward from time step $t$ onwards:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$$

Here, $\gamma$ is the discount factor ($0 \le \gamma \le 1$), and $T$ is the terminal time step of the episode.
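As a quick illustration of this formula, the sketch below computes $G_t$ for every time step of an episode by working backward from the end, using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$ with $G_T = 0$. The reward list and discount factor are arbitrary example values:

```python
def compute_returns(rewards, gamma=0.9):
    """Compute G_t for every time step, given the rewards R_1, ..., R_T of one episode."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Work backward: G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Rewards observed after steps t = 0, 1, 2 of the episode above.
print(compute_returns([-1.0, -1.0, 10.0], gamma=0.9))
# [6.2, 8.0, 10.0]  (up to floating-point rounding)
```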
Because MC methods require the full sequence of rewards until the episode terminates to calculate $G_t$, they can only be applied directly to episodic tasks, that is, tasks that are guaranteed to eventually terminate.
Once an episode finishes, we have a sample sequence of $(S_t, A_t, R_{t+1})$ tuples. We can then go back and calculate the observed return $G_t$ for every time step $t$ within that episode. If we want to estimate the state-value function $v_\pi(s)$, we look at all the times state $s$ was visited across many episodes. For each visit, we calculate the return that followed that visit. The estimate is simply the average of these observed returns. Similarly, to estimate the action-value function $q_\pi(s, a)$, we average the returns observed after taking action $a$ in state $s$.
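Putting these pieces together, here is a minimal sketch of Monte Carlo prediction for $v_\pi$ that averages the return following every visit to a state, assuming episodes have already been collected as lists of (state, action, reward) tuples as above. The function and variable names are our own, not from any particular library:

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=0.9):
    """Estimate v_pi(s) by averaging the return observed after each visit to s."""
    returns_sum = defaultdict(float)   # total return observed after visits to s
    visit_count = defaultdict(int)     # number of visits to s
    for episode in episodes:
        # Compute G_t for every time step by working backward through the episode.
        g = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, _, reward = episode[t]
            g = reward + gamma * g
            returns[t] = g
        # Add the return following each visit to the running averages.
        for t, (state, _, _) in enumerate(episode):
            returns_sum[state] += returns[t]
            visit_count[state] += 1
    return {s: returns_sum[s] / visit_count[s] for s in returns_sum}

# Usage: value_estimates = mc_prediction([episode], gamma=0.9)
```

With only one episode, each estimate is a single sample return; as more episodes are added, the averages settle toward $v_\pi(s)$, which is the point made next.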
This process relies on the law of large numbers: as we gather more and more episodes (more samples), the average of the sample returns converges to the true expected return, which is the definition of the value function.
This approach contrasts sharply with Dynamic Programming. DP methods use the environment's model ($p(s', r \mid s, a)$) to bootstrap, updating value estimates based on other value estimates one step ahead. MC methods, on the other hand, do not bootstrap. They use the actual, complete return observed from experience. This independence from a model is a significant advantage, but it also means we must wait until the end of an episode to make any updates.
In the following sections, we will examine how this principle of learning from complete episodes is used for both predicting values (estimating $v_\pi$ or $q_\pi$ for a given policy $\pi$) and for finding optimal policies (control).