On-policy Monte Carlo methods evaluate or improve the same policy that is used to generate the episode data. For example, on-policy control techniques such as On-Policy First-Visit MC Control typically employ an $\epsilon$-greedy version of the current policy to gather experience, and this experience is then used to refine that same policy.
But what if we want to learn about a policy different from the one that generated the data? Consider scenarios such as learning the optimal policy while following a more exploratory policy, or learning from experience generated by another agent or a human demonstrator.
This is where off-policy learning comes in. In off-policy methods, the policy used to generate the data, called the behavior policy (often denoted by $b$), is different from the policy being evaluated and improved, called the target policy (often denoted by $\pi$). The behavior policy must explore sufficiently; specifically, every action that might be taken by $\pi$ must have a non-zero probability of being taken by $b$ in any state where $\pi$ might take it. This is known as the assumption of coverage. Formally, if $\pi(a \mid s) > 0$, then $b(a \mid s) > 0$.
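As a quick sanity check, here is a minimal Python sketch (the function name `has_coverage` and the array representation are ours, not from this text) that verifies coverage for tabular policies stored as state-by-action probability arrays:

```python
import numpy as np

def has_coverage(target_probs, behavior_probs):
    """Check the coverage assumption: wherever pi(a|s) > 0, b(a|s) must also be > 0.

    target_probs, behavior_probs: arrays of shape (n_states, n_actions)
    holding pi(a|s) and b(a|s).
    """
    target_probs = np.asarray(target_probs)
    behavior_probs = np.asarray(behavior_probs)
    # A violation is a state-action pair the target policy can take
    # but the behavior policy never will.
    violations = (target_probs > 0) & (behavior_probs == 0)
    return not violations.any()

# Example: a deterministic target policy is covered by any epsilon-soft behavior policy.
pi = np.array([[1.0, 0.0], [0.0, 1.0]])   # greedy target policy
b = np.array([[0.9, 0.1], [0.1, 0.9]])    # epsilon-soft behavior policy
print(has_coverage(pi, b))  # True
```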
The fundamental challenge in off-policy learning is that the data (states visited, actions taken, rewards received) comes from interactions governed by the behavior policy $b$, but we want to estimate the value function ($v_\pi$ or $q_\pi$) for the target policy $\pi$. The returns observed after time $t$ depend on the actions taken by $b$ from time $t$ onwards. If we simply averaged these returns as we did in on-policy methods, we would be estimating $v_b$ or $q_b$, not $v_\pi$ or $q_\pi$.
How can we correct for this mismatch? We need a way to weight the returns observed under $b$ to make them representative of what would have happened under $\pi$.
The technique used to address this is importance sampling, a general statistical method for estimating properties of one distribution using samples generated from another. In our RL context, we want to estimate the expected return under the target policy $\pi$, using returns generated by the behavior policy $b$.
The core idea is to weight each return by the relative probability of the trajectory occurring under the target policy versus the behavior policy. This ratio is called the importance sampling ratio. For the sequence of states and actions $A_t, S_{t+1}, A_{t+1}, \dots, S_T$ occurring from time $t$ to the end of the episode at time $T$, the importance sampling ratio is:

$$\rho_{t:T-1} = \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}$$
Notice that the environment dynamics, $p(S_{k+1} \mid S_k, A_k)$, appear identically in the numerator and denominator and cancel out. This is fortunate for model-free methods, since we do not need to know $p$. The ratio simplifies considerably:

$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$
This ratio measures how much more (or less) likely the observed sequence of actions was under the target policy $\pi$ compared to the behavior policy $b$.
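As a concrete sketch, the simplified ratio can be computed directly from an episode. The helper below is illustrative only; it assumes `target_policy(s, a)` and `behavior_policy(s, a)` return the probabilities $\pi(a \mid s)$ and $b(a \mid s)$:

```python
def importance_sampling_ratio(episode, target_policy, behavior_policy, start=0):
    """Compute rho_{t:T-1} = prod_k pi(A_k|S_k) / b(A_k|S_k) for one episode.

    episode: list of (state, action, reward) tuples generated by the behavior policy.
    """
    rho = 1.0
    for state, action, _reward in episode[start:]:
        b_prob = behavior_policy(state, action)
        if b_prob == 0.0:
            raise ValueError("Coverage violated: b(a|s) = 0 for an observed action.")
        rho *= target_policy(state, action) / b_prob
    return rho
```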
To estimate $v_\pi(s)$ using episodes generated by $b$, we can collect the returns $G_t$ following the first visit to state $s$ at time $t$ in each episode. We then weight each return by the corresponding importance sampling ratio $\rho_{t:T(t)-1}$ and average these weighted returns. This gives us the ordinary importance sampling estimator:

$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$$
Here, $\mathcal{T}(s)$ is the set of all time steps at which state $s$ is visited for the first time in its episode, across all episodes, and $T(t)$ is the termination time of the episode containing time step $t$.
Alternatively, we can use weighted importance sampling, which often has lower variance:

$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$$
Weighted importance sampling provides a biased estimate but is generally preferred in practice due to its lower variance, especially when importance sampling ratios can become very large or small. Similar estimators can be constructed for the action-value function $q_\pi(s, a)$.
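The following sketch computes both estimators for $V(s)$ from a batch of episodes generated by $b$. The conventions are ours, not a fixed API: each episode is a list of `(state, action, reward)` tuples, and the policy arguments return action probabilities.

```python
from collections import defaultdict

def mc_prediction_off_policy(episodes, target_policy, behavior_policy, gamma=1.0):
    """First-visit off-policy MC prediction of v_pi from episodes generated by b.

    Returns (V_ordinary, V_weighted): the ordinary and weighted
    importance-sampling estimates for each visited state.
    """
    weighted_sums = defaultdict(float)  # sum of rho * G per state
    rho_sums = defaultdict(float)       # sum of rho per state (weighted IS denominator)
    counts = defaultdict(int)           # |T(s)| per state (ordinary IS denominator)

    for episode in episodes:
        # Record the first visit time of each state in this episode.
        first_visit = {}
        for t, (s, _, _) in enumerate(episode):
            first_visit.setdefault(s, t)

        # Walk backward, accumulating the return G and the ratio rho_{t:T-1}.
        G, rho = 0.0, 1.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            rho *= target_policy(s, a) / behavior_policy(s, a)
            if first_visit[s] == t:
                weighted_sums[s] += rho * G
                rho_sums[s] += rho
                counts[s] += 1

    V_ordinary = {s: weighted_sums[s] / counts[s] for s in counts}
    V_weighted = {s: (weighted_sums[s] / rho_sums[s]) if rho_sums[s] > 0 else 0.0
                  for s in counts}
    return V_ordinary, V_weighted
```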
The agent generates experience (actions, states, rewards) by following the behavior policy $b$. Importance sampling allows the agent to use this experience to learn the value function or an optimal policy corresponding to the target policy $\pi$.
Extending this to control (finding an optimal policy) involves estimating $q_\pi$ using importance sampling. The target policy $\pi$ is typically chosen to be greedy with respect to the current estimate of $Q$. The behavior policy $b$, however, must remain exploratory (e.g., $\epsilon$-soft or uniformly random) to ensure sufficient coverage of the potentially deterministic target policy.
A common setup involves a behavior policy $b$ that is $\epsilon$-soft (or simply uniformly random), which guarantees coverage, and a target policy $\pi$ that is deterministic and greedy with respect to the current action-value estimate $Q$.
We then use the returns from episodes generated by $b$, weighted by the importance sampling ratios, to update the estimates of $q_\pi(s, a)$. Since $\pi$ is often greedy/deterministic, $\pi(A_k \mid S_k)$ will be 1 if $A_k$ is the greedy action and 0 otherwise. If the target policy is deterministic, the product in the numerator becomes 1 only if all actions $A_k$ (for $k$ from $t+1$ to $T-1$) were the greedy choices according to $\pi$. If any action taken under $b$ was non-greedy according to $\pi$, the importance sampling ratio becomes 0, and that return contributes nothing to the estimate for $q_\pi$.
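Putting these pieces together, here is a sketch of off-policy MC control with weighted importance sampling in its incremental form (in the style of Sutton and Barto's algorithm). It assumes a finite action set, a uniform-random behavior policy, and a user-supplied `generate_episode` function; all of these are assumptions made for the sketch.

```python
import random
from collections import defaultdict

def off_policy_mc_control(generate_episode, actions, num_episodes, gamma=1.0):
    """Off-policy MC control with weighted importance sampling (incremental form).

    generate_episode(behavior_policy) must return a list of (state, action, reward)
    tuples produced by following the given behavior policy; `actions` is the
    finite action set.
    """
    Q = defaultdict(lambda: defaultdict(float))   # action-value estimates Q[s][a]
    C = defaultdict(lambda: defaultdict(float))   # cumulative weights C[s][a]
    greedy = {}                                   # current greedy (target) policy

    def behavior_policy(state):
        # Simple exploratory behavior policy: uniform random over actions.
        return random.choice(actions)

    for _ in range(num_episodes):
        episode = generate_episode(behavior_policy)
        G, W = 0.0, 1.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            C[s][a] += W
            Q[s][a] += (W / C[s][a]) * (G - Q[s][a])
            greedy[s] = max(actions, key=lambda act: Q[s][act])
            if a != greedy[s]:
                break                 # pi(a|s) = 0, so the ratio is 0 from here back
            W *= len(actions)         # pi(a|s) = 1 divided by b(a|s) = 1/|A|
    return Q, greedy
```

Because the target policy is greedy, the backward loop can stop as soon as the observed action differs from the greedy one: the ratio, and hence the weight of every earlier step in that episode, would be zero.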
While powerful, importance sampling introduces its own challenges. The primary issue is variance. If the target and behavior policies are significantly different, the importance sampling ratios can become extremely large or small. A few ratios might dominate the average, leading to high variance in the value estimates and slow convergence. This is particularly problematic for long episodes, as the ratio is a product over many steps. Weighted importance sampling helps mitigate this to some extent, but variance remains a significant concern in off-policy MC methods.
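A quick numeric illustration of this blow-up: with a deterministic target policy and a uniform-random behavior policy over four actions, every greedy step multiplies the ratio by $1/0.25 = 4$, so the ratio grows exponentially with episode length.

```python
# Each greedy step multiplies the ratio by pi/b = 1 / 0.25 = 4.
per_step_ratio = 4.0
for episode_length in (5, 10, 20, 50):
    print(episode_length, per_step_ratio ** episode_length)
# 5 -> 1024.0, 10 -> ~1.0e6, 20 -> ~1.1e12, 50 -> ~1.3e30
```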
In the following sections and chapters, we will see how Temporal Difference learning methods offer alternative approaches to off-policy learning that often have lower variance than Monte Carlo methods. However, understanding the principles of off-policy learning and importance sampling with MC methods provides a solid foundation.