So far, we've explored on-policy Monte Carlo methods. These methods evaluate or improve the same policy that is used to generate the data (the episodes). For instance, in On-Policy First-Visit MC Control, we used an ϵ-greedy version of the current policy to collect experience and then used that experience to improve the same policy.
But what if we want to learn about a policy different from the one that generated the data? For example, we might want to learn the value of the optimal (greedy) policy while following a more exploratory policy, or learn from experience generated by another agent, a human demonstrator, or an earlier version of our own policy.
This is where off-policy learning comes in. In off-policy methods, the policy used to generate the data, called the behavior policy (often denoted by b), is different from the policy being evaluated and improved, called the target policy (often denoted by π). The behavior policy b must explore sufficiently; specifically, every action taken by π must have a non-zero probability of being taken by b in any state where π might take it. This is known as the assumption of coverage. Formally, if π(a∣s)>0, then b(a∣s)>0.
The fundamental challenge in off-policy learning is that the data (states visited, actions taken, rewards received) comes from interactions governed by the behavior policy b, but we want to estimate the value function (Vπ or Qπ) for the target policy π. The returns Gt observed after time t depend on the actions taken by b from time t onwards. If we simply averaged these returns as we did in on-policy methods, we would be estimating Vb or Qb, not Vπ or Qπ.
How can we correct for this mismatch? We need a way to weight the returns observed under b to make them representative of what would have happened under π.
The technique used to address this is importance sampling, a general statistical method for estimating properties of one distribution using samples generated from another. In our RL context, we want to estimate the expected return under the target policy π, using returns generated by the behavior policy b.
The core idea is to weight each return Gt by the relative probability of the trajectory occurring under the target policy versus the behavior policy. This ratio is called the importance sampling ratio. For a sequence of states and actions St,At,St+1,At+1,...,ST occurring after time t within an episode, the importance sampling ratio ρt:T−1 is:
$$
\rho_{t:T-1} = \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}
$$

Notice that the environment dynamics p(Sk+1∣Sk,Ak) appear identically in the numerator and denominator, so they cancel. This is fortunate for model-free methods, where we don't know p, and it simplifies the ratio considerably:

$$
\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}
$$

This ratio ρt:T−1 measures how much more (or less) likely the observed sequence of actions At,...,AT−1 was under the target policy π compared to the behavior policy b.
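As a concrete illustration, here is a minimal sketch of computing this ratio for one trajectory. It assumes the policies are supplied as functions (`target_policy` and `behavior_policy` are names chosen for this example) that return the probability of an action in a state:

```python
def importance_sampling_ratio(states, actions, target_policy, behavior_policy):
    """Compute rho_{t:T-1} for a sequence of (state, action) pairs.

    target_policy(s, a) and behavior_policy(s, a) are assumed to return the
    probability of taking action a in state s under pi and b, respectively.
    """
    rho = 1.0
    for s, a in zip(states, actions):
        pi_prob = target_policy(s, a)
        if pi_prob == 0.0:
            return 0.0                              # trajectory impossible under pi
        rho *= pi_prob / behavior_policy(s, a)      # coverage guarantees b(a|s) > 0 here
    return rho


# Hypothetical example: deterministic target policy vs. an epsilon-soft behavior
# policy over two actions (0 and 1), with action 0 greedy in every state.
eps, n_actions = 0.1, 2
pi = lambda s, a: 1.0 if a == 0 else 0.0
b = lambda s, a: 1 - eps + eps / n_actions if a == 0 else eps / n_actions

print(importance_sampling_ratio([0, 1, 2], [0, 0, 0], pi, b))  # ~1.17: all actions greedy
print(importance_sampling_ratio([0, 1, 2], [0, 1, 0], pi, b))  # 0.0: one non-greedy action
```

Note how a single action with π(a∣s)=0 drives the entire ratio to zero, a point that becomes important for deterministic target policies in the control setting below.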
To estimate Vπ(s) using episodes generated by b, we can collect returns Gt following the first visit to state s at time t in each episode. We then weight each return by the corresponding importance sampling ratio ρt:T−1 and average these weighted returns. This gives us the ordinary importance sampling estimator:
$$
V_\pi(s) \approx \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{|\mathcal{T}(s)|}
$$

Here, T(s) is the set of all time steps t at which state s is visited for the first time in its episode, across all episodes, and T(t) is the termination time of the episode containing time step t.
Alternatively, we can use weighted importance sampling, which often has lower variance:
$$
V_\pi(s) \approx \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}
$$

Weighted importance sampling provides a biased estimate but is generally preferred in practice due to its lower variance, especially when importance sampling ratios can become very large or small. Similar estimators can be constructed for the action-value function Qπ(s,a).
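The two estimators differ only in their denominators, which a short sketch makes concrete. It assumes we have already collected, for a particular state s, the first-visit returns Gt and their ratios ρt:T(t)−1 (the lists `returns` and `ratios` below are hypothetical inputs):

```python
def ordinary_is_estimate(returns, ratios):
    """Ordinary importance sampling: average of rho * G over all first visits."""
    if not returns:
        return 0.0
    weighted = [rho * g for rho, g in zip(ratios, returns)]
    return sum(weighted) / len(weighted)


def weighted_is_estimate(returns, ratios):
    """Weighted importance sampling: ratio-weighted average of the returns."""
    denom = sum(ratios)
    if denom == 0.0:
        return 0.0          # no observed trajectory was compatible with pi
    return sum(rho * g for rho, g in zip(ratios, returns)) / denom


# Three hypothetical first visits to state s with their returns and ratios:
G   = [10.0, 4.0, 7.0]
rho = [1.2, 0.0, 3.5]
print(ordinary_is_estimate(G, rho))   # (12 + 0 + 24.5) / 3   ~= 12.17
print(weighted_is_estimate(G, rho))   # (12 + 0 + 24.5) / 4.7 ~= 7.77
```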
The agent generates experience (actions, states, rewards) by following the behavior policy b. Importance sampling allows the agent to use this experience to learn the value function or an optimal policy corresponding to the target policy π.
Extending this to control (finding an optimal policy) involves estimating Qπ(s,a) using importance sampling. The target policy π is typically chosen to be greedy with respect to the current estimate of Q(s,a). The behavior policy b, however, must remain exploratory (e.g., ϵ-soft or random) to ensure sufficient coverage for the potentially deterministic target policy.
A common setup involves a target policy π that is greedy with respect to the current action-value estimate Q(s,a), and a behavior policy b that is ϵ-soft with respect to the same estimate, so that b continues to explore while staying close to π.
We then use the returns Gt from episodes generated by b, weighted by the importance sampling ratio ρt:T−1=∏k=tT−1 π(Ak∣Sk)/b(Ak∣Sk), to update the estimates for Qπ(s,a). Since π is greedy and therefore deterministic, π(Ak∣Sk) is 1 if Ak is the greedy action and 0 otherwise, so the product in the numerator is 1 only if every action Ak (for k=t to T−1) was the greedy choice according to π. If any action taken under b was non-greedy according to π, the importance sampling ratio becomes 0, and that return contributes nothing to the estimate for Qπ.
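Putting these pieces together, the following sketch follows the standard incremental, weighted importance sampling form of off-policy MC control. The environment interface (`env.reset()` returning a state and `env.step(a)` returning `(next_state, reward, done)`) and the particular ϵ-soft behavior policy are assumptions made for this example:

```python
import random
from collections import defaultdict

def off_policy_mc_control(env, actions, n_episodes=10_000, gamma=1.0, eps=0.1):
    """Off-policy MC control with incremental weighted importance sampling.

    Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done),
    and a finite action list `actions`. The target policy pi is greedy in Q;
    the behavior policy b is epsilon-soft around the current greedy action.
    """
    Q = defaultdict(float)      # action-value estimates Q[(s, a)]
    C = defaultdict(float)      # cumulative importance weights C[(s, a)]
    greedy = {}                 # target policy: state -> greedy action

    for _ in range(n_episodes):
        # Generate an episode with b, recording b's probability of each action taken.
        episode, state, done = [], env.reset(), False
        while not done:
            g = greedy.get(state, actions[0])   # default greedy choice for unseen states
            probs = {a: eps / len(actions) + (1 - eps) * (a == g) for a in actions}
            action = random.choices(actions, weights=[probs[a] for a in actions])[0]
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward, probs[action]))
            state = next_state

        # Backward pass: incremental weighted importance sampling updates.
        G, W = 0.0, 1.0
        for state, action, reward, b_prob in reversed(episode):
            G = gamma * G + reward
            C[(state, action)] += W
            Q[(state, action)] += (W / C[(state, action)]) * (G - Q[(state, action)])
            greedy[state] = max(actions, key=lambda a: Q[(state, a)])
            if action != greedy[state]:
                break            # pi(action | state) = 0, so the ratio is 0 from here back
            W *= 1.0 / b_prob    # pi(action | state) = 1 for the greedy action
    return Q, greedy
```

Because π is greedy, the backward pass stops at the first action that disagrees with the current greedy choice; from that point back in the episode the importance sampling ratio is zero, so earlier steps cannot be updated from that episode.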
While powerful, importance sampling introduces its own challenges. The primary issue is variance. If the target and behavior policies are significantly different, the importance sampling ratios ρt:T−1 can become extremely large or small. A few ratios might dominate the average, leading to high variance in the value estimates and slow convergence. This is particularly problematic for long episodes, as the ratio is a product over many steps. Weighted importance sampling helps mitigate this to some extent, but variance remains a significant concern in off-policy MC methods.
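A small back-of-the-envelope calculation shows how severe this can become. Assume a deterministic target policy and a behavior policy that happens to choose the target's action with probability 0.9 at every step (both numbers are purely illustrative):

```python
# With a deterministic pi, the ratio over an episode of length L is (1/0.9)^L when
# b happens to follow pi at every step, and 0 otherwise.
p = 0.9
for L in (10, 50, 200):
    prob_nonzero = p ** L            # probability the whole trajectory matches pi
    value_if_nonzero = (1 / p) ** L  # size of the ratio when it does
    print(f"L={L:3d}  P(rho != 0) = {prob_nonzero:.2e}  rho if nonzero = {value_if_nonzero:.2e}")
```

The product of the two printed quantities is always 1, so the expected ratio is 1 and ordinary importance sampling remains unbiased, but for long episodes the observed ratio is zero almost every time and enormous on the rare occasions the trajectories agree, which is exactly the high-variance behavior described above.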
In the following sections and chapters, we will see how Temporal Difference learning methods offer alternative approaches to off-policy learning that often have lower variance than Monte Carlo methods. However, understanding the principles of off-policy learning and importance sampling with MC methods provides a solid foundation.