So far, we've explored on-policy Monte Carlo methods. These methods evaluate or improve the same policy that is used to generate the data (the episodes). For instance, in On-Policy First-Visit MC Control, we used an ϵ-greedy version of the current policy to collect experience and then used that experience to improve the same policy.
But what if we want to learn about a policy different from the one that generated the data? For example, we might want to learn the value of the optimal (greedy) policy while following a more exploratory policy, or learn from experience generated by another agent, a human demonstrator, or an earlier version of our own policy.
This is where off-policy learning comes in. In off-policy methods, the policy used to generate the data, called the behavior policy (often denoted by b), is different from the policy being evaluated and improved, called the target policy (often denoted by π). The behavior policy b must explore sufficiently; specifically, every action taken by π must have a non-zero probability of being taken by b in any state where π might take it. This is known as the assumption of coverage. Formally, if π(a∣s)>0, then b(a∣s)>0.
The fundamental challenge in off-policy learning is that the data (states visited, actions taken, rewards received) comes from interactions governed by the behavior policy b, but we want to estimate the value function (Vπ or Qπ) for the target policy π. The returns Gt observed after time t depend on the actions taken by b from time t onwards. If we simply averaged these returns as we did in on-policy methods, we would be estimating Vb or Qb, not Vπ or Qπ.
How can we correct for this mismatch? We need a way to weight the returns observed under b to make them representative of what would have happened under π.
The technique used to address this is importance sampling, a general statistical method for estimating properties of one distribution using samples generated from another. In our RL context, we want to estimate the expected return under the target policy π, using returns generated by the behavior policy b.
The core idea is to weight each return Gt by the relative probability of the trajectory occurring under the target policy versus the behavior policy. This ratio is called the importance sampling ratio. For a sequence of states and actions St,At,St+1,At+1,...,ST occurring after time t within an episode, the importance sampling ratio ρt:T−1 is:
$$
\rho_{t:T-1} = \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}
$$

Notice that the environment dynamics p(Sk+1∣Sk,Ak) appear identically in the numerator and denominator, so they cancel. This is fortunate for model-free methods, where we don't know p, and it simplifies the ratio considerably:

$$
\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}
$$

This ratio ρt:T−1 measures how much more (or less) likely the observed sequence of actions At,...,AT−1 was under the target policy π compared to the behavior policy b.
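As a concrete illustration, here is a minimal sketch of computing this ratio for one trajectory. It assumes the policies are supplied as functions (`target_policy` and `behavior_policy` are names chosen for this example) that return the probability of an action in a state:

```python
def importance_sampling_ratio(states, actions, target_policy, behavior_policy):
    """Compute rho_{t:T-1} for a sequence of (state, action) pairs.

    target_policy(s, a) and behavior_policy(s, a) are assumed to return the
    probability of taking action a in state s under pi and b, respectively.
    """
    rho = 1.0
    for s, a in zip(states, actions):
        pi_prob = target_policy(s, a)
        if pi_prob == 0.0:
            return 0.0                              # trajectory impossible under pi
        rho *= pi_prob / behavior_policy(s, a)      # coverage guarantees b(a|s) > 0 here
    return rho


# Hypothetical example: deterministic target policy vs. an epsilon-soft behavior
# policy over two actions (0 and 1), with action 0 greedy in every state.
eps, n_actions = 0.1, 2
pi = lambda s, a: 1.0 if a == 0 else 0.0
b = lambda s, a: 1 - eps + eps / n_actions if a == 0 else eps / n_actions

print(importance_sampling_ratio([0, 1, 2], [0, 0, 0], pi, b))  # ~1.17: all actions greedy
print(importance_sampling_ratio([0, 1, 2], [0, 1, 0], pi, b))  # 0.0: one non-greedy action
```

Note how a single action with π(a∣s)=0 drives the entire ratio to zero, a point that becomes important for deterministic target policies in the control setting below.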
To estimate Vπ(s) using episodes generated by b, we can collect returns Gt following the first visit to state s at time t in each episode. We then weight each return by the corresponding importance sampling ratio ρt:T−1 and average these weighted returns. This gives us the ordinary importance sampling estimator:
$$
V_\pi(s) \approx \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{|\mathcal{T}(s)|}
$$

Here, T(s) is the set of all time steps t at which state s is visited for the first time in its episode, across all episodes, and T(t) is the termination time of the episode containing time step t.
Alternatively, we can use weighted importance sampling, which often has lower variance:
$$
V_\pi(s) \approx \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}
$$

Weighted importance sampling provides a biased estimate but is generally preferred in practice due to its lower variance, especially when importance sampling ratios can become very large or small. Similar estimators can be constructed for the action-value function Qπ(s,a).
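The two estimators differ only in their denominators, which a short sketch makes concrete. It assumes we have already collected, for a particular state s, the first-visit returns Gt and their ratios ρt:T(t)−1 (the lists `returns` and `ratios` below are hypothetical inputs):

```python
def ordinary_is_estimate(returns, ratios):
    """Ordinary importance sampling: average of rho * G over all first visits."""
    if not returns:
        return 0.0
    weighted = [rho * g for rho, g in zip(ratios, returns)]
    return sum(weighted) / len(weighted)


def weighted_is_estimate(returns, ratios):
    """Weighted importance sampling: ratio-weighted average of the returns."""
    denom = sum(ratios)
    if denom == 0.0:
        return 0.0          # no observed trajectory was compatible with pi
    return sum(rho * g for rho, g in zip(ratios, returns)) / denom


# Three hypothetical first visits to state s with their returns and ratios:
G   = [10.0, 4.0, 7.0]
rho = [1.2, 0.0, 3.5]
print(ordinary_is_estimate(G, rho))   # (12 + 0 + 24.5) / 3   ~= 12.17
print(weighted_is_estimate(G, rho))   # (12 + 0 + 24.5) / 4.7 ~= 7.77
```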
The agent generates experience (actions, states, rewards) by following the behavior policy b. Importance sampling allows the agent to use this experience to learn the value function or an optimal policy corresponding to the target policy π.
Extending this to control (finding an optimal policy) involves estimating Qπ(s,a) using importance sampling. The target policy π is typically chosen to be greedy with respect to the current estimate of Q(s,a). The behavior policy b, however, must remain exploratory (e.g., ϵ-soft or random) to ensure sufficient coverage for the potentially deterministic target policy.
A common setup involves a target policy π that is greedy with respect to the current action-value estimate Q(s,a), and a behavior policy b that is ϵ-soft with respect to the same estimate, so that b continues to explore while staying close to π.
We then use the returns Gt from episodes generated by b, weighted by the importance sampling ratio ρt:T−1=∏k=tT−1 π(Ak∣Sk)/b(Ak∣Sk), to update the estimates for Qπ(s,a). Since π is greedy and therefore deterministic, π(Ak∣Sk) is 1 if Ak is the greedy action and 0 otherwise, so the product in the numerator is 1 only if every action Ak (for k=t to T−1) was the greedy choice according to π. If any action taken under b was non-greedy according to π, the importance sampling ratio becomes 0, and that return contributes nothing to the estimate for Qπ.
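Putting these pieces together, the following sketch follows the standard incremental, weighted importance sampling form of off-policy MC control. The environment interface (`env.reset()` returning a state and `env.step(a)` returning `(next_state, reward, done)`) and the particular ϵ-soft behavior policy are assumptions made for this example:

```python
import random
from collections import defaultdict

def off_policy_mc_control(env, actions, n_episodes=10_000, gamma=1.0, eps=0.1):
    """Off-policy MC control with incremental weighted importance sampling.

    Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done),
    and a finite action list `actions`. The target policy pi is greedy in Q;
    the behavior policy b is epsilon-soft around the current greedy action.
    """
    Q = defaultdict(float)      # action-value estimates Q[(s, a)]
    C = defaultdict(float)      # cumulative importance weights C[(s, a)]
    greedy = {}                 # target policy: state -> greedy action

    for _ in range(n_episodes):
        # Generate an episode with b, recording b's probability of each action taken.
        episode, state, done = [], env.reset(), False
        while not done:
            g = greedy.get(state, actions[0])   # default greedy choice for unseen states
            probs = {a: eps / len(actions) + (1 - eps) * (a == g) for a in actions}
            action = random.choices(actions, weights=[probs[a] for a in actions])[0]
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward, probs[action]))
            state = next_state

        # Backward pass: incremental weighted importance sampling updates.
        G, W = 0.0, 1.0
        for state, action, reward, b_prob in reversed(episode):
            G = gamma * G + reward
            C[(state, action)] += W
            Q[(state, action)] += (W / C[(state, action)]) * (G - Q[(state, action)])
            greedy[state] = max(actions, key=lambda a: Q[(state, a)])
            if action != greedy[state]:
                break            # pi(action | state) = 0, so the ratio is 0 from here back
            W *= 1.0 / b_prob    # pi(action | state) = 1 for the greedy action
    return Q, greedy
```

Because π is greedy, the backward pass stops at the first action that disagrees with the current greedy choice; from that point back in the episode the importance sampling ratio is zero, so earlier steps cannot be updated from that episode.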
While powerful, importance sampling introduces its own challenges. The primary issue is variance. If the target and behavior policies are significantly different, the importance sampling ratios ρt:T−1 can become extremely large or small. A few ratios might dominate the average, leading to high variance in the value estimates and slow convergence. This is particularly problematic for long episodes, as the ratio is a product over many steps. Weighted importance sampling helps mitigate this to some extent, but variance remains a significant concern in off-policy MC methods.
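A small back-of-the-envelope calculation shows how severe this can become. Assume a deterministic target policy and a behavior policy that happens to choose the target's action with probability 0.9 at every step (both numbers are purely illustrative):

```python
# With a deterministic pi, the ratio over an episode of length L is (1/0.9)^L when
# b happens to follow pi at every step, and 0 otherwise.
p = 0.9
for L in (10, 50, 200):
    prob_nonzero = p ** L            # probability the whole trajectory matches pi
    value_if_nonzero = (1 / p) ** L  # size of the ratio when it does
    print(f"L={L:3d}  P(rho != 0) = {prob_nonzero:.2e}  rho if nonzero = {value_if_nonzero:.2e}")
```

The product of the two printed quantities is always 1, so the expected ratio is 1 and ordinary importance sampling remains unbiased, but for long episodes the observed ratio is zero almost every time and enormous on the rare occasions the trajectories agree, which is exactly the high-variance behavior described above.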
In the following sections and chapters, we will see how Temporal Difference learning methods offer alternative approaches to off-policy learning that often have lower variance than Monte Carlo methods. However, understanding the principles of off-policy learning and importance sampling with MC methods provides a solid foundation.