In the quest to evaluate a new policy $\pi_e$ using only data collected by a potentially different behavior policy $\pi_b$, Importance Sampling (IS) emerges as a fundamental statistical technique. The core idea is elegant: re-weight the returns observed in the offline dataset to correct for the mismatch between the policies. If a trajectory was more likely under the evaluation policy $\pi_e$ than under the behavior policy $\pi_b$, its return should be weighted more heavily, and vice versa.
Imagine you have a dataset $\mathcal{D} = \{\tau_1, \tau_2, \dots, \tau_N\}$ consisting of $N$ trajectories, where each trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T, a_T, r_T)$ was generated by following the behavior policy $\pi_b$. Our goal is to estimate the expected return of the evaluation policy $\pi_e$, denoted $J(\pi_e) = \mathbb{E}_{\tau \sim \pi_e}[R(\tau)]$, where $R(\tau) = \sum_{t=0}^{T} \gamma^t r_t$ is the cumulative discounted reward of trajectory $\tau$.
Importance Sampling allows us to estimate this expectation using samples from $\pi_b$ by multiplying the return of each trajectory, $R(\tau_i)$, by an "importance ratio" that quantifies the relative probability of that trajectory occurring under $\pi_e$ versus $\pi_b$.
For a given trajectory $\tau$, the probability of observing that sequence of states and actions under a policy $\pi$ is:

$$P(\tau \mid \pi) = p(s_0) \prod_{t=0}^{T} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

The importance ratio for the entire trajectory is the ratio of these probabilities under $\pi_e$ and $\pi_b$:

$$\rho_{0:T}(\tau) = \frac{P(\tau \mid \pi_e)}{P(\tau \mid \pi_b)} = \frac{p(s_0) \prod_{t=0}^{T} \pi_e(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)}{p(s_0) \prod_{t=0}^{T} \pi_b(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)} = \prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}$$

Notice that the initial state distribution $p(s_0)$ and the transition probabilities $p(s_{t+1} \mid s_t, a_t)$ cancel, so the ratio depends only on the two policies and not on the (possibly unknown) environment dynamics. Let $\rho_t = \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}$ be the per-step importance ratio. Then the full trajectory ratio is $\rho_{0:T} = \prod_{t=0}^{T} \rho_t$.
The standard (or "ordinary") Importance Sampling estimator for $J(\pi_e)$ using $N$ trajectories from $\pi_b$ is:

$$\hat{J}_{IS}(\pi_e) = \frac{1}{N} \sum_{i=1}^{N} \rho_{0:T}^{(i)} R(\tau_i)$$

This estimator is unbiased, meaning $\mathbb{E}[\hat{J}_{IS}(\pi_e)] = J(\pi_e)$, which is theoretically appealing. However, its practical utility in offline RL is severely limited by several factors, primarily its potentially enormous variance.
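To make the estimator concrete, here is a minimal sketch in Python. It assumes trajectories are stored as lists of `(state, action, reward)` tuples and that both policies are available as callables returning $\pi(a \mid s)$; these names and the data layout are illustrative choices, not a reference to any particular library.

```python
import numpy as np

def ordinary_is_estimate(trajectories, pi_e, pi_b, gamma=0.99):
    """Ordinary importance sampling estimate of J(pi_e) from data collected by pi_b.

    trajectories: list of trajectories, each a list of (state, action, reward) tuples.
    pi_e, pi_b:   callables pi(a, s) returning the probability of action a in state s.
    """
    per_trajectory_estimates = []
    for traj in trajectories:
        rho = 1.0  # trajectory importance ratio, prod_t pi_e(a_t|s_t) / pi_b(a_t|s_t)
        ret = 0.0  # discounted return R(tau) = sum_t gamma^t r_t
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_e(a, s) / pi_b(a, s)
            ret += (gamma ** t) * r
        per_trajectory_estimates.append(rho * ret)
    # Simple average of the weighted returns over the N trajectories.
    return float(np.mean(per_trajectory_estimates))
```

Everything hinges on the single scalar `rho` accumulated per trajectory; the next paragraphs examine how badly behaved that scalar can be.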
The unbiasedness of the IS estimator comes at a steep price: variance. The variance of $\hat{J}_{IS}(\pi_e)$ depends on the second moment of the weighted returns, specifically involving $\mathbb{E}_{\tau \sim \pi_b}\!\left[\left(\rho_{0:T}(\tau)\right)^2 R(\tau)^2\right]$. The critical term here is the squared trajectory importance ratio, $(\rho_{0:T})^2$.
Consider the structure of $\rho_{0:T} = \prod_{t=0}^{T} \rho_t$. This is a product of potentially many per-step ratios. Even when each $\rho_t$ deviates only modestly from 1, the product can become astronomically large or collapse toward zero, and its second moment multiplies across time steps, so the variance typically grows exponentially with the horizon $T$.
This exponential increase in variance with the horizon length is a major barrier to applying basic IS effectively in typical RL problems, which often involve long sequences of decisions.
As the horizon length increases, the distribution of importance weights tends to become heavily skewed, with many weights close to zero and a few extremely large weights, leading to high variance in the IS estimate. (Conceptual illustration).
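This skewness is easy to reproduce with a short simulation. The toy below uses state-independent two-action policies chosen purely for illustration (the specific probabilities 0.8/0.2 versus 0.5/0.5 are arbitrary assumptions), samples actions under the behavior policy, and summarizes the resulting trajectory weights at a few horizons.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy, state-independent policies over two actions (illustrative values):
# the behavior policy picks each action with prob 0.5; the evaluation policy prefers action 0.
p_b = np.array([0.5, 0.5])
p_e = np.array([0.8, 0.2])

n_traj = 100_000
for horizon in (5, 20, 50):
    # Sample every action of every trajectory from the behavior policy.
    actions = rng.choice(2, size=(n_traj, horizon), p=p_b)
    step_ratios = p_e[actions] / p_b[actions]       # rho_t for each step
    traj_ratios = step_ratios.prod(axis=1)          # rho_{0:T} for each trajectory
    print(f"T={horizon:3d}  mean={traj_ratios.mean():.3f}  "
          f"median={np.median(traj_ratios):.2e}  "
          f"max={traj_ratios.max():.3g}  var={traj_ratios.var():.3g}")
```

At short horizons the sample mean of the weights sits near 1, consistent with unbiasedness, but as the horizon grows the median collapses toward zero, the maximum weight explodes, the empirical variance climbs rapidly, and even the sample mean becomes noisy, so any IS estimate ends up dominated by a handful of trajectories.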
The variance problem is exacerbated by the relationship between $\pi_b$ and $\pi_e$. If $\pi_e$ frequently prefers actions that $\pi_b$ selected only rarely, individual per-step ratios become very large; and if $\pi_b$ never took actions that $\pi_e$ would choose in some states, those state-action pairs are simply missing from the dataset, so no amount of re-weighting can account for them. Reliable IS estimation therefore requires coverage: $\pi_b(a \mid s) > 0$ wherever $\pi_e(a \mid s) > 0$.
A practical limitation is that standard IS requires explicit knowledge of the probabilities $\pi_b(a_t \mid s_t)$ for all state-action pairs encountered in the dataset. In many real-world scenarios where offline data is collected (e.g., logs from a deployed system, human demonstrations), the exact policy that generated the data is unknown. One might try to estimate $\pi_b$ from the data (e.g., using behavior cloning), but this introduces another layer of approximation and potential error into the OPE process, potentially biasing the IS estimate.
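If one does go down that route, a behavior-cloning estimate of $\pi_b$ can be as simple as a probabilistic classifier over actions. The sketch below is one possible version, assuming discrete actions and fixed-length state vectors, and using scikit-learn's logistic regression purely as an illustrative model choice; the resulting $\hat{\pi}_b$ would then be plugged into the ratios above, with the caveat that any fitting error propagates into the OPE estimate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_behavior_policy(states, actions):
    """Behavior-cloning estimate of pi_b(a | s) with a logistic-regression classifier.

    states:  array of shape (n_samples, state_dim) with the states in the dataset.
    actions: integer array of shape (n_samples,) with the actions taken by pi_b.
    Returns a callable pi_b_hat(a, s) compatible with the IS sketches above.
    """
    clf = LogisticRegression(max_iter=1000)
    clf.fit(states, actions)

    def pi_b_hat(a, s):
        # Predicted probability of action a in state s under the fitted model.
        probs = clf.predict_proba(np.asarray(s, dtype=float).reshape(1, -1))[0]
        idx = int(np.where(clf.classes_ == a)[0][0])
        return probs[idx]

    return pi_b_hat
```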
Several variants of IS have been proposed to mitigate the variance issue, although none solve it completely:
Weighted Importance Sampling (WIS): Instead of a simple average, WIS normalizes the weighted returns by the sum of the weights:

$$\hat{J}_{WIS}(\pi_e) = \frac{\sum_{i=1}^{N} \rho_{0:T}^{(i)} R(\tau_i)}{\sum_{i=1}^{N} \rho_{0:T}^{(i)}}$$

WIS is a biased estimator (the expectation is not exactly $J(\pi_e)$ for finite $N$, though it's consistent), but it often exhibits significantly lower variance than ordinary IS, especially when weights vary greatly. It effectively down-weights the influence of outlier trajectories with huge ratios.
Per-Decision Importance Sampling (PDIS): PDIS aims to reduce variance by applying importance correction to each reward only up to the time step at which it is received, rather than weighting every reward by the full trajectory ratio. For estimating the expected return, a common form uses discounted, per-step importance ratios:

$$\hat{J}_{PDIS}(\pi_e) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \gamma^t \rho_{0:t}^{(i)} r_t^{(i)}$$

While potentially offering variance reduction in certain settings compared to standard IS, PDIS still suffers from the compounding product of ratios at later time steps. A combined sketch of the WIS and PDIS computations follows below.
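The sketch below computes both variants in one pass, reusing the trajectory format and policy callables assumed in the ordinary IS sketch above; as before, the interface is an illustrative assumption rather than a standard API.

```python
import numpy as np

def wis_and_pdis_estimates(trajectories, pi_e, pi_b, gamma=0.99):
    """Weighted IS and per-decision IS estimates of J(pi_e) from the same dataset."""
    weighted_returns, weights, pdis_sums = [], [], []
    for traj in trajectories:
        rho, ret, pdis = 1.0, 0.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_e(a, s) / pi_b(a, s)   # rho_{0:t} after including step t
            ret += (gamma ** t) * r
            pdis += (gamma ** t) * rho * r   # reward r_t corrected only up to step t
        weighted_returns.append(rho * ret)   # rho_{0:T} * R(tau), as in ordinary IS
        weights.append(rho)
        pdis_sums.append(pdis)
    j_wis = float(np.sum(weighted_returns) / np.sum(weights))   # normalize by the weight sum
    j_pdis = float(np.mean(pdis_sums))                          # average of per-decision sums
    return j_wis, j_pdis
```

The only differences from ordinary IS are the normalization by the summed weights (WIS) and the use of the running ratio $\rho_{0:t}$ inside the time loop (PDIS).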
These variants can be helpful, but they do not fundamentally eliminate the exponential variance growth associated with long horizons or significant differences between $\pi_e$ and $\pi_b$.
In summary, while Importance Sampling provides a theoretically grounded way to perform Off-Policy Evaluation, its extreme variance and its sensitivity to policy mismatch and data coverage often make it unreliable for practical Offline RL. These limitations highlight the significant challenge posed by distributional shift, and they motivate the alternative OPE methods and offline learning algorithms discussed next, such as policy constraint and value regularization techniques, which explicitly aim to mitigate the problems caused by evaluating actions or states poorly represented in the offline dataset.