Importance Sampling (IS) emerges as a fundamental statistical technique for evaluating a new policy $\pi_e$ using only data collected by a potentially different behavior policy $\pi_b$. The core idea is elegant: re-weight the returns observed in the offline dataset to correct for the mismatch between the policies. If a trajectory was more likely under the evaluation policy $\pi_e$ than under the behavior policy $\pi_b$, its return should be weighted more heavily, and vice versa.
Imagine you have a dataset $\mathcal{D} = \{\tau_1, \tau_2, \dots, \tau_N\}$ consisting of $N$ trajectories, where each trajectory $\tau_i$ was generated by following the behavior policy $\pi_b$. Our goal is to estimate the expected return of the evaluation policy $\pi_e$, denoted as $J(\pi_e) = \mathbb{E}_{\tau \sim \pi_e}[G(\tau)]$, where $G(\tau) = \sum_{t=0}^{T-1} \gamma^t r_t$ is the cumulative discounted reward of trajectory $\tau$.
Importance Sampling allows us to estimate this expectation using samples from $\pi_b$ by multiplying the return of each trajectory by an "importance ratio" that quantifies the relative probability of that trajectory occurring under $\pi_e$ versus $\pi_b$.
For a given trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_{T-1}, a_{T-1}, r_{T-1})$, the probability of observing that sequence of states and actions under a policy $\pi$ (assuming deterministic environment transitions for simplicity, though it extends to stochastic ones) is:

$$P(\tau \mid \pi) = p(s_0) \prod_{t=0}^{T-1} \pi(a_t \mid s_t)$$

The importance ratio for the entire trajectory is the ratio of these probabilities under $\pi_e$ and $\pi_b$:

$$\rho(\tau) = \frac{P(\tau \mid \pi_e)}{P(\tau \mid \pi_b)} = \prod_{t=0}^{T-1} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}$$

Because the initial state distribution and the transition dynamics appear in both numerator and denominator, they cancel, which is why the same ratio applies in the stochastic case. Let $\rho_t = \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}$ be the per-step importance ratio. Then the full trajectory ratio is $\rho(\tau) = \prod_{t=0}^{T-1} \rho_t$.
The standard (or "ordinary") Importance Sampling estimator for $J(\pi_e)$ using $N$ trajectories from $\pi_b$ is:

$$\hat{J}_{IS}(\pi_e) = \frac{1}{N} \sum_{i=1}^{N} \rho(\tau_i)\, G(\tau_i)$$

This estimator is unbiased, meaning $\mathbb{E}_{\tau \sim \pi_b}[\hat{J}_{IS}(\pi_e)] = J(\pi_e)$, which is theoretically appealing. However, its practical utility in offline RL is severely limited by several factors, primarily its potentially enormous variance.
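This estimator is straightforward to implement once per-step action probabilities under both policies are available. Below is a minimal NumPy sketch; the trajectory format (lists of states, actions, and rewards) and the `policy.prob(action, state)` interface are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def trajectory_return(rewards, gamma=0.99):
    """Cumulative discounted return G(tau) of one trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def trajectory_ratio(states, actions, pi_e, pi_b):
    """Product of per-step ratios rho_t = pi_e(a|s) / pi_b(a|s)."""
    ratios = [pi_e.prob(a, s) / pi_b.prob(a, s) for s, a in zip(states, actions)]
    return np.prod(ratios)

def ordinary_is_estimate(dataset, pi_e, pi_b, gamma=0.99):
    """Ordinary IS estimate of J(pi_e) from trajectories collected under pi_b.

    `dataset` is assumed to be a list of (states, actions, rewards) tuples.
    """
    weighted_returns = [
        trajectory_ratio(states, actions, pi_e, pi_b) * trajectory_return(rewards, gamma)
        for states, actions, rewards in dataset
    ]
    return np.mean(weighted_returns)
```

In practice the per-step ratios are often accumulated in log space to avoid numerical overflow or underflow in the product over long horizons.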
The unbiasedness of the IS estimator comes at a steep price: variance. The variance of $\hat{J}_{IS}(\pi_e)$ depends on the second moment of the weighted returns, specifically involving $\mathbb{E}_{\tau \sim \pi_b}\!\left[\rho(\tau)^2 G(\tau)^2\right]$. The critical term here is the squared trajectory importance ratio, $\rho(\tau)^2$.
Consider the structure of $\rho(\tau) = \prod_{t=0}^{T-1} \rho_t$: it is a product of potentially many per-step ratios. Even if each $\rho_t$ deviates only modestly from 1, the variability of the product compounds multiplicatively, so the variance of the estimate can grow exponentially with the horizon $T$.
This exponential increase in variance with the horizon length is a major barrier to applying basic IS effectively in typical RL problems, which often involve long sequences of decisions.
Illustration: as the horizon length increases, the distribution of importance weights becomes heavily skewed, with many weights close to zero and a few extremely large weights, leading to high variance in the IS estimate.
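This skew can be reproduced with a short synthetic experiment. The two Bernoulli policies over a single binary action and the specific probabilities below are purely illustrative assumptions; the point is only how the spread of the trajectory weights $\rho(\tau)$ grows with the horizon.

```python
import numpy as np

rng = np.random.default_rng(0)
p_e, p_b = 0.9, 0.7          # illustrative probabilities of action 1 under pi_e and pi_b
n_trajectories = 10_000

for horizon in [5, 20, 50, 100]:
    # Sample actions from the behavior policy and form per-step ratios.
    actions = rng.random((n_trajectories, horizon)) < p_b           # True = action 1
    step_ratios = np.where(actions, p_e / p_b, (1 - p_e) / (1 - p_b))
    weights = step_ratios.prod(axis=1)                               # rho(tau)
    print(f"H={horizon:3d}  mean={weights.mean():8.2f}  "
          f"max={weights.max():12.1f}  var={weights.var():14.1f}")
```

Even though the sample mean of the weights stays near 1 (reflecting unbiasedness), the maximum weight and the empirical variance grow rapidly with the horizon.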
The variance problem is exacerbated by the nature of the behavior policy $\pi_b$ and its relationship to the evaluation policy $\pi_e$. If $\pi_b$ rarely selects the actions that $\pi_e$ prefers (poor coverage of the relevant state-action space), the corresponding ratios become extremely large, and if $\pi_b$ assigns zero probability to an action $\pi_e$ would take, the ratio is undefined altogether.
"A practical limitation is that standard IS requires explicit knowledge of the probabilities for all state-action pairs encountered in the dataset. In many scenarios where offline data is collected (e.g., logs from a deployed system, human demonstrations), the exact policy that generated the data is unknown. One might try to estimate from the data (e.g., using behavior cloning), but this introduces another layer of approximation and potential error into the OPE process, potentially biasing the IS estimate."
Several variants of IS have been proposed to mitigate the variance issue, although none solve it completely:
Weighted Importance Sampling (WIS): Instead of a simple average, WIS normalizes the weighted returns by the sum of the weights:

$$\hat{J}_{WIS}(\pi_e) = \frac{\sum_{i=1}^{N} \rho(\tau_i)\, G(\tau_i)}{\sum_{i=1}^{N} \rho(\tau_i)}$$

WIS is a biased estimator (its expectation is not exactly $J(\pi_e)$ for finite $N$, though it is consistent), but it often exhibits significantly lower variance than ordinary IS, especially when weights vary greatly. It effectively down-weights the influence of outlier trajectories with huge ratios.
Per-Decision Importance Sampling (PDIS): PDIS aims to reduce variance by applying importance correction only up to the time step of interest when estimating state or state-action values. For estimating the expected return, a common form weights each reward by the cumulative importance ratio up to that time step:

$$\hat{J}_{PDIS}(\pi_e) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \gamma^t \left( \prod_{t'=0}^{t} \rho_{t'}^{(i)} \right) r_t^{(i)}$$

While potentially offering variance reduction in certain settings compared to standard IS, PDIS still suffers from the compounding product of ratios for later rewards (a sketch of both variants follows below).
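To make the two variants concrete, here is a minimal sketch of both estimators under the same illustrative trajectory format and `prob(action, state)` policy interface assumed in the earlier code.

```python
import numpy as np

def weighted_is_estimate(dataset, pi_e, pi_b, gamma=0.99):
    """WIS: normalize by the sum of trajectory weights instead of N."""
    weights, returns = [], []
    for states, actions, rewards in dataset:
        rho = np.prod([pi_e.prob(a, s) / pi_b.prob(a, s)
                       for s, a in zip(states, actions)])
        weights.append(rho)
        returns.append(sum(gamma**t * r for t, r in enumerate(rewards)))
    weights, returns = np.array(weights), np.array(returns)
    return np.sum(weights * returns) / np.sum(weights)

def per_decision_is_estimate(dataset, pi_e, pi_b, gamma=0.99):
    """PDIS: each reward r_t is weighted only by the ratios up to step t."""
    estimates = []
    for states, actions, rewards in dataset:
        step_ratios = np.array([pi_e.prob(a, s) / pi_b.prob(a, s)
                                for s, a in zip(states, actions)])
        cumulative_ratios = np.cumprod(step_ratios)      # prod_{t' <= t} rho_t'
        discounts = gamma ** np.arange(len(rewards))
        estimates.append(np.sum(discounts * cumulative_ratios * np.array(rewards)))
    return np.mean(estimates)
```

Because the WIS estimate is a convex combination of the observed returns (the weights are non-negative and sum to the normalizer), it always stays within the range of returns actually seen in the dataset, which is a large part of why it behaves more stably than ordinary IS.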
These variants can be helpful, but they do not fundamentally eliminate the exponential variance growth associated with long horizons or significant differences between $\pi_e$ and $\pi_b$.
In summary, while Importance Sampling provides a theoretically grounded way to perform Off-Policy Evaluation, its extreme variance and sensitivity to policy mismatch and data coverage often make it unreliable for practical Offline RL. These limitations highlight the significant challenge posed by distributional shift. They also motivate the development of alternative OPE methods and offline learning algorithms, such as the policy constraint and value regularization techniques we will discuss next, which explicitly aim to mitigate the problems caused by evaluating actions or states poorly represented in the offline dataset.