Okay, let's begin by considering a common scenario in offline reinforcement learning. You've used your static dataset $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$, collected under some behavior policy $\pi_b$, to train a new target policy $\pi_e$. Before deploying $\pi_e$ in a real-world system (which could be expensive, risky, or simply not feasible during development), you need a way to estimate its expected performance. How much cumulative reward can you expect if the agent follows $\pi_e$? This is the core question addressed by Off-Policy Evaluation (OPE) in the offline setting.
Directly evaluating $\pi_e$ by running it isn't an option here. We must rely solely on the data $\mathcal{D}$ generated by $\pi_b$. The fundamental challenge, as introduced earlier, is the potential distributional shift between $\pi_b$ and $\pi_e$. The states and actions encountered in the dataset $\mathcal{D}$ might not accurately reflect those that would be encountered under $\pi_e$. Simple averaging of rewards from the dataset won't give an unbiased estimate of $\pi_e$'s performance unless $\pi_e$ happens to be identical to $\pi_b$.
The most foundational technique for OPE is Importance Sampling (IS). The idea comes from a general statistical principle: we can estimate the expectation of a function under one distribution ($p$) using samples drawn from another distribution ($q$), provided we re-weight the samples appropriately. Specifically, if $x$ is sampled from $q(x)$, then:
$$\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\left[\frac{p(x)}{q(x)} f(x)\right]$$
The term $p(x)/q(x)$ is known as the importance weight or importance ratio.
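As a quick illustration of this principle outside of RL, here is a minimal NumPy sketch (the particular Gaussians chosen for $p$ and $q$ are arbitrary, purely for illustration) that estimates $\mathbb{E}_{x \sim p}[x^2]$ using only samples drawn from $q$:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)

# Target p = N(1, 1), sampling distribution q = N(0, 2).
# Estimate E_{x~p}[x^2] using only samples from q.
x = rng.normal(loc=0.0, scale=2.0, size=100_000)             # x ~ q
weights = gauss_pdf(x, 1.0, 1.0) / gauss_pdf(x, 0.0, 2.0)    # p(x) / q(x)
print(np.mean(weights * x**2))                               # close to the true value 2.0
```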
In the context of RL, we want to estimate the expected total discounted return of the evaluation policy $\pi_e$, denoted $J(\pi_e)$. A trajectory (or episode) is a sequence $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_{T-1}, a_{T-1}, r_{T-1})$. The probability of observing a specific trajectory $\tau$ under a policy $\pi$ is given by the product of the initial state probability $p(s_0)$, the transition probabilities $p(s_{t+1} \mid s_t, a_t)$, and the policy probabilities $\pi(a_t \mid s_t)$:
$$P(\tau \mid \pi) = p(s_0) \prod_{t=0}^{T-1} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
Let $R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_t$ be the total discounted return for trajectory $\tau$. We want to compute $J(\pi_e) = \mathbb{E}_{\tau \sim \pi_e}[R(\tau)]$. Using Importance Sampling, we can estimate this using trajectories sampled under the behavior policy $\pi_b$:
$$J(\pi_e) = \mathbb{E}_{\tau \sim \pi_b}\left[\frac{P(\tau \mid \pi_e)}{P(\tau \mid \pi_b)} R(\tau)\right]$$
Notice that the environment dynamics $p(s_{t+1} \mid s_t, a_t)$ and the initial state distribution $p(s_0)$ cancel out in the ratio:
$$\frac{P(\tau \mid \pi_e)}{P(\tau \mid \pi_b)} = \frac{p(s_0) \prod_{t=0}^{T-1} \pi_e(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)}{p(s_0) \prod_{t=0}^{T-1} \pi_b(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)} = \prod_{t=0}^{T-1} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}$$
This product is the trajectory importance ratio, often denoted $\rho_\tau$:
$$\rho_\tau = \prod_{t=0}^{T-1} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}$$
Given a dataset $\mathcal{D}$ containing $N$ trajectories $\{\tau_i\}_{i=1}^{N}$ collected using $\pi_b$, the standard Importance Sampling estimator for $J(\pi_e)$ is:
$$\hat{J}_{IS}(\pi_e) = \frac{1}{N} \sum_{i=1}^{N} \rho_{\tau_i} R(\tau_i)$$
This estimator is unbiased, meaning its expected value is equal to the true value $J(\pi_e)$, assuming we know $\pi_b$ exactly and that $\pi_b(a_t \mid s_t) > 0$ whenever $\pi_e(a_t \mid s_t) > 0$.
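A minimal sketch of this estimator is shown below, assuming trajectories are stored as lists of $(s, a, r)$ tuples and that both policies expose a hypothetical `prob(action, state)` method returning $\pi(a \mid s)$:

```python
import numpy as np

def is_estimate(trajectories, pi_e, pi_b, gamma=0.99):
    """Basic trajectory-wise importance sampling estimate of J(pi_e)."""
    values = []
    for traj in trajectories:                          # traj: list of (s, a, r) tuples
        rho, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_e.prob(a, s) / pi_b.prob(a, s)   # trajectory ratio rho_tau
            ret += gamma**t * r                        # discounted return R(tau)
        values.append(rho * ret)
    return np.mean(values)
```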
While theoretically sound, the basic IS estimator often suffers from extremely high variance in practice. The trajectory importance ratio $\rho_\tau$ is a product over the length of the trajectory. If the policies $\pi_e$ and $\pi_b$ differ even slightly at each step, these differences multiply. For long trajectories (large $T$), the ratio $\rho_\tau$ can easily become astronomically large or vanishingly small.
Consider a trajectory of length $T = 100$. If, at each step, the probability ratio $\pi_e(a_t \mid s_t) / \pi_b(a_t \mid s_t)$ is just 1.1, the final trajectory ratio $\rho_\tau$ will be $(1.1)^{100} \approx 13{,}780$. If the ratio is 1.2, then $\rho_\tau = (1.2)^{100} \approx 8.3 \times 10^{7}$. Conversely, if the ratio is 0.9, $\rho_\tau = (0.9)^{100} \approx 2.7 \times 10^{-5}$.
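You can verify this compounding in a couple of lines:

```python
# Compounding of a constant per-step ratio over a horizon of T = 100 steps.
for step_ratio in (1.1, 1.2, 0.9):
    print(step_ratio, step_ratio ** 100)
# 1.1 -> ~1.4e4,  1.2 -> ~8.3e7,  0.9 -> ~2.7e-5
```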
This means the estimate $\hat{J}_{IS}(\pi_e)$ can be dominated by a tiny fraction of trajectories with enormous weights, making the estimate highly unreliable and unstable. A single dataset might yield a wildly different estimate than another dataset collected under the exact same conditions.
Distribution of Importance Sampling (IS) weights compared to a conceptual view of Weighted Importance Sampling (WIS) weights. High variance in basic IS often manifests as a long tail of very large weights, while WIS tends to produce more concentrated weights.
Another significant requirement for basic IS is knowing the behavior policy $\pi_b(a \mid s)$ precisely for all state-action pairs in the dataset. In many practical scenarios, the policy used to collect the data might be unknown, or only approximations might be available, adding another layer of potential error.
Several variants of IS have been developed to mitigate the variance issue, often at the cost of introducing some bias:
Per-Decision Importance Sampling (PDIS): Instead of calculating one ratio for the entire trajectory, PDIS calculates cumulative ratios step-by-step and applies them to the reward received at that step. The estimator looks like:
$$\hat{J}_{PDIS}(\pi_e) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i - 1} \gamma^t \left( \prod_{k=0}^{t} \frac{\pi_e(a_{i,k} \mid s_{i,k})}{\pi_b(a_{i,k} \mid s_{i,k})} \right) r_{i,t}$$
PDIS often exhibits lower variance than the standard IS estimator because the weights applied to earlier rewards don't involve products over the entire trajectory length.
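Under the same hypothetical trajectory format and policy interface as the IS sketch above, PDIS might look like:

```python
import numpy as np

def pdis_estimate(trajectories, pi_e, pi_b, gamma=0.99):
    """Per-decision importance sampling estimate of J(pi_e)."""
    values = []
    for traj in trajectories:                          # traj: list of (s, a, r) tuples
        rho, total = 1.0, 0.0                          # rho: cumulative ratio up to step t
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_e.prob(a, s) / pi_b.prob(a, s)
            total += gamma**t * rho * r                # only the ratios up to step t weight r_t
        values.append(total)
    return np.mean(values)
```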
Weighted Importance Sampling (WIS): This is perhaps the most common practical variant. Instead of a simple average, WIS uses the importance weights to normalize the estimate:
$$\hat{J}_{WIS}(\pi_e) = \frac{\sum_{i=1}^{N} \rho_{\tau_i} R(\tau_i)}{\sum_{i=1}^{N} \rho_{\tau_i}}$$
Alternatively, using per-decision weights:
$$\hat{J}_{WPDIS}(\pi_e) = \sum_{i=1}^{N} \sum_{t=0}^{T_i - 1} \frac{w_{i,t}}{\sum_{j=1}^{N} \sum_{k=0}^{T_j - 1} w_{j,k}}\, \gamma^t r_{i,t}, \quad \text{where} \quad w_{i,t} = \prod_{k=0}^{t} \frac{\pi_e(a_{i,k} \mid s_{i,k})}{\pi_b(a_{i,k} \mid s_{i,k})}$$
(Note: different normalization schemes exist for WPDIS.) WIS estimators are biased (because the denominator is random), but they typically have drastically lower variance than their unweighted counterparts. In practice, the reduction in variance often outweighs the introduction of bias.
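The trajectory-wise WIS estimator is a one-line change to the basic IS sketch: divide by the sum of the weights rather than by $N$ (same hypothetical interface as before):

```python
import numpy as np

def wis_estimate(trajectories, pi_e, pi_b, gamma=0.99):
    """Weighted (self-normalized) importance sampling estimate of J(pi_e)."""
    weights, returns = [], []
    for traj in trajectories:                          # traj: list of (s, a, r) tuples
        rho, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_e.prob(a, s) / pi_b.prob(a, s)
            ret += gamma**t * r
        weights.append(rho)
        returns.append(ret)
    weights = np.asarray(weights)
    # Normalizing by the weight sum introduces bias but greatly reduces variance.
    return np.sum(weights * np.asarray(returns)) / np.sum(weights)
```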
Doubly Robust (DR) Estimators: These methods combine a learned model of the environment's dynamics or value functions with importance sampling. They typically take the form:
$$\hat{J}_{DR}(\pi_e) \approx \text{Model Estimate} + \text{IS Correction}$$
The idea is that the estimator will be accurate if either the learned model is accurate or the importance weights are accurate (specifically, unbiased). This provides some robustness against errors in either component. A common form uses a learned Q-function $\hat{Q}(s, a)$ and the corresponding value function $\hat{V}(s) = \mathbb{E}_{a \sim \pi_e}[\hat{Q}(s, a)]$:
$$\hat{J}_{DR}(\pi_e) = \frac{1}{N} \sum_{i=1}^{N} \left[ \hat{V}(s_{i,0}) + \sum_{t=0}^{T_i - 1} \gamma^t \rho_{i,t} \left( r_{i,t} + \gamma \hat{V}(s_{i,t+1}) - \hat{Q}(s_{i,t}, a_{i,t}) \right) \right]$$
where $\rho_{i,t} = \prod_{k=0}^{t} \frac{\pi_e(a_{i,k} \mid s_{i,k})}{\pi_b(a_{i,k} \mid s_{i,k})}$. Here $\hat{V}(s_{i,0})$ plays the role of the model estimate, and the inner sum is the importance-weighted correction based on the observed rewards. While potentially offering lower variance and bias, DR estimators require fitting an additional model ($\hat{Q}$ or $\hat{V}$), adding complexity.
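A sketch of the DR estimator in the form above, assuming hypothetical `q_hat(s, a)` and `v_hat(s)` callables for the learned value estimates and trajectories stored as $(s, a, r, s')$ tuples:

```python
import numpy as np

def dr_estimate(trajectories, pi_e, pi_b, q_hat, v_hat, gamma=0.99):
    """Doubly robust estimate of J(pi_e): model baseline plus an IS correction."""
    values = []
    for traj in trajectories:                          # traj: list of (s, a, r, s_next) tuples
        total = v_hat(traj[0][0])                      # model-based estimate V_hat(s_0)
        rho = 1.0
        for t, (s, a, r, s_next) in enumerate(traj):
            rho *= pi_e.prob(a, s) / pi_b.prob(a, s)
            # Importance-weighted correction of the model's error at step t.
            # (For a terminal s_next, v_hat should return 0.)
            total += gamma**t * rho * (r + gamma * v_hat(s_next) - q_hat(s, a))
        values.append(total)
    return np.mean(values)
```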
Despite these refinements, OPE in the offline setting remains challenging: variance still grows quickly with the horizon, the behavior policy may be unknown or only approximately known, and every estimator is limited by how well the dataset covers the states and actions that $\pi_e$ would actually visit.
Off-policy evaluation techniques, particularly WIS and its variants, are valuable tools for getting an estimate of policy performance from offline data. However, their reliability is directly tied to the quality and coverage of the dataset and the similarity between the behavior and evaluation policies. The high variance and sensitivity of IS methods underscore the difficulties of learning purely offline and motivate the development of algorithms, discussed next, that explicitly constrain the learned policy or regularize value estimates to account for the limitations of the available data.