As we explore Monte Carlo methods, which learn from sampled episodes of experience, a fundamental question arises: which policy are we actually learning about? The answer distinguishes between two major categories of RL algorithms: on-policy and off-policy learning.
Understanding this distinction matters because it determines how we can gather experience and what we can learn from it. It influences algorithm design, stability, and flexibility, particularly when balancing exploration (trying new actions) and exploitation (using known good actions).
In on-policy methods, the agent learns about the value function or policy it is currently following. Think of it as learning "on the job." The policy used to generate the interaction data (the episodes) is the same policy that the agent is trying to evaluate and improve.
Let's say the agent is following a policy π. It interacts with the environment using actions chosen according to π. It collects episodes like (S0,A0,R1,S1,A1,R2,...,ST). The returns calculated from these episodes are then used to estimate Vπ(s) or Qπ(s,a), the value functions for this specific policy π. Subsequently, π is improved based on these updated value estimates, perhaps by making it more greedy with respect to the learned action values. The next batch of experience is then generated using this improved policy.
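To make this loop concrete, here is a minimal sketch of the evaluation step, assuming episodes arrive as lists of (state, action, reward) tuples and that the action-value estimates and visit counts live in plain dictionaries. The function name and episode format are illustrative assumptions, not fixed notation.

```python
from collections import defaultdict

def first_visit_mc_update(episode, Q, visit_counts, gamma=1.0):
    """On-policy evaluation: fold one episode generated by pi into Q ~ Q_pi.

    episode: list of (state, action, reward) tuples in time order, where
             reward is the reward received after taking that action.
    Q, visit_counts: defaultdicts keyed by (state, action), updated in place.
    """
    G = 0.0
    # Walk backwards so G accumulates the discounted return from each step.
    for t in reversed(range(len(episode))):
        state, action, reward = episode[t]
        G = reward + gamma * G
        # First-visit check: only the earliest occurrence of (s, a) counts.
        if not any((s, a) == (state, action) for s, a, _ in episode[:t]):
            visit_counts[(state, action)] += 1
            # Incremental average of the returns observed for (s, a).
            Q[(state, action)] += (G - Q[(state, action)]) / visit_counts[(state, action)]

# Illustrative containers:
Q = defaultdict(float)
visit_counts = defaultdict(int)
```

Because the episode was generated by π itself, simply averaging these returns is enough: no correction is needed, which is what makes on-policy learning comparatively straightforward.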
Characteristics of On-Policy Learning:
A typical example within Monte Carlo is On-Policy First-Visit MC Control, which we will detail shortly. It uses returns generated by an ϵ-soft policy to improve that same ϵ-soft policy.
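As a preview of that algorithm, the improvement step can be sketched as making the policy ϵ-greedy with respect to the current estimates. The sketch below assumes the policy for a state is represented as a dictionary of action probabilities; this representation is an illustrative choice, not the only one.

```python
import numpy as np

def epsilon_soft_improvement(Q, state, actions, epsilon=0.1):
    """Return an epsilon-soft action distribution for `state`, greedy with
    respect to the current Q estimates.

    Every action keeps probability at least epsilon / |A|, so the improved
    policy continues to explore and to generate the very data it will later
    be evaluated on: the on-policy loop.
    """
    values = np.array([Q[(state, a)] for a in actions])
    probs = np.full(len(actions), epsilon / len(actions))
    probs[int(np.argmax(values))] += 1.0 - epsilon
    return dict(zip(actions, probs))
```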
In off-policy methods, the agent learns about a target policy π using data generated from following a different behavior policy μ. Think of this as learning how to perform a task optimally (π) by observing someone else, possibly less skilled or more exploratory (μ), perform it.
The target policy π is usually the policy the agent ultimately wants to learn (e.g., the deterministic greedy policy that always chooses the action with the highest estimated value). The behavior policy μ, however, is the one used to actually interact with the environment and generate episodes. μ might be more exploratory than π to ensure it gathers data about a wider range of actions and states.
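A small sketch of what this separation might look like in code, using a deliberately simple behavior policy (uniform random) alongside a greedy target policy. Both function names are hypothetical and chosen only to illustrate the roles.

```python
import random

def behavior_action(state, actions):
    """Behavior policy mu: deliberately exploratory. Here it simply picks
    uniformly at random, so every action keeps being tried."""
    return random.choice(actions)

def target_action(Q, state, actions):
    """Target policy pi: greedy with respect to the current estimates.
    This is the policy we ultimately want to evaluate and improve, even
    though it never generates the data itself."""
    return max(actions, key=lambda a: Q[(state, a)])
```

Episodes would be generated by repeatedly calling behavior_action, while the learning updates aim at the values of the policy defined by target_action.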
Why use Off-Policy Learning?
The Challenge: Since the data comes from μ, not π, the distribution of states visited and actions taken doesn't directly reflect what would happen under π. Actions preferred by π might be rare in the data generated by μ, and vice versa. Directly averaging returns from μ's episodes would yield estimates for Vμ or Qμ, not Vπ or Qπ.
The Solution (Conceptual): Off-policy methods need a way to correct for this mismatch. They often use techniques like Importance Sampling, which involves weighting the observed returns based on the relative probability of the experienced trajectory occurring under the target policy π versus the behavior policy μ. This adjusts the contribution of each episode to account for the difference in policies. We'll touch upon Importance Sampling in the context of Off-Policy MC later in this chapter.
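The sketch below shows the ordinary importance-sampling form of this correction for a single episode. Here target_probs and behavior_probs are assumed helper functions returning π(a|s) and μ(a|s), and the episode format matches the earlier sketches; treat it as a sketch of the idea rather than the full algorithm presented later.

```python
def importance_weighted_return(episode, target_probs, behavior_probs, gamma=1.0):
    """Weight the return of one episode generated by mu so that, on average,
    it estimates the value of the start state under the target policy pi.

    target_probs(s, a): probability pi assigns to action a in state s.
    behavior_probs(s, a): probability mu assigns to the same action; it must
    be positive for every action pi might take (the coverage requirement).
    """
    G = 0.0
    rho = 1.0  # importance-sampling ratio for the whole trajectory
    for t, (state, action, reward) in enumerate(episode):
        rho *= target_probs(state, action) / behavior_probs(state, action)
        G += (gamma ** t) * reward
    # Ordinary importance sampling: one weighted sample of the return under pi.
    return rho * G
```

If π is deterministic and the episode contains any action π would not have taken, the ratio (and therefore the weighted return) is zero; ratios like this are also the main source of the variance mentioned in the comparison table below.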
Characteristics of Off-Policy Learning:
Q-learning, which we will encounter in the next chapter on Temporal-Difference learning, is a well-known off-policy algorithm.
Diagram comparing On-Policy and Off-Policy learning flows. On-policy uses the same policy (π) for interaction and learning. Off-policy uses a behavior policy (μ) for interaction to learn about a different target policy (π), requiring a correction step.
| Feature | On-Policy Learning | Off-Policy Learning |
|---|---|---|
| Data Source | Policy being learned (π) | Separate behavior policy (μ) |
| Target Policy | Same as the data-generating policy (π) | Different from the behavior policy (π, often greedy) |
| Exploration | Built into the learning policy (π) | Handled by the behavior policy (μ) |
| Key Challenge | Balancing exploration and exploitation | Correcting for the policy mismatch (can add variance) |
| Flexibility | Lower (learns about what it does) | Higher (can learn an optimal policy while exploring, or reuse historical data) |
| Example | On-Policy MC, SARSA | Off-Policy MC, Q-Learning |
In essence, the choice between on-policy and off-policy methods depends on the problem requirements. If you need to learn the value of the specific behavior strategy being used (including its exploration), on-policy is suitable. If you want to learn an optimal policy regardless of the exploration strategy used during data collection, or if you want to learn from data generated by a different agent or policy, off-policy methods provide the necessary mechanisms. As we proceed with Monte Carlo control, we'll see how this distinction plays out in algorithm design.