In our initial look at Monte Carlo Control, we assumed the ability to use "exploring starts." This assumption means that every possible state-action pair (s,a) has a non-zero chance of being the starting point of an episode. By starting episodes across the full range of possibilities, we could guarantee that our agent eventually gathered experience (and calculated returns) for all actions in all relevant states, allowing it to learn the optimal action-value function Q∗(s,a).
However, the exploring starts assumption is often unrealistic. In many simulations or physical systems, we don't have the luxury of initiating an episode from any arbitrary state-action pair. Consider training a robot to navigate a maze: you can probably reset it to the official start state, but you likely can't force it to begin mid-maze having just taken a specific turn. Similarly, in games, you usually start from a predefined initial state. If we only ever start from one state and always follow our current best guess for the policy (acting greedily), we might never try certain actions in certain states, leaving potential improvements undiscovered. We might find a locally optimal policy without ever realizing a much better one exists, reachable only through actions we never explored.
How can we ensure sufficient exploration to find the optimal policy if we cannot rely on exploring starts? The answer lies in modifying the policy itself to incorporate exploration. We'll focus here on on-policy methods, where the agent learns about the same policy it uses to make decisions.
The core idea is simple: instead of always choosing the action believed to be best (the greedy action), the agent should occasionally choose an action at random. This ensures that, over time, all actions get sampled from all visited states.
A common approach to achieve this in on-policy learning is to use ϵ-soft policies. An ϵ-soft policy is one that assigns a minimum probability, ϵ/∣A(s)∣, to selecting any action a in state s, where ∣A(s)∣ is the number of available actions in that state and ϵ is a small positive number (e.g., 0.1).
Specifically, an ϵ-soft policy satisfies π(a∣s) ≥ ϵ/∣A(s)∣ for every action a in every state s. The most common example is the ϵ-greedy policy, which places the bulk of the probability mass on the greedy action and spreads the remaining ϵ uniformly across all actions. Mathematically, the policy π(a∣s) is defined as:

$$
\pi(a \mid s) =
\begin{cases}
1 - \epsilon + \dfrac{\epsilon}{|A(s)|} & \text{if } a = \arg\max_{a'} Q(s, a') \\[6pt]
\dfrac{\epsilon}{|A(s)|} & \text{otherwise}
\end{cases}
$$

This ensures that π(a∣s)>0 for all state-action pairs. The parameter ϵ controls the trade-off between exploration (trying random actions) and exploitation (choosing the best-known action): a higher ϵ means more exploration, a lower ϵ means more exploitation.
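As a rough sketch of how this definition translates into code, the function below builds the ϵ-greedy probability vector for a single state from a tabular Q. The dictionary layout of `Q`, the helper name, and the parameter names are illustrative assumptions rather than a fixed API:

```python
import numpy as np

def epsilon_soft_probabilities(Q, state, n_actions, epsilon=0.1):
    """Return pi(a|s) as defined above: epsilon/|A(s)| on every action,
    plus the remaining 1 - epsilon on the currently greedy action."""
    probs = np.full(n_actions, epsilon / n_actions)              # epsilon / |A(s)| everywhere
    values = [Q.get((state, a), 0.0) for a in range(n_actions)]  # Q assumed: dict of (s, a) -> value
    probs[int(np.argmax(values))] += 1.0 - epsilon               # extra mass on the greedy action
    return probs

# Illustrative usage: sampling from these probabilities gives epsilon-greedy behavior.
# rng = np.random.default_rng()
# action = rng.choice(n_actions, p=epsilon_soft_probabilities(Q, state, n_actions))
```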
How does this fit into our Monte Carlo control loop, which follows the pattern of Generalized Policy Iteration (GPI)? Both halves of GPI carry over; only the policies involved are now ϵ-soft:

- Policy evaluation: generate episodes by following the current ϵ-soft policy and average the observed returns to update the action-value estimates Q(s,a).
- Policy improvement: for each state visited, make the policy ϵ-greedy with respect to the updated Q, shifting probability toward the highest-valued action while keeping at least ϵ/∣A(s)∣ on every action.
This cycle of evaluation under an ϵ-soft policy and improvement toward the corresponding ϵ-greedy policy continues, driving the policy toward an optimal ϵ-soft policy: the best policy achievable under the constraint that it must keep exploring with probability ϵ. In other words, it is optimal among the class of ϵ-soft policies, though not necessarily identical to the fully greedy optimal policy.
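As a rough sketch of the improvement half of this cycle, the function below makes a tabular policy ϵ-greedy with respect to the latest Q estimates at every state visited in an episode. The layouts of `policy` and `Q`, the function name, and the parameter names are again illustrative assumptions:

```python
import numpy as np

def improve_policy(policy, Q, visited_states, n_actions, epsilon=0.1):
    """Policy improvement step: make pi epsilon-greedy w.r.t. the updated Q.

    Assumed layout: policy maps state -> probability vector over actions,
    and Q maps (state, action) pairs -> estimated return.
    """
    for s in visited_states:
        values = [Q.get((s, a), 0.0) for a in range(n_actions)]
        probs = np.full(n_actions, epsilon / n_actions)   # keep epsilon/|A(s)| on every action
        probs[int(np.argmax(values))] += 1.0 - epsilon    # remaining mass on the greedy action
        policy[s] = probs
```

Evaluation then continues by generating the next episode from these updated probabilities, so the agent never stops exploring.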
In practice, it's common to start with a larger ϵ (encouraging exploration early on) and gradually decrease it over time. If ϵ decays towards zero, the policy approaches the truly optimal policy π∗. However, guaranteeing convergence requires a carefully managed decay schedule and certain theoretical conditions: every state-action pair must continue to be explored infinitely often while the policy becomes greedy in the limit, a combination often summarized as GLIE (Greedy in the Limit with Infinite Exploration).
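For illustration, here are two possible decay schedules; the function names and constants are arbitrary choices, and the 1/k schedule is the textbook example of a GLIE-style decay (it goes to zero, but slowly enough that exploration continues throughout training):

```python
def glie_epsilon(k):
    """GLIE-style schedule: epsilon_k = 1/k decays toward zero while every
    state-action pair can still be tried infinitely often over the run."""
    return 1.0 / k  # k = 1, 2, 3, ... is the episode count

def floored_epsilon(k, eps_start=1.0, eps_min=0.05, decay=0.999):
    """Practical alternative: exponential decay with a floor. The policy then
    stays epsilon-soft forever instead of becoming greedy in the limit."""
    return max(eps_min, eps_start * decay ** (k - 1))
```

With either schedule, ϵ for episode k is computed before the episode is generated and passed to the ϵ-soft policy used during that episode.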
By using ϵ-soft policies, on-policy Monte Carlo methods can effectively learn optimal behavior even when exploring starts are not feasible, making them applicable to a wider range of problems where learning must occur solely through ongoing interaction. The next section details the algorithm for On-Policy First-Visit MC Control using this approach.