Alright, let's put the concepts of Monte Carlo prediction and the need for exploration together to create a control algorithm. Remember, our goal in RL control is not just to evaluate a given policy, but to find an optimal policy. In the context of MC methods, this means finding the policy that maximizes the expected return based on the experiences gathered from episodes.
We'll focus on an on-policy method, meaning the policy we are learning about (π) is the same policy used to generate the behavior (the episodes). Specifically, we'll detail the First-Visit MC Control algorithm. As the name suggests, it updates the value estimate for a state-action pair (s,a) only based on the return following the first time that pair was visited within an episode.
To improve a policy, we generally need estimates of action values, Qπ(s,a), rather than just state values, Vπ(s). Why? Because to decide if changing an action in a state s is beneficial, we need to know the expected return of taking alternative actions a′ from that state s. Knowing Vπ(s) alone isn't sufficient.
So, our MC prediction task shifts from estimating Vπ(s) to estimating Qπ(s,a). The core idea remains the same: average the returns observed after visiting a state-action pair. For First-Visit MC, we average the returns following the first occurrence of (s,a) in each episode.
The update rule for the Q-value estimate looks very similar to the one for V-values. After observing a return G following the first visit to (s,a) in an episode, we increment a visit counter and nudge the estimate toward the new return:

N(s,a) ← N(s,a) + 1
Q(s,a) ← Q(s,a) + (1/N(s,a)) [G − Q(s,a)]

This keeps Q(s,a) equal to the running average of all first-visit returns observed for that pair so far.
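As a minimal sketch, assuming Q and N are Python dictionaries keyed by (state, action) tuples (as suggested later in this section), the incremental form of this update could look like this; the function and argument names are illustrative:

```python
def update_q(Q, N, s, a, G):
    """Incremental first-visit MC update for one (state, action) pair.

    Q and N are dicts keyed by (state, action) tuples; G is the return
    observed after the first visit to (s, a) in an episode.
    """
    N[(s, a)] = N.get((s, a), 0) + 1
    q_old = Q.get((s, a), 0.0)
    # Running-average update: Q <- Q + (G - Q) / N
    Q[(s, a)] = q_old + (G - q_old) / N[(s, a)]
```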
This process gives us the policy evaluation part using MC. Now, how do we achieve policy improvement?
Similar to Dynamic Programming, we can improve the policy by making it greedy with respect to the current action-value estimates. For a given state s, the improved policy would choose the action a that maximizes Q(s,a):
π′(s) = argmax_a Q(s,a)

However, there's a significant catch if we only act greedily based on our current estimates. Monte Carlo methods learn from episodes generated by following the current policy. If the policy becomes deterministic and greedy early on, it might stop exploring certain actions altogether. If an action a is never chosen in state s, we will never get returns for the pair (s,a), and our estimate Q(s,a) will never improve. We might converge to a suboptimal policy simply because we didn't explore enough to find the truly optimal actions.
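To make the improvement step concrete (setting the exploration problem aside for a moment), here is a minimal sketch of extracting the greedy action for a single state from a dictionary-based Q table; the function and argument names are illustrative, not part of any fixed API:

```python
def greedy_action(Q, state, actions):
    """Return the action with the highest estimated value in `state`.

    Q is a dict keyed by (state, action) tuples; `actions` lists the
    actions available in this state. Purely greedy: no exploration.
    """
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```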
To balance exploration (trying new actions) and exploitation (using the best-known actions), we use strategies that ensure continued exploration. A common and simple approach is the ϵ-greedy policy.
An ϵ-greedy policy behaves greedily most of the time: with probability 1−ϵ it selects the action with the highest current Q-value estimate, and with probability ϵ it selects an action at random.
This ensures that, eventually, every state-action pair has a non-zero probability of being visited, allowing our Q-value estimates to continue improving for all actions. The policy we learn about and the policy we use to generate episodes are both this ϵ-greedy policy, hence the term "on-policy".
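A minimal sketch of ϵ-greedy action selection, again assuming a dictionary-based Q table and an explicit list of available actions (both illustrative assumptions):

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """Select an action epsilon-greedily w.r.t. the current Q estimates.

    Q is a dict keyed by (state, action) tuples; `actions` lists the
    actions available in `state`. Names here are illustrative.
    """
    if random.random() < epsilon:
        return random.choice(actions)                           # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploit
```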
Now we can combine MC estimation of Q-values with ϵ-greedy policy improvement. This follows the pattern of Generalized Policy Iteration (GPI), where evaluation and improvement steps interact.
Algorithm: On-Policy First-Visit MC Control (for ϵ-soft policies)
The interaction loop for On-Policy First-Visit MC Control. Episodes generated using the current ϵ-greedy policy are used to update Q-values (evaluation), which in turn are used to update the ϵ-greedy policy (improvement).
This algorithm interleaves evaluation and improvement on a step-by-step basis within each episode's processing. It guarantees that exploration continues because the policy π remains ϵ-soft (non-zero probability for all actions). As more episodes are collected, the Q-value estimates converge towards the true action values Qπ(s,a) for the ϵ-greedy policy π, and the policy itself gradually improves towards the optimal ϵ-greedy policy. While not necessarily the truly optimal policy (due to the forced exploration), it's often close and significantly better than random behavior.
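The sketch below puts these pieces together. It assumes an episodic environment with a Gym-style interface (env.reset() returning a state, env.step(action) returning (next_state, reward, done, info)) and a fixed action list; the interface, names, and defaults are illustrative assumptions, not a definitive implementation:

```python
import random
from collections import defaultdict

def mc_control_first_visit(env, actions, num_episodes, gamma=1.0, epsilon=0.1):
    """On-policy first-visit MC control (sketch).

    Assumes a Gym-style interface: env.reset() -> state and
    env.step(action) -> (next_state, reward, done, info).
    `actions` is the list of actions available in every state.
    """
    Q = defaultdict(float)   # Q[(s, a)]: estimated action value
    N = defaultdict(int)     # N[(s, a)]: number of first visits recorded

    def policy(state):
        # epsilon-greedy with respect to the current Q estimates
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        # 1. Generate an episode by following the current epsilon-greedy policy.
        episode = []
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # 2. Walk backwards through the episode, accumulating the return G.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            # First-visit check: update only if (s, a) does not occur earlier.
            if not any(s == s_e and a == a_e for s_e, a_e, _ in episode[:t]):
                N[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]

    return Q
```

Because policy() always reads the latest Q values, the improvement step happens implicitly every time an action is selected, which is exactly the GPI interaction described above.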
In practice, you might use dictionaries in Python to store Q(s,a) and N(s,a), where keys are state-action tuples (s, a). Episode generation involves sampling actions according to the current ϵ-greedy policy probabilities, stored or calculated on the fly. The subsequent practice section will provide a more concrete example.