In the previous section, we saw how Monte Carlo (MC) methods can estimate the state-value function Vπ(s) for a given policy π by averaging the returns observed after visiting state s. This process, called MC prediction or evaluation, tells us how good it is to be in a particular state under the current policy. However, our ultimate aim in Reinforcement Learning is usually control: finding the optimal policy π∗ that maximizes the expected return from all states.
To achieve control using MC methods, especially when we don't have a model of the environment (i.e., we don't know the transition probabilities p(s′,r∣s,a)), estimating the state-value function Vπ is often not sufficient. Why? Because improving the policy requires knowing the value of choosing different actions in a given state. If we only know Vπ(s), we can't easily determine which action a leads to the best outcome without knowing the environment's dynamics. We'd need the model to compute ∑_{s′,r} p(s′,r∣s,a)[r + γVπ(s′)].
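To make the dependence on the model concrete, here is a minimal sketch of that one-step lookahead. The `model` dictionary, states, and values are hypothetical placeholders; the point is simply that the sum cannot be evaluated without access to p(s′,r∣s,a):

```python
# Hypothetical tabular model: model[(s, a)] is a list of
# (probability, next_state, reward) tuples representing p(s', r | s, a).
model = {
    ("s0", "left"):  [(1.0, "s1", 0.0)],
    ("s0", "right"): [(0.5, "s1", 1.0), (0.5, "s2", -1.0)],
}
V = {"s1": 2.0, "s2": 0.5}   # hypothetical state-value estimates
gamma = 0.9

def one_step_lookahead(s, a):
    """Expected return of taking action a in state s, then following pi.
    This requires the model p(s', r | s, a); without it, the sum below
    cannot be computed, which is why model-free control estimates
    Q(s, a) directly instead."""
    return sum(p * (r + gamma * V[s_next]) for p, s_next, r in model[(s, a)])

print(one_step_lookahead("s0", "right"))  # 0.5*(1 + 0.9*2.0) + 0.5*(-1 + 0.9*0.5)
```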
This is where the action-value function Qπ(s,a) becomes essential. Recall that Qπ(s,a) represents the expected return starting from state s, taking action a, and subsequently following policy π. If we have an accurate estimate of Qπ(s,a) for all actions a available in state s, choosing the best action is straightforward: we simply select the action with the highest estimated Q-value.
$$\pi'(s) = \arg\max_a Q(s, a)$$
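In code, this improvement step is just an argmax over the action-value estimates for a state. A minimal sketch, assuming a tabular Q stored as a dictionary keyed by (state, action) and a known list of actions (both hypothetical names):

```python
import random

def greedy_action(Q, state, actions):
    """Return the action with the highest estimated Q(state, action),
    breaking ties randomly so no action is systematically favored."""
    values = [Q.get((state, a), 0.0) for a in actions]
    best = max(values)
    best_actions = [a for a, v in zip(actions, values) if v == best]
    return random.choice(best_actions)
```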
Therefore, for model-free control using MC methods, we shift our focus from estimating Vπ(s) to estimating Qπ(s,a).
The core idea remains the same as MC prediction: we learn from complete episodes of experience and average the observed returns. However, instead of averaging returns for each state visited, we average returns for each state-action pair (s,a) visited.
Let's consider the first-visit MC method applied to estimating Qπ. We run multiple episodes following policy π. For each episode, we look at every state-action pair (St,At) that occurred. If it's the first time this specific pair (St,At) was visited during this episode, we calculate the total return Gt from that point until the end of the episode:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T$$
where T is the terminal time step of the episode.
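The backward pass below shows one way to compute these first-visit returns for a single episode. It is a sketch under the assumption that the episode is stored as a list of (state, action, reward) tuples, where reward is the R_{t+1} received after taking A_t in S_t:

```python
def first_visit_returns(episode, gamma):
    """Compute G_t for the first visit to each (state, action) pair.

    episode: list of (state, action, reward) tuples, where reward is
             the R_{t+1} received after taking action in state.
    Returns a dict mapping (state, action) -> G_t at its first visit.
    """
    G = 0.0
    returns_at = {}
    # Walk backwards so G accumulates the discounted tail of the episode.
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        # Overwriting on each occurrence leaves the value from the
        # earliest (first) visit, since we iterate back to front.
        returns_at[(state, action)] = G
    return returns_at
```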
We maintain an estimate Q(s,a) for every state-action pair. Similar to MC prediction for Vπ, we can update our estimate Q(St,At) by averaging all the returns Gt observed after first visits to (St,At) across all episodes.
A practical way to implement this is using an incremental update rule. We keep track of the number of times each state-action pair (s,a) has been visited for the first time across episodes; let's call this count N(s,a). After generating an episode and calculating the return Gt following the first visit to (St,At), we increment the count and update our estimate Q(St,At) as follows:

$$N(S_t, A_t) \leftarrow N(S_t, A_t) + 1$$
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)} \left( G_t - Q(S_t, A_t) \right)$$

This formula adjusts the current estimate Q(St,At) in the direction of the newly observed return Gt. The step size 1/N(St,At) ensures that we are essentially calculating a running average; early returns have a larger impact, while later returns cause smaller adjustments as our estimate becomes more confident. Often, a constant step size α is used instead of 1/N(s,a) for non-stationary problems (where the environment or policy might change):
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( G_t - Q(S_t, A_t) \right)$$
where 0<α≤1 is a learning rate parameter.
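Both update rules are easy to express for a tabular Q. The sketch below assumes the returns_at dictionary produced by the earlier first_visit_returns helper (a hypothetical name) and uses defaultdicts for the Q-table and the visit counts:

```python
from collections import defaultdict

Q = defaultdict(float)   # Q[(s, a)] initialized to 0.0
N = defaultdict(int)     # first-visit counts per (s, a) pair

def update_running_average(returns_at):
    """Running-average update: step size 1 / N(s, a)."""
    for sa, G in returns_at.items():
        N[sa] += 1
        Q[sa] += (G - Q[sa]) / N[sa]

def update_constant_alpha(returns_at, alpha=0.1):
    """Constant step size, useful for non-stationary problems."""
    for sa, G in returns_at.items():
        Q[sa] += alpha * (G - Q[sa])
```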
Now we have a way to estimate Qπ(s,a), the action-values for a given policy π. But how do we use this to find the optimal policy π∗?
The natural inclination is to follow the pattern of Generalized Policy Iteration (GPI), which involves alternating between policy evaluation (estimating the value function for the current policy) and policy improvement (making the policy greedy with respect to the current value function estimates).
With MC methods, this would look like the following (a rough sketch of the loop appears after the list):
1. Policy evaluation: run episodes with the current policy π and use MC averaging to estimate Qπ(s,a).
2. Policy improvement: update π to be greedy with respect to the current estimates, π(s) = argmax_a Q(s,a).
Repeating these two steps should, in principle, drive the policy toward π∗.
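Putting the two steps together gives a loop like the one below. This is only a sketch: generate_episode is an assumed helper that runs one episode under the current policy, and first_visit_returns and greedy_action are the hypothetical helpers sketched earlier. The point is the alternation between evaluation and improvement:

```python
from collections import defaultdict

def mc_control_naive(env, actions, num_episodes, gamma=0.99):
    """Naive MC control via GPI: estimate Q, then act greedily.
    As discussed next, this version lacks exploration and can get stuck."""
    Q, N = defaultdict(float), defaultdict(int)
    policy = {}  # state -> action; unseen states handled by the caller

    for _ in range(num_episodes):
        # Policy evaluation: one episode's worth of MC estimates.
        episode = generate_episode(env, policy)            # assumed helper
        for sa, G in first_visit_returns(episode, gamma).items():
            N[sa] += 1
            Q[sa] += (G - Q[sa]) / N[sa]
        # Policy improvement: greedy with respect to the current Q.
        for state, _ in Q:
            policy[state] = greedy_action(Q, state, actions)
    return Q, policy
```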
However, there's a significant complication: exploration. If the policy π becomes deterministic and greedy early on (e.g., it always chooses action a1 from state s), then we will only ever observe returns for the pair (s,a1). We will never gather experience (and thus never update the Q-value estimates) for other potentially better actions (like a2,a3,…) from state s. The agent might get stuck in a suboptimal loop because it never explores alternatives.
To guarantee that MC methods converge to the optimal policy, we need to ensure that all state-action pairs continue to be visited. This is known as maintaining exploration.
One theoretical approach is the assumption of exploring starts. This assumes that every episode starts in a randomly chosen state-action pair (s,a), and only then follows the policy π. This ensures that every pair has a non-zero probability of being selected as the starting point, guaranteeing eventual exploration of all pairs. While useful for theoretical proofs, exploring starts are often not feasible in real-world applications (e.g., you might not be able to arbitrarily set the initial state and action of a robot).
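Exploring starts only changes how an episode begins. A minimal sketch of the idea, assuming policy is a callable mapping a state to an action and the environment exposes hypothetical set_state and step methods (exactly the kind of control that is rarely available in practice):

```python
import random

def generate_episode_exploring_starts(env, policy, states, actions):
    """Start from a uniformly random (state, action) pair, then follow policy.
    Assumes env.set_state(s) and env.step(a) exist; that assumption is
    precisely the limitation of exploring starts noted above."""
    s = random.choice(states)
    a = random.choice(actions)
    env.set_state(s)                        # hypothetical capability
    episode = []
    done = False
    while not done:
        s_next, reward, done = env.step(a)  # hypothetical API
        episode.append((s, a, reward))
        s, a = s_next, policy(s_next)
    return episode
```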
Therefore, practical MC control algorithms need mechanisms to ensure ongoing exploration even without relying on exploring starts. This often involves using policies that are not fully deterministic or greedy, such as ϵ-greedy policies, which usually act greedily but sometimes (with a small probability ϵ) choose a random action. We will discuss these specific algorithms and the distinction between on-policy and off-policy learning, which relate directly to how exploration is handled, in the following sections.
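As a preview, ϵ-greedy action selection is only a few lines. This sketch reuses the hypothetical greedy_action helper from earlier and treats epsilon as a small tunable constant:

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action (exploration),
    otherwise pick the greedy action (exploitation)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return greedy_action(Q, state, actions)  # greedy helper sketched earlier
```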