As we discussed in the chapter introduction, Monte Carlo (MC) methods offer a way to learn from experience without needing a model of the environment's dynamics (p(s′,r∣s,a)). The first step in using MC methods is often prediction, which means estimating the value function Vπ(s) for a given policy π. Remember, Vπ(s) represents the expected cumulative discounted reward starting from state s and following policy π thereafter.
The core idea behind MC prediction is remarkably simple: we want to estimate an expected value, Vπ(s)=Eπ[Gt∣St=s]. The law of large numbers tells us that we can approximate an expected value by taking the average of many samples. In MC methods, these samples are the actual returns (Gt) observed after visiting state s during simulated or real interactions with the environment.
To get these samples, we need to run complete episodes following the policy π. An episode is a sequence of states, actions, and rewards starting from an initial state and ending at a terminal state: S0,A0,R1,S1,A1,R2,…,ST. For any time step t within an episode, the return Gt is the total discounted reward from that point forward:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$$

where T is the time step of the terminal state and γ is the discount factor.
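To make this concrete, here is a minimal sketch (the function name and example rewards are illustrative, not from the text) that computes G_t for every time step of an episode by working backward with the recursion G_t = R_{t+1} + γG_{t+1}:

```python
# Illustrative sketch: compute the return G_t for each time step of an episode,
# given the observed rewards R_1, ..., R_T and a discount factor gamma.
# Works backward through the episode using G_t = R_{t+1} + gamma * G_{t+1}.

def compute_returns(rewards, gamma=0.9):
    """rewards[t] is R_{t+1}, the reward received after time step t."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Example episode with rewards 1, 0, 2:
# G_2 = 2, G_1 = 0 + 0.9 * 2 = 1.8, G_0 = 1 + 0.9 * 1.8 = 2.62
print(compute_returns([1.0, 0.0, 2.0]))  # approximately [2.62, 1.8, 2.0]
```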
MC prediction estimates Vπ(s) by averaging the returns Gt observed following visits to state s across multiple episodes. There are two main variants for how we handle multiple visits to the same state within a single episode.
In the first-visit MC method, we only consider the return following the first time state s is visited in each episode.
Here's the procedure:

1. Maintain a list, Returns(s), for each state s to store the returns observed for that state.
2. Generate an episode following policy π. For each state S_t in the episode, check whether this is the first time S_t has been visited in that episode. If it is, compute the return G_t following this first visit and append it to Returns(S_t).
3. Update the value estimate for S_t to be the average of the returns collected so far: V(S_t) = average(Returns(S_t)).
As more episodes are generated and more returns are collected for each state, the average of the returns in Returns(s) will converge to the true expected return Vπ(s).
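Putting the procedure together, the following is a minimal sketch of first-visit MC prediction. The `generate_episode(policy)` helper, which is assumed to return the episode as a list of `(state, reward)` pairs, is an interface invented for the example and not something defined in the text:

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, policy, num_episodes, gamma=0.9):
    returns = defaultdict(list)   # Returns(s): first-visit returns observed for s
    V = defaultdict(float)        # V(s): current value estimates

    for _ in range(num_episodes):
        # episode is assumed to be [(S_0, R_1), (S_1, R_2), ..., (S_{T-1}, R_T)]
        episode = generate_episode(policy)
        states = [s for s, _ in episode]

        # Compute G_t for every time step by working backward through the episode.
        g = 0.0
        step_returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            g = episode[t][1] + gamma * g
            step_returns[t] = g

        # First-visit rule: only the return following the first occurrence
        # of each state in this episode is recorded.
        for t, (s, _) in enumerate(episode):
            if s not in states[:t]:
                returns[s].append(step_returns[t])
                V[s] = sum(returns[s]) / len(returns[s])

    return V
```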
The every-visit MC method is similar, but it averages the returns following every visit to state s within each episode.
The procedure changes slightly in step 2: we compute the return G_t following every visit to S_t in the episode, not just the first, and append each of these returns to Returns(S_t).

Both first-visit and every-visit MC methods converge to the true state-value function Vπ(s) as the number of visits (first visits or all visits, respectively) to each state approaches infinity. First-visit MC is often easier to analyze theoretically, while every-visit MC might be slightly easier to implement initially and has also seen successful application. For large numbers of episodes, the differences often become less significant.
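In code, the only change from the first-visit sketch above is that the first-visit check disappears: every occurrence of a state contributes its return. The same hypothetical `generate_episode(policy)` interface is assumed:

```python
from collections import defaultdict

def every_visit_mc_prediction(generate_episode, policy, num_episodes, gamma=0.9):
    returns = defaultdict(list)
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode(policy)   # [(S_0, R_1), (S_1, R_2), ...]

        g = 0.0
        step_returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            g = episode[t][1] + gamma * g
            step_returns[t] = g

        # Every-visit rule: no first-visit check; every occurrence of a state
        # contributes the return that follows it.
        for t, (s, _) in enumerate(episode):
            returns[s].append(step_returns[t])
            V[s] = sum(returns[s]) / len(returns[s])

    return V
```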
Storing all the returns for every state can become memory-intensive if we run many episodes or if the state space is large. We can compute the average incrementally without storing the whole list of returns.
Suppose we have just obtained a new return G after a visit (either first or every, depending on the method) to state s. Let N(s) be the count of returns collected for state s before this new one. We want to update our current estimate V(s) (which is the average of the previous N(s) returns) using the new return G.
The new average after N(s)+1 returns will be:
$$V_{\text{new}}(s) = \frac{1}{N(s)+1} \sum_{i=1}^{N(s)+1} G_i = \frac{1}{N(s)+1} \left( G + \sum_{i=1}^{N(s)} G_i \right)$$

Since the old estimate was $V_{\text{old}}(s) = \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i$, we have $\sum_{i=1}^{N(s)} G_i = N(s)\, V_{\text{old}}(s)$. Substituting this in:

$$V_{\text{new}}(s) = \frac{1}{N(s)+1} \left( G + N(s)\, V_{\text{old}}(s) \right)$$

Rearranging this gives the common incremental update rule:

$$V_{\text{new}}(s) = V_{\text{old}}(s) + \frac{1}{N(s)+1} \left( G - V_{\text{old}}(s) \right)$$

This update rule has the familiar form NewEstimate = OldEstimate + StepSize × (Target − OldEstimate). Here, the return G serves as the "target" value we are trying to move our estimate towards, and the step size is $\frac{1}{N(s)+1}$.
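A quick numeric check (the numbers are made up for illustration) confirms that the incremental form reproduces the plain average:

```python
# Suppose state s has already produced N(s) = 2 returns, 3.0 and 5.0,
# so the old estimate is their average, 4.0. A new return G = 10.0 arrives.
old_returns = [3.0, 5.0]
v_old = sum(old_returns) / len(old_returns)              # 4.0
g = 10.0
n_plus_1 = len(old_returns) + 1                          # N(s) + 1 = 3

batch_average = sum(old_returns + [g]) / n_plus_1        # (3 + 5 + 10) / 3 = 6.0
incremental = v_old + (g - v_old) / n_plus_1             # 4 + 6 / 3       = 6.0
print(batch_average, incremental)                        # 6.0 6.0
```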
To implement this, instead of the lists Returns(s), we maintain two quantities for each state s: a count N(s) of how many returns have been averaged so far, and the current estimate V(s).

When a new return G is observed for state s, we increment N(s) and then update

$$V(s) \leftarrow V(s) + \frac{1}{N(s)} \left( G - V(s) \right)$$

where the freshly incremented N(s) plays the role of N(s)+1 in the derivation above.
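A minimal sketch of this incremental bookkeeping (names are illustrative; the returns G are assumed to be produced as in the earlier examples):

```python
from collections import defaultdict

# Incremental MC prediction state: no stored lists of returns, just a count
# and a running estimate per state.
N = defaultdict(int)      # N(s): number of returns averaged so far
V = defaultdict(float)    # V(s): current value estimate

def record_return(s, g):
    """Fold a newly observed return g for state s into the running average."""
    N[s] += 1
    V[s] += (g - V[s]) / N[s]   # V(s) <- V(s) + (1/N(s)) * (G - V(s))

# Feeding in the same returns a plain average would use gives the same result:
for g in [3.0, 5.0, 10.0]:
    record_return("some_state", g)
print(V["some_state"])           # 6.0
```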
Sometimes, especially in non-stationary problems where the environment or policy might change over time, a constant step-size parameter α∈(0,1] is used instead of 1/N(s):
$$V(s) \leftarrow V(s) + \alpha \left( G - V(s) \right)$$

This gives more weight to recent returns, allowing the estimate to adapt if the true value Vπ(s) is changing. However, with a constant α, the estimate doesn't fully converge to the true value; it will fluctuate around it. For stationary problems where we want convergence to the true Vπ(s), the 1/N(s) step size is generally preferred as it satisfies the conditions for stochastic approximation convergence.
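For completeness, here is how the constant step-size update looks in the same style as the incremental sketch above (α = 0.1 is an arbitrary illustrative value):

```python
def update_value_constant_alpha(V, s, g, alpha=0.1):
    """Constant step-size update: V(s) <- V(s) + alpha * (G - V(s)).

    V is a dict of value estimates and g is the newly observed return for s.
    Unlike the 1/N(s) rule, this weights recent returns more heavily and keeps
    fluctuating around the true value instead of fully converging.
    """
    v = V.get(s, 0.0)
    V[s] = v + alpha * (g - v)
```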
MC prediction provides a solid, model-free way to evaluate a given policy π. By simply running episodes and averaging returns, we can estimate how good it is to be in any particular state under that policy. This ability to evaluate policies is a building block for the next step: finding better policies, which is the goal of MC Control.