Building upon the idea of directly optimizing a parameterized policy πθ(a∣s), we now introduce a specific algorithm to achieve this: REINFORCE. Also known as Monte Carlo Policy Gradient, REINFORCE provides a straightforward way to adjust the policy parameters θ based on the outcomes of interactions with the environment.
The core principle of REINFORCE is intuitive: increase the probability of actions that lead to good outcomes and decrease the probability of actions that lead to poor outcomes. Since we're dealing with parameterized policies, "increasing the probability" means adjusting the parameters θ in a direction suggested by the gradient of the policy's performance.
REINFORCE is a Monte Carlo method because it learns only from the complete return observed over an entire episode. Unlike Temporal Difference methods (SARSA, Q-learning), it does not update parameters mid-episode. Instead, it waits until the episode finishes, calculates the return (Gt) obtained from each state St onwards, and then updates the policy parameters based on those returns.
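To make the "wait until the episode ends, then compute each Gt" idea concrete, here is a minimal sketch of computing the discounted returns by sweeping backwards over a recorded list of rewards. The function name and reward values are illustrative, not taken from the text:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t for every step t of a finished episode.

    rewards[t] holds R_{t+1}, the reward received after taking action A_t.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    # Sweep backwards: G_t = R_{t+1} + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Example: a 3-step episode with rewards R_1 = 0, R_2 = 0, R_3 = 1
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))  # approx [0.81, 0.9, 1.0]
```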
The REINFORCE Update Rule
Let J(θ) represent the objective function we want to maximize, typically the expected return from the start state. The Policy Gradient Theorem provides a way to calculate the gradient of this objective with respect to the policy parameters θ. REINFORCE uses a sampled version of this gradient based on experiences from episodes.
For a single episode, the update applied to the parameters θ after the episode concludes, summing over all time steps t, is:
θ←θ+α∑_{t=0}^{T−1}Gt∇θlogπθ(At∣St)
Alternatively, updates can happen incrementally at each step t of the episode, using the return Gt calculated after the episode finishes:
θ←θ+αGt∇θlogπθ(At∣St)
Let's break down the components of this update rule:
- α: The learning rate, a small positive scalar determining the step size.
- Gt: The total discounted return starting from time step t until the end of the episode. It's calculated as Gt=Rt+1+γRt+2+γ^2Rt+3+...+γ^{T−t−1}RT, where T is the terminal time step and γ is the discount factor. This Gt value represents the actual outcome experienced after taking action At in state St (and following policy πθ thereafter).
- ∇θlogπθ(At∣St): This is the gradient of the logarithm of the probability of taking the specific action At in state St, according to the current policy parameterization θ. This term is often called the "eligibility vector". It points in the direction in parameter space that most rapidly increases the log-probability of the action At being chosen in state St.
The update rule essentially pushes the parameters θ in the direction specified by ∇θlogπθ(At∣St), scaled by the return Gt.
- If Gt is positive (a good outcome, loosely speaking), the update increases the probability of taking action At in state St in the future.
- If Gt is negative (a poor outcome), the update decreases the probability of taking action At in state St in the future.
The magnitude of the update is proportional to the return Gt, so actions leading to significantly better or worse outcomes result in larger parameter adjustments.
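In an automatic-differentiation framework, this per-step update is usually implemented as gradient ascent on Gt·logπθ(At∣St), i.e. by minimizing its negative. A minimal PyTorch-style sketch, where `policy` is assumed to be a network mapping a state tensor to action logits (the names here are placeholders, not part of the original text):

```python
import torch

def reinforce_step(policy, optimizer, state, action, G_t):
    """One incremental REINFORCE update: theta <- theta + alpha * G_t * grad log pi(A_t|S_t)."""
    logits = policy(state)                       # unnormalized scores, one per action
    log_probs = torch.log_softmax(logits, dim=-1)
    log_prob_action = log_probs[action]          # log pi_theta(A_t | S_t)

    loss = -G_t * log_prob_action                # minimizing this is gradient ascent on G_t * log pi
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```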
The REINFORCE Algorithm Steps
Here's a typical outline of the REINFORCE algorithm:
- Initialize: Choose a policy parameterization πθ(a∣s) (e.g., a neural network with parameters θ) and initialize θ randomly or with predetermined weights. Set a learning rate α.
- Loop (for each episode):
a. Generate Episode: Run one full episode by following the current policy πθ. This means starting in S0, choosing A0∼πθ(⋅∣S0), observing R1,S1, choosing A1∼πθ(⋅∣S1), observing R2,S2, and so on, until a terminal state ST is reached. Store the sequence of states, actions, and rewards: (S0,A0,R1,S1,A1,R2,...,ST−1,AT−1,RT).
b. Calculate Returns: For each time step t in the episode (from t=0 to T−1):
Calculate the return Gt=∑_{k=t+1}^{T}γ^{k−t−1}Rk.
c. Update Parameters: For each time step t in the episode (from t=0 to T−1):
Compute the gradient ∇θlogπθ(At∣St).
Update the policy parameters: θ←θ+αGt∇θlogπθ(At∣St).
- Repeat: Return to the loop step to generate and learn from more episodes.
The updates can be accumulated across all steps within an episode and applied once at the end, or applied incrementally after calculating each Gt. Modern deep learning frameworks often facilitate calculating ∇θlogπθ(At∣St) using automatic differentiation.
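As a concrete illustration of the accumulate-then-update variant, here is a sketch of the training loop using PyTorch and a Gymnasium-style environment. The environment, network size, and hyperparameters are illustrative assumptions, not prescriptions from the text:

```python
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")            # illustrative environment choice
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

policy = nn.Sequential(                   # pi_theta: state -> action logits
    nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(500):
    # a. Generate one full episode with the current policy
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))           # log pi_theta(A_t | S_t)
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(float(reward))
        done = terminated or truncated

    # b. Compute the return G_t for every step (backward sweep)
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)

    # c. Accumulate -sum_t G_t * log pi(A_t|S_t) and update once per episode
    loss = -torch.stack(log_probs) @ torch.as_tensor(returns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```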
Representing the Policy
The policy πθ(a∣s) needs to be a function that takes the state s and parameters θ as input and outputs probabilities for each possible action a. For discrete action spaces, a common approach is to use a neural network that outputs scores (logits) for each action, followed by a softmax function to convert these scores into probabilities:
πθ(a∣s)=softmax(NNθ(s))_a = e^{logit_a} / ∑_b e^{logit_b}
where NNθ(s) is the output vector of the neural network for state s, and logit_a is the element corresponding to action a. The gradient ∇θlogπθ(a∣s) can then be computed with respect to the network parameters θ.
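A sketch of such a softmax policy, letting automatic differentiation produce ∇θlogπθ(a∣s) with respect to the network weights (the class name and layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class SoftmaxPolicy(nn.Module):
    """pi_theta(a|s) = softmax(NN_theta(s))_a for a discrete action space."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, n_actions)
        )

    def log_prob(self, state, action):
        logits = self.net(state)                      # NN_theta(s): one score per action
        return torch.log_softmax(logits, dim=-1)[action]

policy = SoftmaxPolicy(obs_dim=4, n_actions=2)
state = torch.randn(4)
log_pi = policy.log_prob(state, action=1)
log_pi.backward()                                     # p.grad now holds grad_theta log pi_theta(a|s)
eligibility = [p.grad for p in policy.parameters()]
```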
Advantages and Disadvantages
REINFORCE is conceptually simple and serves as a cornerstone for more advanced policy gradient methods.
Advantages:
- Can learn stochastic policies naturally.
- Can handle continuous action spaces by parameterizing the policy appropriately (e.g., outputting the mean and standard deviation of a Gaussian distribution; see the sketch after this list).
- Relatively easy to understand and implement compared to some other methods.
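For the continuous-action case mentioned above, the network can output the mean of a Gaussian (here with a learned, state-independent standard deviation, a common simple choice), and the same Gt∇θlogπθ(At∣St) update applies. A minimal sketch under those assumptions:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """pi_theta(a|s) = Normal(mu_theta(s), sigma) for a 1-D continuous action."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.log_std = nn.Parameter(torch.zeros(1))   # state-independent log std

    def dist(self, state):
        return torch.distributions.Normal(self.mu(state), self.log_std.exp())

policy = GaussianPolicy(obs_dim=3)
state = torch.randn(3)
action = policy.dist(state).sample()                   # a real-valued action
log_prob = policy.dist(state).log_prob(action)         # used exactly as in the discrete case
```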
Disadvantages:
- High Variance: The biggest drawback. The update depends on the Monte Carlo return Gt, which can vary significantly between episodes even with the same policy, purely due to randomness in the environment or the policy itself. This high variance makes learning slow and unstable.
- Sample Inefficiency: Learning only occurs at the end of an episode, and each episode's data is used for just a single on-policy update, so many episodes may be needed to make substantial progress, especially when episodes are long.
- Convergence: Can sometimes converge to local optima rather than the global optimum.
The high variance issue is particularly important. Imagine an episode where most actions were good, but one random unlucky event near the end resulted in a very low overall return Gt for all preceding steps. REINFORCE would then give little credit, or even a penalty, to every action taken earlier in that episode, despite their quality. This motivates the development of techniques like baselines and Actor-Critic methods, which we will discuss next, to mitigate this variance and improve learning stability and efficiency.