Building upon the Policy Gradient Theorem, which provides a way to calculate the gradient of the expected return with respect to the policy parameters θ, we need a practical algorithm to estimate and use this gradient. The REINFORCE algorithm, also known as Monte Carlo Policy Gradient, provides exactly that. It learns directly from complete episodes of experience, sampled using the current policy πθ.
The core idea is elegant: we want to adjust the policy parameters θ such that actions leading to higher overall returns become more probable, and actions leading to lower returns become less probable. REINFORCE achieves this by sampling trajectories and using the actual returns experienced within those trajectories to guide the parameter updates.
Recall the policy gradient theorem:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$
Here, J(θ) is the expected total return, τ represents a trajectory (episode) sampled according to the policy πθ, Gt is the total discounted return starting from time step t, and ∇θlogπθ(at∣st) is the gradient of the log-probability of taking action at in state st. This term, often called the "score function", indicates the direction in parameter space that increases the probability of the action at taken at state st.
Since we usually cannot compute the expectation Eτ∼πθ[⋅] directly (as it requires integrating over all possible trajectories), REINFORCE uses Monte Carlo estimation. We run the current policy πθ to generate one or more complete episodes. For each episode, we calculate the return Gt for every time step t. Then, we approximate the gradient using the sampled data.
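To make the return computation concrete, here is a minimal sketch of one common way to compute Gt for every time step of a sampled episode, working backwards through the observed rewards. The reward values and discount factor below are illustrative placeholders, not values prescribed by the algorithm.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_{k=t+1}^{T} gamma^(k-t-1) * r_k for each step t.

    `rewards` holds r_1, ..., r_T observed after actions a_0, ..., a_{T-1}.
    """
    returns = np.zeros(len(rewards))
    running = 0.0
    # Work backwards: G_t = r_{t+1} + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example with made-up rewards from a 4-step episode
print(discounted_returns([0.0, 0.0, 1.0, 2.0], gamma=0.9))
```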
The REINFORCE Algorithm
The algorithm proceeds as follows:
- Initialize: Start with randomly initialized policy parameters θ.
- Loop (for each episode):
a. Generate Trajectory: Run the policy πθ in the environment to collect a full episode: τ=(s0,a0,r1,s1,a1,r2,...,sT−1,aT−1,rT). Here, T is the terminal time step of the episode.
b. Loop (for each time step t from 0 to T−1):
i. Calculate Return: Compute the return from time step t onwards: $G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} r_k$. Note that Gt is the actual, observed discounted return following state st and action at in this specific trajectory.
ii. Calculate Update: Determine the update direction for this step: ∇θlogπθ(at∣st)Gt.
c. Update Parameters: Update the policy parameters using the accumulated gradients from the episode, typically using stochastic gradient ascent:
$$\theta \leftarrow \theta + \alpha \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$$
where α is the learning rate. The update can also be performed after each step or after a batch of episodes.
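The sketch below puts these steps together as a minimal PyTorch training loop, assuming the CartPole environment from Gymnasium and a small softmax policy network. The network size, learning rate, number of episodes, and discount factor are arbitrary choices made for illustration, not part of the algorithm itself.

```python
import gymnasium as gym
import torch
import torch.nn as nn

# Softmax policy over discrete actions; the architecture is an arbitrary choice.
env = gym.make("CartPole-v1")
policy = nn.Sequential(
    nn.Linear(env.observation_space.shape[0], 64),
    nn.Tanh(),
    nn.Linear(64, env.action_space.n),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(500):
    # a. Generate a full trajectory with the current policy
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        dist = torch.distributions.Categorical(
            logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))   # log pi_theta(a_t | s_t)
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(float(reward))
        done = terminated or truncated

    # b. Compute G_t for every time step, working backwards
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    returns = torch.tensor(returns)

    # c. Gradient ascent on sum_t log pi_theta(a_t | s_t) * G_t,
    #    implemented as gradient descent on the negated sum
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```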
Understanding the Update
Let's break down the contribution of a single time step t to that sum: θ←θ+α∇θlogπθ(at∣st)Gt.
- ∇θlogπθ(at∣st) (Score Function): This vector points in the direction in parameter space (θ) that most increases the log-probability of the action at that was actually taken in state st during the sampled episode.
- Gt (Return): This scalar value represents the quality of the trajectory following the state-action pair (st,at). It acts as a weighting factor for the score function.
- If Gt is large and positive, the update pushes θ significantly in the direction that makes at more likely in state st. The algorithm "reinforces" this action because it led to a good outcome.
- If Gt is small and positive, the update still pushes θ in the direction indicated by the score function, but only weakly, so at becomes only slightly more likely. If Gt is negative, the update pushes θ in the opposite direction, making at less likely in state st. The algorithm weakly reinforces or actively discourages the action according to the outcome that followed it.
Essentially, REINFORCE correlates the direction of steepest ascent for the action probability (∇θlogπθ) with the return (Gt) that followed that action. Actions associated with higher returns get "boosted" more strongly.
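The effect of the return's sign is easy to see with a toy softmax policy over two actions in a single state, where the score function has the closed form ∇θ log πθ(a) = onehot(a) − πθ. The parameter values, learning rate, and returns in this sketch are made up purely for illustration.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

theta = np.array([0.0, 0.0])   # logits for two actions in a single state
alpha = 0.5                    # learning rate (arbitrary)

def reinforce_step(theta, action, G):
    probs = softmax(theta)
    score = -probs              # gradient of log-softmax w.r.t. the logits...
    score[action] += 1.0        # ...is onehot(action) - probs
    return theta + alpha * G * score

print(softmax(theta))                           # [0.5, 0.5] before any update
print(softmax(reinforce_step(theta, 0, +2.0)))  # action 0 becomes more likely
print(softmax(reinforce_step(theta, 0, -2.0)))  # action 0 becomes less likely
```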
The following diagram outlines the information flow within a single REINFORCE update step based on a sampled episode.
Flow diagram illustrating how information from a sampled trajectory is used to calculate the return and score function, which are then combined to update the policy parameters in REINFORCE.
Strengths and Weaknesses
REINFORCE offers several advantages:
- Simplicity: The core concept and update rule are relatively straightforward.
- Direct Optimization: It directly optimizes the policy for the objective function (expected return).
- Compatibility: Works naturally with both discrete and continuous action spaces (by parameterizing appropriate probability distributions). It can also learn optimal stochastic policies.
However, it also has significant drawbacks:
- High Variance: The Monte Carlo estimate of the return Gt from single trajectories can be extremely noisy. Different episodes can yield vastly different returns even under the same (or nearly the same) policy, because of randomness in the environment and in the action sampling, so the gradient estimate of ∇θJ(θ) has high variance. This makes learning unstable and slow.
- Sample Inefficiency: It typically requires many episodes to converge because of the noisy gradient estimates.
- Requires Complete Episodes: Updates are performed only after an episode finishes, which can be problematic in very long or non-terminating tasks.
The high variance is the most pressing issue with the basic REINFORCE algorithm. In the next section, we will discuss how introducing baselines can help mitigate this problem significantly, leading to faster and more stable learning.
Implementation Considerations
When implementing REINFORCE, especially with neural networks:
- Policy Representation: The policy πθ(a∣s) is typically represented by a neural network that takes the state s as input.
- For discrete actions, the network usually outputs probabilities for each action (e.g., using a softmax output layer). logπθ(at∣st) is then the log of the output probability corresponding to the action at that was taken.
- For continuous actions, the network might output the parameters of a probability distribution (e.g., the mean μ and standard deviation σ of a Gaussian distribution). The action at is sampled from N(μ(st;θ),σ(st;θ)), and logπθ(at∣st) is the log-probability density function (PDF) of the chosen action under that distribution.
- Gradient Calculation: Deep learning libraries (like TensorFlow or PyTorch) handle the calculation of ∇θlogπθ(at∣st) via automatic differentiation. You typically define a loss function whose gradient matches the policy gradient update. A common approach is to use a surrogate loss like −logπθ(at∣st)Gt. Minimizing this loss using gradient descent is equivalent to performing gradient ascent on the objective J(θ) using the term ∇θlogπθ(at∣st)Gt. Remember that Gt is treated as a constant factor when computing the gradient for the update at step t.
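As one possible realization of these points in PyTorch (the shapes, distribution choices, and function names below are illustrative assumptions, not a prescribed API), the surrogate loss can be built from the log-probabilities of the sampled actions, with the returns detached so that no gradient flows through Gt:

```python
import torch
from torch.distributions import Categorical, Normal

# Assume these tensors were collected while running one episode with the current policy:
#   logits:    (T, num_actions) network outputs for discrete actions, or
#   mu, sigma: (T,) network outputs for a 1-D Gaussian policy,
#   actions:   the actions actually taken, and
#   returns:   the computed G_t values, shape (T,).

def discrete_surrogate_loss(logits, actions, returns):
    dist = Categorical(logits=logits)
    log_probs = dist.log_prob(actions)          # log pi_theta(a_t | s_t)
    # detach() treats G_t as a constant weighting factor
    return -(log_probs * returns.detach()).sum()

def gaussian_surrogate_loss(mu, sigma, actions, returns):
    dist = Normal(mu, sigma)
    log_probs = dist.log_prob(actions)          # log-density of the sampled actions
    return -(log_probs * returns.detach()).sum()
```

Minimizing either loss with an ordinary optimizer step then performs the REINFORCE update for that episode.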