Now that we have explored the theory behind policy gradients and the REINFORCE algorithm, it's time to put this knowledge into practice. This section provides a step-by-step guide to implementing the basic REINFORCE algorithm, often referred to as Monte Carlo Policy Gradient, using Python and a deep learning library like PyTorch or TensorFlow. We will apply it to a classic control problem, CartPole, available through the Gymnasium (formerly OpenAI Gym) library.
Our goal is to train an agent that learns a policy π(a∣s;θ) to balance a pole on a cart for as long as possible. REINFORCE achieves this by adjusting the policy parameters θ based on the outcomes of complete episodes.
First, ensure you have Gymnasium installed (pip install gymnasium[classic_control]). We'll use the CartPole-v1 environment. It provides observations (cart position, cart velocity, pole angle, pole angular velocity), expects discrete actions (0 for push left, 1 for push right), and gives a reward of +1 for every timestep the pole remains upright. An episode terminates if the pole angle exceeds a threshold, the cart moves too far from the center, or after 500 steps.
# Example using Gymnasium
import gymnasium as gym
env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
print(f"State dimensions: {state_dim}") # Output: 4
print(f"Action dimensions: {action_dim}") # Output: 2
We need a function approximator to represent our policy π(a∣s;θ). A simple feedforward neural network is suitable for CartPole. The network takes the state as input and outputs the probabilities for each possible action. A softmax activation function is applied to ensure the outputs represent a valid probability distribution over actions.
# Example Policy Network Structure (using PyTorch)
import torch
import torch.nn as nn
import torch.nn.functional as F
class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, action_dim)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        action_probs = F.softmax(self.fc2(x), dim=-1)
        return action_probs
Let's break down the core components of the REINFORCE agent.
Initialization: Create an instance of the PolicyNetwork. Choose an optimizer, such as Adam, to update the network's weights (θ). Set a learning rate.
# Example Initialization
policy_net = PolicyNetwork(state_dim, action_dim)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
gamma = 0.99 # Discount factor
Action Selection: Given a state s, pass it through the policy_net to get action probabilities. Sample an action a from this probability distribution. In PyTorch, you can use torch.distributions.Categorical. Store the log probability log π(a∣s;θ) of the chosen action; we'll need it for the update.
# Example Action Selection
def select_action(state, policy_net):
    state_tensor = torch.FloatTensor(state).unsqueeze(0)  # Add batch dimension
    action_probs = policy_net(state_tensor)
    distribution = torch.distributions.Categorical(action_probs)
    action = distribution.sample()
    log_prob = distribution.log_prob(action)
    return action.item(), log_prob
Episode Rollout: Run a full episode in the environment: reset it to obtain the initial state, then repeatedly select an action, step the environment, and record the log probability and the reward received at each timestep, until the episode terminates or is truncated.
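One way to organize this step is a small helper function; the run_episode name below is an illustrative sketch rather than part of the original code, and it assumes only the select_action function defined above and the standard Gymnasium reset/step API.
# Example Episode Rollout (illustrative helper; assumes select_action and a Gymnasium env)
def run_episode(env, policy_net):
    state, _ = env.reset()
    log_probs = []   # log π(a_t|s_t; θ) for each timestep
    rewards = []     # reward received after each action
    terminated = truncated = False
    while not terminated and not truncated:
        action, log_prob = select_action(state, policy_net)
        state, reward, terminated, truncated, _ = env.step(action)
        log_probs.append(log_prob)
        rewards.append(reward)
    return log_probs, rewards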
Return Calculation: After the episode finishes (at step T), calculate the discounted return $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_{k+1}$ for each timestep t in the episode. A common way to compute this efficiently is to iterate backward from the end of the episode:
# Example Return Calculation
def calculate_returns(rewards, gamma):
    returns = []
    discounted_return = 0
    for r in reversed(rewards):
        discounted_return = r + gamma * discounted_return
        returns.insert(0, discounted_return)  # Prepend to keep chronological order
    returns = torch.tensor(returns)
    # Normalize returns (optional but recommended)
    returns = (returns - returns.mean()) / (returns.std() + 1e-9)
    return returns
Loss Calculation and Gradient Update: Compute the REINFORCE loss. The objective is to maximize expected return, so we perform gradient ascent. Most deep learning libraries implement gradient descent, so we minimize the negative of the objective function. The loss for one episode is:
$L(\theta) = -\sum_{t=0}^{T-1} \log \pi(a_t \mid s_t; \theta)\, G_t$
Calculate this sum using the stored log probabilities and the returns calculated for the episode. Then perform backpropagation and update the network parameters using the optimizer.
# Example Update Step
def update_policy(log_probs, returns, optimizer):
    loss = []
    for log_prob, Gt in zip(log_probs, returns):
        loss.append(-log_prob * Gt)  # Negative sign for gradient ascent via minimization
    optimizer.zero_grad()
    policy_loss = torch.stack(loss).sum()  # Sum losses over the episode
    policy_loss.backward()
    optimizer.step()
The overall training process involves running multiple episodes and updating the policy network after each one.
# Simplified Training Loop Structure
num_episodes = 1000
episode_rewards = []
for episode in range(num_episodes):
    state, _ = env.reset()
    episode_log_probs = []
    episode_rewards_raw = []
    terminated = False
    truncated = False
    while not terminated and not truncated:
        action, log_prob = select_action(state, policy_net)
        next_state, reward, terminated, truncated, _ = env.step(action)
        episode_log_probs.append(log_prob)
        episode_rewards_raw.append(reward)
        state = next_state

    # Calculate returns and update policy after the episode ends
    returns = calculate_returns(episode_rewards_raw, gamma)
    update_policy(episode_log_probs, returns, optimizer)

    total_episode_reward = sum(episode_rewards_raw)
    episode_rewards.append(total_episode_reward)
    if (episode + 1) % 50 == 0:
        print(f"Episode {episode+1}, Average Reward (last 50): {sum(episode_rewards[-50:])/50:.2f}")

env.close()
Plotting the total reward per episode (or a moving average) is essential to see if the agent is learning.
Figure: A typical learning curve for REINFORCE on CartPole, showing the average reward over windows of 50 episodes improving over time. The maximum possible reward is 500.
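As a quick sketch of how you might produce such a plot (assuming matplotlib is installed and the episode_rewards list from the training loop above), a 50-episode moving average can be computed and drawn like this:
# Example Reward Plot (illustrative sketch; assumes matplotlib and the episode_rewards list above)
import matplotlib.pyplot as plt

window = 50
moving_avg = []
for i in range(len(episode_rewards)):
    window_rewards = episode_rewards[max(0, i - window + 1):i + 1]
    moving_avg.append(sum(window_rewards) / len(window_rewards))

plt.plot(episode_rewards, alpha=0.3, label="Episode reward")
plt.plot(moving_avg, label=f"{window}-episode moving average")
plt.xlabel("Episode")
plt.ylabel("Total reward")
plt.legend()
plt.show()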
As discussed previously, REINFORCE suffers from high variance. Subtracting a baseline b(st) from the return Gt can significantly reduce this variance without introducing bias. A common baseline is the state-value function V(st). We can estimate V(st) using another neural network (the "critic") trained to predict the expected return from state st.
The modified update objective becomes minimizing:
$L(\theta) = -\sum_{t=0}^{T-1} \log \pi(a_t \mid s_t; \theta)\,\bigl(G_t - V(s_t; \phi)\bigr)$
where V(st;ϕ) is the value estimated by the critic network with parameters ϕ. The critic network itself is typically trained using supervised learning, minimizing the squared error between its predictions V(st;ϕ) and the actual calculated returns Gt. This architecture leads us towards Actor-Critic methods, which we will explore in the next chapter.
Even a simple baseline, like the average return over the episode, can sometimes help. For this practical, we focused on the core REINFORCE algorithm. Experimenting with baselines is a valuable next step.
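As a rough starting point for such an experiment, the sketch below adds a small value network as a learned baseline. The ValueNetwork class, the update_with_baseline function, and the extra states/value_optimizer arguments are illustrative additions, not part of the practical above; the sketch also assumes you collect each state as a FloatTensor during the rollout and create a second Adam optimizer for the critic.
# Example REINFORCE with a learned baseline (illustrative sketch, not part of the core practical)
import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueNetwork(nn.Module):
    """Critic: estimates V(s; φ) for a given state."""
    def __init__(self, state_dim, hidden_dim=128):
        super(ValueNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, 1)

    def forward(self, state):
        return self.fc2(F.relu(self.fc1(state))).squeeze(-1)

def update_with_baseline(log_probs, returns, states, value_net,
                         policy_optimizer, value_optimizer):
    # states: list of state tensors collected during the rollout
    values = value_net(torch.stack(states))    # V(s_t; φ) for each visited state
    advantages = returns - values.detach()     # G_t - V(s_t; φ); no gradient through the baseline
    policy_loss = torch.stack(
        [-lp * adv for lp, adv in zip(log_probs, advantages)]
    ).sum()
    # Train the critic to predict the returns it is given; if calculate_returns
    # normalizes returns, the critic learns that normalized target.
    value_loss = F.mse_loss(values, returns)

    policy_optimizer.zero_grad()
    policy_loss.backward()
    policy_optimizer.step()

    value_optimizer.zero_grad()
    value_loss.backward()
    value_optimizer.step()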
This hands-on example demonstrates the fundamental mechanics of the REINFORCE algorithm. While simple, it highlights how we can directly optimize a policy using gradient ascent based on sampled trajectories and their returns. Remember that tuning hyperparameters (learning rate, network architecture, discount factor, normalization) is often necessary to achieve good performance.