Alright, let's put theory into practice and implement the REINFORCE algorithm. As we discussed, REINFORCE (also known as Monte Carlo Policy Gradient) works by directly adjusting the parameters $\theta$ of our policy $\pi_\theta(a \mid s)$ to maximize expected future rewards. The core idea is to sample complete episodes, calculate the return $G_t$ for each step, and then update the policy parameters in the direction suggested by the policy gradient theorem. Specifically, we want to increase the log probability of actions taken at state $s_t$ if they led to a high return $G_t$, and decrease it otherwise.
The update rule, in its simplest form (without a baseline), looks like this:
$$\theta \leftarrow \theta + \alpha \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t$$
Here, $\alpha$ is the learning rate, $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ is the gradient of the log probability of the action taken, and $G_t$ is the return from time step $t$ onwards.
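In an automatic differentiation framework that minimizes a loss, this ascent step is usually implemented by minimizing the negative of the objective. Here is a minimal sketch with made-up numbers, just to show the correspondence (not the full implementation that follows):

import torch

# Hypothetical values for a single time step
log_prob = torch.tensor(-0.69, requires_grad=True)  # log pi_theta(a_t | s_t)
G_t = 2.71                                          # return from step t onward
loss = -log_prob * G_t   # minimizing this is equivalent to ascending log_prob * G_t
loss.backward()          # gradient with respect to log_prob is -G_t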
We'll use the classic CartPole-v1 environment from the Gymnasium library. In this environment, the goal is to balance a pole on a cart by moving the cart left or right. The state is represented by four continuous values (cart position, cart velocity, pole angle, pole angular velocity), and the action space is discrete (0 for left, 1 for right). This is a good fit for REINFORCE: the policy must map a continuous state vector to a probability distribution over two discrete actions, which a small neural network handles naturally.
First, ensure you have Gymnasium and a deep learning library like PyTorch installed.
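For example, the packages used below can typically be installed with pip (exact versions will depend on your environment):

pip install gymnasium torch numpy matplotlib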
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from collections import namedtuple
import matplotlib.pyplot as plt
# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
# Create the environment
env = gym.make('CartPole-v1')
# Get state and action dimensions
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
print(f"State dimensions: {state_dim}")
print(f"Action dimensions: {action_dim}")
Now, let's define our policy network. A simple Multi-Layer Perceptron (MLP) should suffice for CartPole. This network will take the state as input and output the probabilities for each possible action (left or right).
class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, action_dim)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        # Output raw scores (logits) for actions
        action_scores = self.fc2(x)
        # Convert scores to probabilities using softmax
        # Use dim=-1 to apply softmax across the action dimension
        return F.softmax(action_scores, dim=-1)
# Instantiate the policy network and optimizer
policy_net = PolicyNetwork(state_dim, action_dim).to(device)
optimizer = optim.Adam(policy_net.parameters(), lr=1e-3) # Learning rate
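As a quick optional check, you can pass a dummy state through the untrained network and confirm it produces a valid probability distribution over the two actions:

# Optional: the untrained policy should output action probabilities that sum to 1
dummy_state = torch.zeros(1, state_dim, device=device)
with torch.no_grad():
    action_probs = policy_net(dummy_state)
print(action_probs)        # e.g. tensor([[0.52, 0.48]]) -- exact values depend on initialization
print(action_probs.sum())  # approximately 1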
The core of REINFORCE involves interacting with the environment to collect trajectories (sequences of states, actions, and rewards) and then using these trajectories to update the policy network.
Let's structure the training loop:
# Define a structure to store per-step data from the episode
# (only log_prob is used in this basic implementation; 'value' is kept as a placeholder)
SavedAction = namedtuple('SavedAction', ['log_prob', 'value'])

def select_action(state):
    """Selects an action by sampling from the policy network's output probabilities."""
    state = torch.from_numpy(state).float().unsqueeze(0).to(device)
    probs = policy_net(state)
    # Create a categorical distribution over the action probabilities
    m = torch.distributions.Categorical(probs)
    # Sample an action from the distribution
    action = m.sample()
    # Store the log probability of the sampled action (the 'value' slot is left at 0)
    policy_net.saved_action = SavedAction(m.log_prob(action), 0)
    return action.item()
def calculate_returns(rewards, gamma=0.99):
    """Calculates discounted returns for an episode."""
    R = 0
    returns = []
    # Iterate backwards through the rewards
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)  # Prepend to maintain chronological order
    # Normalize returns (optional but often helpful)
    returns = torch.tensor(returns, device=device)
    returns = (returns - returns.mean()) / (returns.std() + np.finfo(np.float32).eps.item())
    return returns
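
# Quick check of the discounting logic with illustrative numbers: for rewards [1, 1, 1]
# and gamma = 0.9, iterating backwards gives
#   G_2 = 1.0
#   G_1 = 1 + 0.9 * 1.0 = 1.9
#   G_0 = 1 + 0.9 * 1.9 = 2.71
# calculate_returns then normalizes these values to zero mean and (roughly) unit variance.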

def finish_episode(episode_rewards):
    """Performs the REINFORCE update at the end of an episode."""
    policy_loss = []
    returns = calculate_returns(episode_rewards)
    # Pair each saved log probability with its calculated return
    for (log_prob, _), G_t in zip(policy_net.all_saved_actions, returns):
        # REINFORCE objective: maximize log_prob * G_t
        # Loss is the negative objective
        policy_loss.append(-log_prob * G_t)
    # Sum the losses for the episode and update the network
    optimizer.zero_grad()                  # Reset gradients
    loss = torch.stack(policy_loss).sum()  # Combine per-step losses
    loss.backward()                        # Compute gradients
    optimizer.step()                       # Update network weights
    # Clear saved actions for the next episode
    del policy_net.all_saved_actions[:]
# Training parameters
num_episodes = 1000
gamma = 0.99
log_interval = 50             # Print status every 50 episodes
max_steps_per_episode = 1000  # Prevent excessively long episodes

all_episode_rewards = []
episode_durations = []

# Main training loop
for i_episode in range(num_episodes):
    state, _ = env.reset()
    episode_rewards = []
    policy_net.all_saved_actions = []  # Store log_probs for the episode

    for t in range(max_steps_per_episode):
        action = select_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        policy_net.all_saved_actions.append(policy_net.saved_action)
        episode_rewards.append(reward)
        state = next_state

        if done:
            break

    # Episode finished, update policy
    finish_episode(episode_rewards)

    # Logging
    total_reward = sum(episode_rewards)
    all_episode_rewards.append(total_reward)
    episode_durations.append(t + 1)

    if i_episode % log_interval == 0:
        avg_reward = np.mean(all_episode_rewards[-log_interval:])
        print(f'Episode {i_episode}\tAverage Reward (last {log_interval}): {avg_reward:.2f}\tLast Duration: {t+1}')

    # Optional: stop training if the problem is solved
    # CartPole-v1 is considered solved if the average reward exceeds 475 over 100 consecutive episodes
    if len(all_episode_rewards) > 100:
        if np.mean(all_episode_rewards[-100:]) > 475:
            print(f'\nSolved in {i_episode} episodes!')
            break

env.close()
After training, it's useful to plot the rewards per episode to see if the agent learned effectively.
# Plotting the results
plt.figure(figsize=(12, 6))
plt.plot(all_episode_rewards)
plt.title('Episode Rewards over Time (REINFORCE)')
plt.xlabel('Episode')
plt.ylabel('Total Reward')
# Calculate and plot rolling average
rolling_avg = np.convolve(all_episode_rewards, np.ones(100)/100, mode='valid')
plt.plot(np.arange(99, len(all_episode_rewards)), rolling_avg, label='100-episode rolling average', color='orange')
plt.legend()
plt.grid(True)
plt.show()
{"data":[{"type":"scatter","mode":"lines","name":"Episode Reward","x":[i for i in range(len(all_episode_rewards))], "y": all_episode_rewards, "line":{"color": "#339af0"}}, {"type":"scatter","mode":"lines","name":"100-episode Rolling Average","x":[i for i in range(99, len(all_episode_rewards))], "y": list(np.convolve(all_episode_rewards, np.ones(100)/100, mode='valid')), "line":{"color": "#fd7e14"}}], "layout":{"title":"Episode Rewards over Time (REINFORCE)", "xaxis":{"title":"Episode"}, "yaxis":{"title":"Total Reward"}, "template":"plotly_white"}}
Plot showing the total reward obtained in each episode during training, along with a 100-episode rolling average to visualize the learning trend.
This implementation demonstrates the basic REINFORCE algorithm. You should observe the agent's performance gradually improving, indicated by longer episode durations and higher total rewards.
Note that normalizing the returns (returns = (returns - returns.mean()) / (returns.std() + eps)) is a common technique that helps stabilize training. It centers the returns around zero, so actions leading to above-average returns increase in probability, while those leading to below-average returns decrease.
This hands-on example provides a foundation for understanding and implementing policy gradient methods. Feel free to experiment with different hyperparameters (learning rate, network architecture, discount factor) or try implementing a baseline to see how it affects performance.
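As a starting point for the baseline experiment, here is a minimal sketch (an illustrative variant, not part of the implementation above) that weights each log probability by $G_t - b$, using the episode's mean return as a constant baseline instead of fully normalizing:

# Sketch: REINFORCE update with a constant (mean-return) baseline
def finish_episode_with_baseline(episode_rewards, gamma=0.99):
    R = 0.0
    returns = []
    for r in reversed(episode_rewards):
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.tensor(returns, device=device)
    baseline = returns.mean()         # constant baseline b
    advantages = returns - baseline   # weighting by G_t - b reduces gradient variance
    policy_loss = []
    for (log_prob, _), adv in zip(policy_net.all_saved_actions, advantages):
        policy_loss.append(-log_prob * adv)
    optimizer.zero_grad()
    torch.stack(policy_loss).sum().backward()
    optimizer.step()
    del policy_net.all_saved_actions[:]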