Now that we have explored the theory behind policy gradients and the REINFORCE algorithm, it's time to put this knowledge into practice. This section provides a step-by-step guide to implementing the basic REINFORCE algorithm, often referred to as Monte Carlo Policy Gradient, using Python and a deep learning library like PyTorch or TensorFlow. We will apply it to a classic control problem, CartPole, available through the Gymnasium (formerly OpenAI Gym) library.

Our goal is to train an agent that learns a policy $\pi(a|s; \theta)$ to balance a pole on a cart for as long as possible. REINFORCE achieves this by adjusting the policy parameters $\theta$ based on the outcomes of complete episodes.

## Environment Setup

First, ensure you have Gymnasium installed (`pip install gymnasium[classic_control]`). We'll use the `CartPole-v1` environment. It provides observations (cart position, cart velocity, pole angle, pole angular velocity), expects discrete actions (0 for push left, 1 for push right), and gives a reward of +1 for every timestep the pole remains upright. An episode terminates if the pole angle exceeds a threshold, the cart moves too far from the center, or after 500 steps.

```python
# Example using Gymnasium
import gymnasium as gym

env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

print(f"State dimensions: {state_dim}")    # Output: 4
print(f"Action dimensions: {action_dim}")  # Output: 2
```

## The Policy Network

We need a function approximator to represent our policy $\pi(a|s; \theta)$. A simple feedforward neural network is suitable for CartPole. The network takes the state as input and outputs the probabilities for each possible action.

- **Input layer:** Size matches the state dimension (4 for CartPole).
- **Hidden layer(s):** One or more layers with activation functions (e.g., ReLU) to introduce non-linearity. A size like 64 or 128 neurons often works well.
- **Output layer:** Size matches the number of discrete actions (2 for CartPole). A softmax activation function is applied to ensure the outputs represent a valid probability distribution over actions.

```python
# Example Policy Network Structure (using PyTorch)
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, action_dim)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        action_probs = F.softmax(self.fc2(x), dim=-1)
        return action_probs
```

## The REINFORCE Agent Logic

Let's break down the core components of the REINFORCE agent.

**Initialization:** Create an instance of the `PolicyNetwork`. Choose an optimizer, such as Adam, to update the network's weights ($\theta$). Set a learning rate.

```python
# Example Initialization
policy_net = PolicyNetwork(state_dim, action_dim)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
gamma = 0.99  # Discount factor
```
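Before wiring up action selection, a quick optional sanity check can catch shape or activation mistakes early. The snippet below is illustrative only: `dummy_state` is a made-up all-zeros placeholder rather than a real observation, and the exact probabilities depend on the random initialization.

```python
# Optional sanity check: the untrained policy should already output a
# valid probability distribution over the two actions for any input.
dummy_state = torch.zeros(1, state_dim)  # placeholder "state", batch size 1
with torch.no_grad():
    probs = policy_net(dummy_state)

print(probs)               # e.g. tensor([[0.47, 0.53]]); values depend on the random init
print(probs.sum().item())  # approximately 1.0
```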
**Action Selection:** Given a state $s$, pass it through the `policy_net` to get action probabilities. Sample an action $a$ from this probability distribution. In PyTorch, you can use `torch.distributions.Categorical`. Store the log probability $\log \pi(a|s; \theta)$ of the chosen action; we'll need it for the update.

```python
# Example Action Selection
def select_action(state, policy_net):
    state_tensor = torch.FloatTensor(state).unsqueeze(0)  # Add batch dimension
    action_probs = policy_net(state_tensor)
    distribution = torch.distributions.Categorical(action_probs)
    action = distribution.sample()
    log_prob = distribution.log_prob(action)
    return action.item(), log_prob
```

**Episode Rollout:** Run a full episode in the environment:

1. Reset the environment to get the initial state $s_0$.
2. Loop until the episode terminates:
   - Select action $a_t$ using the current policy network and state $s_t$. Keep track of the log probability $\log \pi(a_t|s_t; \theta)$.
   - Execute action $a_t$ in the environment to get the next state $s_{t+1}$, reward $r_{t+1}$, and termination signal.
   - Store the reward $r_{t+1}$ and the log probability $\log \pi(a_t|s_t; \theta)$.
   - Update the current state $s_t \leftarrow s_{t+1}$.
3. Keep lists of all log probabilities and rewards collected during the episode.

**Return Calculation:** After the episode finishes (at step $T$), calculate the discounted return $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_{k+1}$ for each timestep $t$ in the episode. A common way to compute this efficiently is to iterate backward from the end of the episode:

- Initialize $G_T = 0$.
- For $t = T-1, T-2, \dots, 0$: $G_t = r_{t+1} + \gamma G_{t+1}$.

It's often beneficial to normalize the returns (e.g., subtract the mean and divide by the standard deviation) across the episode to stabilize training.

```python
# Example Return Calculation
def calculate_returns(rewards, gamma):
    returns = []
    discounted_return = 0
    for r in reversed(rewards):
        discounted_return = r + gamma * discounted_return
        returns.insert(0, discounted_return)  # Prepend to keep order
    returns = torch.tensor(returns)
    # Normalize returns (optional but recommended)
    returns = (returns - returns.mean()) / (returns.std() + 1e-9)
    return returns
```
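To make the backward recursion concrete, here is a tiny worked example with made-up numbers: a hypothetical three-step episode where every reward is 1 and $\gamma = 0.9$. It computes only the raw discounted returns; `calculate_returns` above additionally normalizes them.

```python
# Hypothetical 3-step episode: rewards [1.0, 1.0, 1.0], gamma = 0.9.
# Working backward: G_2 = 1.0, G_1 = 1.0 + 0.9 * 1.0 = 1.9, G_0 = 1.0 + 0.9 * 1.9 = 2.71.
rewards_example = [1.0, 1.0, 1.0]
gamma_example = 0.9

raw_returns = []
g = 0.0
for r in reversed(rewards_example):
    g = r + gamma_example * g
    raw_returns.insert(0, g)

print([round(g, 4) for g in raw_returns])  # [2.71, 1.9, 1.0]
```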
**Loss Calculation and Gradient Update:** Compute the REINFORCE loss. The objective is to maximize expected return, so we perform gradient ascent. Most deep learning libraries implement gradient descent, so we minimize the negative of the objective function. The loss for one episode is:

$$ L(\theta) = - \sum_{t=0}^{T-1} \log \pi(a_t|s_t; \theta) G_t $$

Calculate this sum using the stored log probabilities and calculated returns from the episode. Then, perform backpropagation and update the network parameters using the optimizer.

```python
# Example Update Step
def update_policy(log_probs, returns, optimizer):
    loss = []
    for log_prob, Gt in zip(log_probs, returns):
        loss.append(-log_prob * Gt)  # Negative sign for gradient ascent via minimization
    optimizer.zero_grad()
    policy_loss = torch.stack(loss).sum()  # Sum losses over the episode
    policy_loss.backward()
    optimizer.step()
```

## The Training Loop

The overall training process involves running multiple episodes and updating the policy network after each one.

```python
# Simplified Training Loop Structure
num_episodes = 1000
episode_rewards = []

for episode in range(num_episodes):
    state, _ = env.reset()
    episode_log_probs = []
    episode_rewards_raw = []
    terminated = False
    truncated = False

    while not terminated and not truncated:
        action, log_prob = select_action(state, policy_net)
        next_state, reward, terminated, truncated, _ = env.step(action)
        episode_log_probs.append(log_prob)
        episode_rewards_raw.append(reward)
        state = next_state

    # Calculate returns and update policy after the episode ends
    returns = calculate_returns(episode_rewards_raw, gamma)
    update_policy(episode_log_probs, returns, optimizer)

    total_episode_reward = sum(episode_rewards_raw)
    episode_rewards.append(total_episode_reward)

    if (episode + 1) % 50 == 0:
        print(f"Episode {episode+1}, Average Reward (last 50): {sum(episode_rewards[-50:])/50:.2f}")

env.close()
```

## Visualizing Training Progress

Plotting the total reward per episode (or a moving average) is essential to see if the agent is learning.

*Figure: REINFORCE training progress on CartPole-v1, plotting episode number against the average reward per 50-episode window. A typical learning curve shows the average reward improving over time; the maximum possible reward is 500.*

## Incorporating a Baseline

As discussed previously, REINFORCE suffers from high variance. Subtracting a baseline $b(s_t)$ from the return $G_t$ can significantly reduce this variance without introducing bias. A common baseline is the state-value function $V(s_t)$. We can estimate $V(s_t)$ using another neural network (the "critic") trained to predict the expected return from state $s_t$.

The modified update objective becomes minimizing:

$$ L(\theta) = - \sum_{t=0}^{T-1} \log \pi(a_t|s_t; \theta) (G_t - V(s_t; \phi)) $$

where $V(s_t; \phi)$ is the value estimated by the critic network with parameters $\phi$. The critic network itself is typically trained using supervised learning, minimizing the squared error between its predictions $V(s_t; \phi)$ and the actual calculated returns $G_t$. This architecture leads us towards Actor-Critic methods, which we will explore in the next chapter.

Even a simple baseline, like the average return over the episode, can sometimes help. For this practical, we focused on the core REINFORCE algorithm. Experimenting with baselines is a valuable next step.
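If you do want to try a learned baseline, the sketch below shows one way it could be wired in. Treat it as a minimal illustration rather than a tested implementation: `ValueNetwork`, `value_net`, `value_optimizer`, and `update_policy_with_baseline` are names introduced here for this example, the rollout would also need to record each visited state $s_t$, and you would typically pass the raw (unnormalized) discounted returns, since the critic is meant to predict $G_t$ itself.

```python
# Minimal baseline sketch (illustrative, not tuned): a small state-value "critic"
# and a policy update that uses the advantage G_t - V(s_t) in place of G_t.
class ValueNetwork(nn.Module):
    def __init__(self, state_dim, hidden_dim=128):
        super(ValueNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, 1)

    def forward(self, state):
        # One scalar value estimate V(s; phi) per state in the batch
        return self.fc2(F.relu(self.fc1(state))).squeeze(-1)

value_net = ValueNetwork(state_dim)
value_optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def update_policy_with_baseline(log_probs, returns, states, optimizer):
    # `states` is assumed to be the list of states visited during the episode
    states_tensor = torch.stack([torch.FloatTensor(s) for s in states])
    values = value_net(states_tensor)  # V(s_t; phi) for every timestep

    # Critic update: regress V(s_t) toward the observed returns G_t
    value_loss = F.mse_loss(values, returns)
    value_optimizer.zero_grad()
    value_loss.backward()
    value_optimizer.step()

    # Policy update: use the advantage G_t - V(s_t), detached so the policy
    # loss does not backpropagate into the critic
    advantages = returns - values.detach()
    policy_loss = torch.stack(
        [-log_prob * adv for log_prob, adv in zip(log_probs, advantages)]
    ).sum()
    optimizer.zero_grad()
    policy_loss.backward()
    optimizer.step()
```

Swapping this in for `update_policy` in the training loop (and collecting the states during the rollout) gives a first taste of the Actor-Critic structure mentioned above.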
This hands-on example demonstrates the fundamental mechanics of the REINFORCE algorithm. While simple, it highlights how we can directly optimize a policy using gradient ascent based on sampled trajectories and their returns. Remember that tuning hyperparameters (learning rate, network architecture, discount factor, normalization) is often necessary to achieve good performance.