Okay, let's translate theory into practice. We've discussed how Deep Q-Networks use neural networks to approximate Q-values, overcoming the limitations of tabular methods for large state spaces. Now, we'll implement a DQN agent to solve the classic CartPole control problem using the Gymnasium library and PyTorch.
The CartPole environment is a standard benchmark in RL. A pole is hinged to a cart that moves along a frictionless track; the agent observes a 4-dimensional state (cart position, cart velocity, pole angle, pole angular velocity) and chooses one of two discrete actions (push the cart left or right). It receives a reward of +1 for every timestep the pole stays upright, and the episode ends when the pole tilts too far, the cart leaves the track, or the step limit is reached (500 steps in CartPole-v1).
Our objective is to train a DQN agent that learns a policy maximizing the total reward, effectively keeping the pole balanced for as long as possible.
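Before writing any agent code, it can help to look at the environment directly. A quick, optional sanity check with Gymnasium confirms the observation and action spaces described above:

import gymnasium as gym

# Create the environment and inspect its spaces
env = gym.make("CartPole-v1")
print(env.observation_space)  # Box with shape (4,): position, velocity, angle, angular velocity
print(env.action_space)       # Discrete(2): 0 = push left, 1 = push right

state, info = env.reset(seed=0)
print(state.shape)            # (4,)
env.close()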
Before we begin, make sure you have the necessary libraries installed:
pip install gymnasium torch numpy matplotlib
We'll use gymnasium for the environment, torch for building and training the neural network, numpy for numerical operations, and matplotlib (or a similar library) for visualizing the results later.
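If you'd like to verify the installation, a quick import and version check is enough (the exact version numbers will vary on your machine):

import gymnasium as gym
import torch
import numpy as np
import matplotlib

# Print installed versions (values will differ depending on your setup)
print("gymnasium:", gym.__version__)
print("torch:", torch.__version__)
print("numpy:", np.__version__)
print("matplotlib:", matplotlib.__version__)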
Our DQN implementation requires several key pieces we discussed earlier: the Q-Network, the Target Network (a copy of the Q-Network), and an Experience Replay buffer.
We need a neural network that takes a state representation as input and outputs the estimated Q-values for each possible action. Since the CartPole state space is relatively small (4 dimensions) and actions are discrete (2 actions), a simple Multi-Layer Perceptron (MLP) will suffice.
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Q-Network: maps states to action values."""

    def __init__(self, state_size, action_size, seed, fc1_units=64, fc2_units=64):
        """Initialize parameters and build model.
        Params
        ======
            state_size (int): Dimension of each state
            action_size (int): Dimension of each action
            seed (int): Random seed
            fc1_units (int): Number of nodes in first hidden layer
            fc2_units (int): Number of nodes in second hidden layer
        """
        super(QNetwork, self).__init__()
        self.seed = torch.manual_seed(seed)
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)

    def forward(self, state):
        """Build a network that maps state -> action values."""
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
This network has an input layer matching the state size (4), two hidden layers with 64 units each and ReLU activations, and an output layer with units equal to the action size (2), providing the Q-values for pushing the cart left and right.
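As a quick sanity check of the architecture (a minimal sketch with arbitrary dummy inputs), we can instantiate the network and pass a batch of fake states through it:

# Shape check: a batch of 5 random "states" should yield 5 x 2 Q-values
net = QNetwork(state_size=4, action_size=2, seed=0)
dummy_states = torch.rand(5, 4)   # 5 four-dimensional states
q_values = net(dummy_states)
print(q_values.shape)             # torch.Size([5, 2])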
To store transitions and sample them for learning, we implement a replay buffer. A collections.deque is often used for efficient appending and popping.
import random
import torch
import numpy as np
from collections import deque, namedtuple

class ReplayBuffer:
    """Fixed-size buffer to store experience tuples."""

    def __init__(self, action_size, buffer_size, batch_size, seed, device):
        """Initialize a ReplayBuffer object.
        Params
        ======
            action_size (int): dimension of each action
            buffer_size (int): maximum size of buffer
            batch_size (int): size of each training batch
            seed (int): random seed
            device (string): 'cpu' or 'cuda'
        """
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
        self.seed = random.seed(seed)
        self.device = device

    def add(self, state, action, reward, next_state, done):
        """Add a new experience to memory."""
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)

    def sample(self):
        """Randomly sample a batch of experiences from memory."""
        experiences = random.sample(self.memory, k=self.batch_size)
        # Convert the batch of Experiences to tensors on the specified device
        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(self.device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).long().to(self.device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(self.device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(self.device)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(self.device)
        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        """Return the current size of internal memory."""
        return len(self.memory)
This buffer stores Experience tuples and provides methods to add a new experience and sample a random batch for training.
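As a small illustration (with made-up transition values), adding a few dummy transitions and drawing a batch looks like this:

# Minimal usage sketch: fill the buffer with dummy transitions, then sample
buffer = ReplayBuffer(action_size=2, buffer_size=1000, batch_size=4, seed=0, device='cpu')

for _ in range(10):
    state = np.random.rand(4).astype(np.float32)
    next_state = np.random.rand(4).astype(np.float32)
    buffer.add(state, action=0, reward=1.0, next_state=next_state, done=False)

states, actions, rewards, next_states, dones = buffer.sample()
print(states.shape, actions.shape)  # torch.Size([4, 4]) torch.Size([4, 1])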
Now we combine these components into a single Agent class. This class manages the Q-Network, Target Network, Replay Buffer, and the learning process.
import numpy as np
import random
from collections import namedtuple, deque

# Define Hyperparameters (example values)
BUFFER_SIZE = int(1e5)  # Replay buffer size
BATCH_SIZE = 64         # Minibatch size
GAMMA = 0.99            # Discount factor
TAU = 1e-3              # For soft update of target parameters
LR = 5e-4               # Learning rate
UPDATE_EVERY = 4        # How often to update the network

# Check for GPU availability
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

class Agent():
    """Interacts with and learns from the environment."""

    def __init__(self, state_size, action_size, seed):
        """Initialize an Agent object.
        Params
        ======
            state_size (int): dimension of each state
            action_size (int): dimension of each action
            seed (int): random seed
        """
        self.state_size = state_size
        self.action_size = action_size
        self.seed = random.seed(seed)

        # Q-Network
        self.qnetwork_local = QNetwork(state_size, action_size, seed).to(device)
        self.qnetwork_target = QNetwork(state_size, action_size, seed).to(device)
        self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=LR)

        # Replay memory
        self.memory = ReplayBuffer(action_size, BUFFER_SIZE, BATCH_SIZE, seed, device)
        # Initialize time step (for updating every UPDATE_EVERY steps)
        self.t_step = 0

    def step(self, state, action, reward, next_state, done):
        # Save experience in replay memory
        self.memory.add(state, action, reward, next_state, done)

        # Learn every UPDATE_EVERY time steps.
        self.t_step = (self.t_step + 1) % UPDATE_EVERY
        if self.t_step == 0:
            # If enough samples are available in memory, get a random subset and learn
            if len(self.memory) > BATCH_SIZE:
                experiences = self.memory.sample()
                self.learn(experiences, GAMMA)

    def act(self, state, eps=0.):
        """Returns actions for given state as per current policy.
        Params
        ======
            state (array_like): current state
            eps (float): epsilon, for epsilon-greedy action selection
        """
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        self.qnetwork_local.eval()  # Set network to evaluation mode
        with torch.no_grad():
            action_values = self.qnetwork_local(state)
        self.qnetwork_local.train()  # Set network back to training mode

        # Epsilon-greedy action selection
        if random.random() > eps:
            # Choose the best action (exploitation)
            return np.argmax(action_values.cpu().data.numpy())
        else:
            # Choose a random action (exploration)
            return random.choice(np.arange(self.action_size))

    def learn(self, experiences, gamma):
        """Update value parameters using given batch of experience tuples.
        Params
        ======
            experiences (Tuple[torch.Tensor]): tuple of (s, a, r, s', done) tuples
            gamma (float): discount factor
        """
        states, actions, rewards, next_states, dones = experiences

        # Get max predicted Q values (for next states) from the target model.
        # detach() keeps gradients from flowing into the target network.
        Q_targets_next = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)
        # Compute Q targets for current states:
        # Target = Reward + Gamma * Q_Target(next_state, max_action) * (1 - done)
        # If done is 1, the future reward is 0
        Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))

        # Get expected Q values from the local model:
        # we want the Q value for the action that was actually taken
        Q_expected = self.qnetwork_local(states).gather(1, actions)

        # Compute loss (Mean Squared Error; Huber loss is a common alternative)
        loss = F.mse_loss(Q_expected, Q_targets)
        # Minimize the loss
        self.optimizer.zero_grad()  # Clear previous gradients
        loss.backward()             # Compute gradients
        self.optimizer.step()       # Update weights

        # ------------------- update target network ------------------- #
        self.soft_update(self.qnetwork_local, self.qnetwork_target, TAU)

    def soft_update(self, local_model, target_model, tau):
        """Soft update model parameters.
        θ_target = τ*θ_local + (1 - τ)*θ_target
        Params
        ======
            local_model (PyTorch model): weights will be copied from
            target_model (PyTorch model): weights will be copied to
            tau (float): interpolation parameter
        """
        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
            target_param.data.copy_(tau*local_param.data + (1.0-tau)*target_param.data)
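Before walking through these methods, here is a tiny standalone sketch (with made-up tensors) of the two indexing tricks used in learn(): .gather(1, actions) picks out the Q-value of the action actually taken in each state, and the (1 - dones) mask zeroes the bootstrap term for terminal transitions.

# Toy illustration of the indexing used in learn()
q_all = torch.tensor([[1.0, 2.0],   # Q-values for two states, two actions each
                      [3.0, 4.0]])
actions = torch.tensor([[1], [0]])  # actions actually taken
q_taken = q_all.gather(1, actions)  # tensor([[2.], [3.]])

rewards = torch.tensor([[1.0], [1.0]])
q_next_max = torch.tensor([[5.0], [6.0]])
dones = torch.tensor([[0.0], [1.0]])                 # second transition is terminal
targets = rewards + 0.99 * q_next_max * (1 - dones)  # tensor([[5.95], [1.00]])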
Key aspects of the Agent class:

- Initialization (__init__): Creates both the local and target Q-Networks (initially identical), sets up the Adam optimizer for the local network, and initializes the replay buffer.
- Stepping (step): Called at each timestep. It stores the experience (s, a, r, s', done) in the buffer and triggers the learning process every UPDATE_EVERY steps, provided the buffer is sufficiently full.
- Acting (act): Implements the epsilon-greedy policy. With probability epsilon, it chooses a random action (exploration); otherwise, it queries the local Q-network for the action with the highest estimated Q-value (exploitation). Note the use of eval() mode during action selection to disable dropout or batch normalization updates.
- Learning (learn): The core training logic. It computes the targets from the target network, where the (1 - dones) term ensures the future value is zero for terminal states and detach() prevents gradients from flowing into the target network parameters during this calculation. The .gather(1, actions) call selects the Q-values corresponding to the specific actions stored in the batch. The mean squared error loss between Q_targets and Q_expected is minimized with the Adam optimizer, and soft_update then slowly blends the weights of the local network into the target network.
- Target update (soft_update): Gradually updates the target network's weights towards the local network's weights using the parameter TAU. This provides more stability than directly copying weights infrequently.

Finally, we need the main script to initialize the environment and the agent, and run the training episodes.
import gymnasium as gym
from collections import deque
import matplotlib.pyplot as plt

# Initialize environment and agent
env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
agent = Agent(state_size=state_size, action_size=action_size, seed=0)

def train_dqn(n_episodes=2000, max_t=1000, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Deep Q-Learning Training Loop.
    Params
    ======
        n_episodes (int): maximum number of training episodes
        max_t (int): maximum number of timesteps per episode
        eps_start (float): starting value of epsilon, for epsilon-greedy action selection
        eps_end (float): minimum value of epsilon
        eps_decay (float): multiplicative factor (per episode) for decreasing epsilon
    """
    scores = []                        # list containing scores from each episode
    scores_window = deque(maxlen=100)  # last 100 scores
    eps = eps_start                    # initialize epsilon
    solved_threshold = 195.0           # classic CartPole-v0 solved criterion (CartPole-v1's official threshold is 475.0)

    for i_episode in range(1, n_episodes+1):
        state, info = env.reset()
        score = 0
        for t in range(max_t):
            action = agent.act(state, eps)
            next_state, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated  # Episode ends if terminated or truncated
            agent.step(state, action, reward, next_state, done)
            state = next_state
            score += reward
            if done:
                break
        scores_window.append(score)        # save most recent score
        scores.append(score)               # save most recent score
        eps = max(eps_end, eps_decay*eps)  # decrease epsilon

        print(f'\rEpisode {i_episode}\tAverage Score: {np.mean(scores_window):.2f}', end="")
        if i_episode % 100 == 0:
            print(f'\rEpisode {i_episode}\tAverage Score: {np.mean(scores_window):.2f}')
        if np.mean(scores_window) >= solved_threshold:
            print(f'\nEnvironment solved in {i_episode-100:d} episodes!\tAverage Score: {np.mean(scores_window):.2f}')
            # Save the trained model weights
            torch.save(agent.qnetwork_local.state_dict(), 'dqn_cartpole_weights.pth')
            break
    return scores

# Start training
scores = train_dqn()

# Plot the scores
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(len(scores)), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.title('DQN Training Performance on CartPole-v1')
plt.grid(True)
plt.show()
This loop runs for a set number of episodes (n_episodes). In each episode, the environment is reset, the agent selects actions with act and applies them through env.step, and every transition is passed to agent.step, potentially triggering a learning update. Epsilon is decayed after each episode so the agent gradually shifts from exploration to exploitation. After training, a plot shows the score per episode, which should ideally show an upward trend as the agent learns.
(Figure: sample learning curve showing episode scores increasing over time and eventually crossing the 'solved' threshold.)
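Once training finishes, you can reload the saved weights and watch the greedy policy run. Here is a minimal evaluation sketch (assuming the training loop above has written dqn_cartpole_weights.pth):

# Evaluate the trained agent with a purely greedy policy (epsilon = 0)
eval_env = gym.make('CartPole-v1', render_mode='human')
agent.qnetwork_local.load_state_dict(torch.load('dqn_cartpole_weights.pth'))

state, info = eval_env.reset()
done, total_reward = False, 0.0
while not done:
    action = agent.act(state, eps=0.0)  # always exploit
    state, reward, terminated, truncated, info = eval_env.step(action)
    total_reward += reward
    done = terminated or truncated
print(f'Evaluation episode reward: {total_reward}')
eval_env.close()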
In this hands-on section, we implemented a functional Deep Q-Network agent from scratch. We defined the network architecture, the experience replay mechanism, and the agent logic incorporating epsilon-greedy action selection, learning from sampled batches, and target network updates. By applying it to the CartPole environment, we demonstrated how DQN can learn effective policies in environments where tabular methods would be impractical. This forms a solid foundation for tackling more complex problems and understanding the enhancements discussed in the next chapter.