Deep Q-Networks use neural networks to approximate Q-values, overcoming the limitations of tabular methods for large state spaces. We will implement a DQN agent to solve the classic CartPole control problem using the Gymnasium library and PyTorch.

The CartPole environment is a standard benchmark in RL. The goal is to balance a pole upright on a moving cart.

- State: a 4-dimensional vector: cart position, cart velocity, pole angle, pole angular velocity.
- Actions: two discrete actions: push cart left (0) or push cart right (1).
- Reward: +1 for every timestep the pole remains upright.
- Termination: an episode ends if the pole angle exceeds ±12 degrees, the cart moves more than ±2.4 units from the center, or the episode length reaches a predefined limit (500 steps for CartPole-v1 in recent Gymnasium versions).

Our objective is to train a DQN agent that learns a policy to maximize the total reward, effectively keeping the pole balanced for as long as possible.

Prerequisites

Before we begin, make sure you have the necessary libraries installed:

```
pip install gymnasium torch numpy matplotlib
```

We'll use gymnasium for the environment, torch for building and training the neural network, numpy for numerical operations, and matplotlib for visualizing the results.

Building the DQN Components

Our DQN implementation requires the pieces we discussed earlier: the Q-Network, the Target Network (a copy of the Q-Network), and an Experience Replay buffer.

1. The Q-Network

We need a neural network that takes a state representation as input and outputs the estimated Q-values for each possible action. Since the CartPole state space is small (4 dimensions) and the actions are discrete (2 actions), a simple Multi-Layer Perceptron (MLP) suffices.

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F


class QNetwork(nn.Module):
    """Q-Network that maps states to action values."""

    def __init__(self, state_size, action_size, seed, fc1_units=64, fc2_units=64):
        """Initialize parameters and build model.

        Params
        ======
            state_size (int): Dimension of each state
            action_size (int): Number of discrete actions
            seed (int): Random seed
            fc1_units (int): Number of nodes in first hidden layer
            fc2_units (int): Number of nodes in second hidden layer
        """
        super(QNetwork, self).__init__()
        self.seed = torch.manual_seed(seed)
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)

    def forward(self, state):
        """Build a network that maps state -> action values."""
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
```

This network has an input layer matching the state size (4), two hidden layers with 64 units each and ReLU activations, and an output layer with one unit per action (2), providing the Q-values for pushing left and pushing right.
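As a quick sanity check, you can instantiate the network and pass a dummy state through it. This short sketch is not part of the agent code that follows; it simply assumes the 4-dimensional state and 2 actions described above.

```python
# Sanity check: a 4-dimensional CartPole-like state in, two Q-values out.
net = QNetwork(state_size=4, action_size=2, seed=0)
dummy_state = torch.rand(1, 4)        # batch of one random state
with torch.no_grad():
    q_values = net(dummy_state)
print(q_values.shape)                 # expected: torch.Size([1, 2])
```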
2. Experience Replay Buffer

To store transitions and sample them for learning, we implement a replay buffer. A collections.deque is often used for efficient appending and popping.

```python
import random
import torch
import numpy as np
from collections import deque, namedtuple


class ReplayBuffer:
    """Fixed-size buffer to store experience tuples."""

    def __init__(self, action_size, buffer_size, batch_size, seed, device):
        """Initialize a ReplayBuffer object.

        Params
        ======
            action_size (int): dimension of each action
            buffer_size (int): maximum size of buffer
            batch_size (int): size of each training batch
            seed (int): random seed
            device (string): 'cpu' or 'cuda'
        """
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.experience = namedtuple("Experience",
                                     field_names=["state", "action", "reward", "next_state", "done"])
        self.seed = random.seed(seed)
        self.device = device

    def add(self, state, action, reward, next_state, done):
        """Add a new experience to memory."""
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)

    def sample(self):
        """Randomly sample a batch of experiences from memory."""
        experiences = random.sample(self.memory, k=self.batch_size)

        # Convert the batch of Experiences to tensors on the specified device
        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(self.device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).long().to(self.device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(self.device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(self.device)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(self.device)

        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        """Return the current size of internal memory."""
        return len(self.memory)
```

This buffer stores Experience tuples and provides methods to add a new experience and sample a random batch for training.
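If you want to see the buffer in isolation before wiring it into an agent, here is a minimal illustrative sketch. The shapes assume CartPole's 4-dimensional states and 2 actions, and the transitions are random placeholders rather than real environment data.

```python
# Minimal buffer demo: fill with random transitions, then sample a batch.
buffer = ReplayBuffer(action_size=2, buffer_size=1000, batch_size=8,
                      seed=0, device=torch.device("cpu"))

for _ in range(20):
    state = np.random.rand(4).astype(np.float32)
    next_state = np.random.rand(4).astype(np.float32)
    buffer.add(state, np.random.randint(2), 1.0, next_state, False)

states, actions, rewards, next_states, dones = buffer.sample()
print(states.shape, actions.shape)   # torch.Size([8, 4]) torch.Size([8, 1])
```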
The DQN Agent

Now we combine these components into a single Agent class. This class manages the Q-Network, the Target Network, the Replay Buffer, and the learning process.

```python
import numpy as np
import random
from collections import namedtuple, deque

# Define hyperparameters (example values)
BUFFER_SIZE = int(1e5)  # Replay buffer size
BATCH_SIZE = 64         # Minibatch size
GAMMA = 0.99            # Discount factor
TAU = 1e-3              # For soft update of target parameters
LR = 5e-4               # Learning rate
UPDATE_EVERY = 4        # How often to update the network

# Check for GPU availability
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


class Agent():
    """Interacts with and learns from the environment."""

    def __init__(self, state_size, action_size, seed):
        """Initialize an Agent object.

        Params
        ======
            state_size (int): dimension of each state
            action_size (int): number of discrete actions
            seed (int): random seed
        """
        self.state_size = state_size
        self.action_size = action_size
        self.seed = random.seed(seed)

        # Q-Network (local) and Target Network
        self.qnetwork_local = QNetwork(state_size, action_size, seed).to(device)
        self.qnetwork_target = QNetwork(state_size, action_size, seed).to(device)
        self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=LR)

        # Replay memory
        self.memory = ReplayBuffer(action_size, BUFFER_SIZE, BATCH_SIZE, seed, device)
        # Initialize time step (for updating every UPDATE_EVERY steps)
        self.t_step = 0

    def step(self, state, action, reward, next_state, done):
        # Save experience in replay memory
        self.memory.add(state, action, reward, next_state, done)

        # Learn every UPDATE_EVERY time steps.
        self.t_step = (self.t_step + 1) % UPDATE_EVERY
        if self.t_step == 0:
            # If enough samples are available in memory, get a random subset and learn
            if len(self.memory) > BATCH_SIZE:
                experiences = self.memory.sample()
                self.learn(experiences, GAMMA)

    def act(self, state, eps=0.):
        """Returns actions for given state as per current policy.

        Params
        ======
            state (array_like): current state
            eps (float): epsilon, for epsilon-greedy action selection
        """
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        self.qnetwork_local.eval()   # Set network to evaluation mode
        with torch.no_grad():
            action_values = self.qnetwork_local(state)
        self.qnetwork_local.train()  # Set network back to training mode

        # Epsilon-greedy action selection
        if random.random() > eps:
            # Choose the best action (exploitation)
            return np.argmax(action_values.cpu().data.numpy())
        else:
            # Choose a random action (exploration)
            return random.choice(np.arange(self.action_size))

    def learn(self, experiences, gamma):
        """Update value parameters using given batch of experience tuples.

        Params
        ======
            experiences (Tuple[torch.Tensor]): tuple of (s, a, r, s', done) tensors
            gamma (float): discount factor
        """
        states, actions, rewards, next_states, dones = experiences

        # Get max predicted Q values (for next states) from the target model.
        # detach() removes Q_targets_next from the graph -> no gradients flow into the target network
        Q_targets_next = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)
        # Compute Q targets for current states:
        # Target = Reward + Gamma * Q_Target(next_state, max_action) * (1 - done)
        # If done is 1, the future return is 0
        Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))

        # Get expected Q values from the local model:
        # we want the Q value for the action that was actually taken
        Q_expected = self.qnetwork_local(states).gather(1, actions)

        # Compute loss (Mean Squared Error or Huber loss)
        loss = F.mse_loss(Q_expected, Q_targets)

        # Minimize the loss
        self.optimizer.zero_grad()  # Clear previous gradients
        loss.backward()             # Compute gradients
        self.optimizer.step()       # Update weights

        # ------------------- update target network ------------------- #
        self.soft_update(self.qnetwork_local, self.qnetwork_target, TAU)

    def soft_update(self, local_model, target_model, tau):
        """Soft update model parameters.

        θ_target = τ*θ_local + (1 - τ)*θ_target

        Params
        ======
            local_model (PyTorch model): weights will be copied from
            target_model (PyTorch model): weights will be copied to
            tau (float): interpolation parameter
        """
        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
            target_param.data.copy_(tau*local_param.data + (1.0-tau)*target_param.data)
```

Important aspects of the Agent:

- Initialization (__init__): Creates both the local and target Q-Networks (initially identical), sets up the Adam optimizer for the local network, and initializes the replay buffer.
- Step (step): Called at each timestep. It stores the experience (s, a, r, s', done) in the buffer and triggers the learning process every UPDATE_EVERY steps, provided the buffer is sufficiently full.
- Act (act): Implements the epsilon-greedy policy. With probability epsilon, it chooses a random action (exploration); otherwise, it queries the local Q-network for the action with the highest estimated Q-value (exploitation). Note the use of eval() mode during action selection to disable dropout or batch normalization updates.
- Learn (learn): This is the core training logic. It:
  1. Samples a batch of experiences.
  2. Calculates the target Q-values using the target network and the Bellman equation: $Target = r + \gamma \max_{a'} Q_{target}(s', a')$. The (1 - dones) term ensures the future value is zero for terminal states, and detach() prevents gradients from flowing into the target network parameters during this calculation.
  3. Calculates the predicted Q-values for the actions actually taken in the batch using the local network: $Predicted = Q_{local}(s, a)$. The .gather(1, actions) call selects the Q-values corresponding to the specific actions stored in the batch (see the tensor sketch after this list).
  4. Computes the loss (MSE loss) between Q_targets and Q_expected.
  5. Performs backpropagation and updates the weights of the local network using the optimizer.
  6. Calls soft_update to slowly blend the weights of the local network into the target network.
- Soft Update (soft_update): Gradually updates the target network's weights towards the local network's weights using the parameter TAU. This provides more stability than directly copying weights infrequently.
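To make the tensor shapes inside learn() concrete, the following standalone sketch (with made-up numbers, not from a real training run) reproduces the gather and Bellman-target computation for a toy batch of three transitions and two actions.

```python
# Toy batch: 3 transitions, 2 actions. Shapes mirror what ReplayBuffer.sample() returns.
q_local_out = torch.tensor([[1.0, 2.0],
                            [0.5, 0.1],
                            [3.0, 1.0]])           # Q_local(s, .), shape (3, 2)
actions = torch.tensor([[1], [0], [0]])            # actions taken, shape (3, 1)
q_expected = q_local_out.gather(1, actions)        # picks 2.0, 0.5, 3.0 -> shape (3, 1)

q_target_out = torch.tensor([[0.0, 4.0],
                             [1.0, 2.0],
                             [5.0, 0.0]])          # Q_target(s', .), shape (3, 2)
rewards = torch.tensor([[1.0], [1.0], [1.0]])
dones = torch.tensor([[0.0], [0.0], [1.0]])        # last transition is terminal
gamma = 0.99

q_targets_next = q_target_out.max(1)[0].unsqueeze(1)         # max over actions -> (3, 1)
q_targets = rewards + gamma * q_targets_next * (1 - dones)   # terminal row keeps only the reward
print(q_expected.squeeze())   # tensor([2.0000, 0.5000, 3.0000])
print(q_targets.squeeze())    # tensor([4.9600, 2.9800, 1.0000])
```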
The Training Loop

Finally, we need the main script to initialize the environment and the agent, and run the training episodes.

```python
import gymnasium as gym
from collections import deque
import matplotlib.pyplot as plt

# Initialize environment and agent
env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
agent = Agent(state_size=state_size, action_size=action_size, seed=0)


def train_dqn(n_episodes=2000, max_t=1000, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Deep Q-Learning training loop.

    Params
    ======
        n_episodes (int): maximum number of training episodes
        max_t (int): maximum number of timesteps per episode
        eps_start (float): starting value of epsilon, for epsilon-greedy action selection
        eps_end (float): minimum value of epsilon
        eps_decay (float): multiplicative factor (per episode) for decreasing epsilon
    """
    scores = []                        # list containing scores from each episode
    scores_window = deque(maxlen=100)  # last 100 scores
    eps = eps_start                    # initialize epsilon
    # Early-stopping target used here; Gymnasium's reward threshold for
    # CartPole-v1 is 475.0 (195.0 is the classic CartPole-v0 value)
    solved_threshold = 195.0

    for i_episode in range(1, n_episodes+1):
        state, info = env.reset()
        score = 0
        for t in range(max_t):
            action = agent.act(state, eps)
            next_state, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated  # Episode ends if terminated or truncated
            agent.step(state, action, reward, next_state, done)
            state = next_state
            score += reward
            if done:
                break
        scores_window.append(score)        # save most recent score
        scores.append(score)               # save most recent score
        eps = max(eps_end, eps_decay*eps)  # decrease epsilon

        print(f'\rEpisode {i_episode}\tAverage Score: {np.mean(scores_window):.2f}', end="")
        if i_episode % 100 == 0:
            print(f'\rEpisode {i_episode}\tAverage Score: {np.mean(scores_window):.2f}')
        if np.mean(scores_window) >= solved_threshold:
            print(f'\nEnvironment solved in {i_episode-100:d} episodes!\tAverage Score: {np.mean(scores_window):.2f}')
            # Save the trained model weights
            torch.save(agent.qnetwork_local.state_dict(), 'dqn_cartpole_weights.pth')
            break
    return scores


# Start training
scores = train_dqn()

# Plot the scores
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(len(scores)), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.title('DQN Training Performance on CartPole-v1')
plt.grid(True)
plt.show()
```

This loop runs for a set number of episodes (n_episodes). In each episode:

- The environment is reset.
- The agent interacts with the environment step-by-step (act, env.step).
- Each transition is processed by the agent (agent.step), potentially triggering a learning update.
- The score (total reward) is tracked.
- Epsilon is decayed after each episode to reduce exploration over time (with eps_decay=0.995, epsilon falls to roughly 0.61 after 100 episodes and reaches the 0.01 floor after roughly 920 episodes).
- Progress (average score over the last 100 episodes) is printed.
- If the average score reaches the chosen threshold (195 here, the classic CartPole-v0 bar; CartPole-v1 is officially considered solved at an average of 475), training stops and the learned weights are saved.

After training, a plot shows the score per episode, which should ideally show an upward trend as the agent learns.

[Figure: Sample learning curve showing episode scores increasing over time and eventually crossing the 'solved' threshold.]
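Once training has saved dqn_cartpole_weights.pth, you may want to watch the learned policy act greedily. The sketch below is illustrative rather than part of the training script above: it assumes the QNetwork, device, state_size, and action_size defined earlier, and uses Gymnasium's render_mode="human" to open a viewer window.

```python
# Evaluation sketch: reload the saved weights and run one greedy episode.
eval_env = gym.make('CartPole-v1', render_mode="human")
policy = QNetwork(state_size, action_size, seed=0).to(device)
policy.load_state_dict(torch.load('dqn_cartpole_weights.pth', map_location=device))
policy.eval()

state, info = eval_env.reset()
total_reward, done = 0.0, False
while not done:
    with torch.no_grad():
        q_values = policy(torch.from_numpy(state).float().unsqueeze(0).to(device))
    action = int(q_values.argmax(dim=1).item())   # always exploit (epsilon = 0)
    state, reward, terminated, truncated, info = eval_env.step(action)
    total_reward += reward
    done = terminated or truncated
eval_env.close()
print(f"Evaluation episode return: {total_reward}")
```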
Summary

In this hands-on section, we implemented a functional Deep Q-Network agent from scratch. We defined the network architecture, the experience replay mechanism, and the agent logic incorporating epsilon-greedy action selection, learning from sampled batches, and target network updates. By applying it to the CartPole environment, we demonstrated how DQN can learn effective policies in environments where tabular methods would be impractical. This forms a solid foundation for tackling more complex problems and understanding the enhancements discussed in the next chapter.