Now that we've explored the theoretical underpinnings of learning environment models and using them for planning, let's put these ideas into practice by building and experimenting with a simple model-based agent. We'll implement a basic version of the Dyna-Q algorithm, which elegantly integrates direct reinforcement learning (learning from real experience) with model learning and planning (learning from simulated experience).
This hands-on exercise aims to solidify your understanding of how a learned model can accelerate learning by allowing the agent to perform extra updates using simulated transitions. We'll use a straightforward environment to keep the focus squarely on the model-based mechanics.
To illustrate the core concepts without unnecessary complexity, we'll use a small, deterministic grid world. Imagine a 1x5 grid (5 states, numbered 0 to 4). The agent starts at state 0 and wants to reach the goal at state 4.
At each step the agent chooses between two actions, left and right; moving left from state 0 leaves it where it is. Reaching the goal yields a reward of +1 and ends the episode, and every other transition yields a reward of 0. Our agent needs to learn the optimal policy (always move right) to reach the goal in as few steps as possible.
Our Dyna-Q agent will consist of three main parts:
Direct RL: A standard Q-learning component that updates action values based on real interactions with the environment. The update rule is:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$
where $(s, a, r, s')$ is a transition experienced in the real environment. A quick numeric example of this update appears just after these three components.
Model Learning: A component that learns the transition function T(s,a)→s′ and the reward function R(s,a)→r. Since our grid world is deterministic, we can store the model simply as dictionaries mapping observed (s,a) pairs to their resulting (r,s′) outcomes.
model_T[s][a] = s'
model_R[s][a] = r
We only update the model for state-action pairs the agent has actually tried.
Planning: A component that uses the learned model to generate simulated experiences and perform additional Q-learning updates. The planning process typically involves:
1. Randomly sampling a state-action pair (s, a) that the agent has previously observed.
2. Querying the learned model for the predicted reward r and next state s'.
3. Applying the same Q-learning update as above, but using this simulated transition instead of a real one.
This planning step is repeated for a fixed number (n) of iterations after each real environment step.
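To make the Q-learning update concrete, here is a quick numeric example using the learning rate and discount factor from the code below ($\alpha = 0.1$, $\gamma = 0.95$); the specific numbers are purely illustrative. Suppose all Q-values start at 0, the agent is in state 3, takes the right action, receives $r = 1$, and lands in the terminal goal state 4 (whose value contributes nothing). The direct RL update gives

$$Q(3, \text{right}) \leftarrow 0 + 0.1\,[\,1 + 0.95 \cdot 0 - 0\,] = 0.1$$

A later update for state 2 then bootstraps from this value, $Q(2, \text{right}) \leftarrow 0 + 0.1\,[\,0 + 0.95 \cdot 0.1 - 0\,] = 0.0095$, and planning updates allow this kind of propagation to happen without spending additional real steps in the environment.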
Let's outline the core logic in pseudocode.
# Initialize Q(s, a) arbitrarily (e.g., to 0)
# Initialize Model_T and Model_R as empty dictionaries/structures
# Keep track of observed (state, action) pairs: observed_sa = set()
#
# Loop for each episode:
#     s = initial_state
#     Loop for each step of episode:
#         Choose action a from s using policy derived from Q (e.g., epsilon-greedy)
#         Take action a, observe real reward r and next state s'
#
#         # 1. Direct RL Update
#         Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a'(Q(s', a')) - Q(s, a)]
#
#         # 2. Model Learning
#         Update model: Store (s, a) -> (r, s')
#             model_T[s][a] = s'
#             model_R[s][a] = r
#         Add (s, a) to observed_sa
#
#         # 3. Planning (Repeat n times)
#         Loop n times:
#             Randomly sample (s_sim, a_sim) from observed_sa
#             Retrieve r_sim = model_R[s_sim][a_sim]
#             Retrieve s'_sim = model_T[s_sim][a_sim]
#             # Planning Update (Q-learning on simulated experience)
#             Q(s_sim, a_sim) <- Q(s_sim, a_sim) + alpha * [r_sim + gamma * max_a'(Q(s'_sim, a')) - Q(s_sim, a_sim)]
#
#         s = s'
#         If s is terminal, break inner loop (end episode)
Here are some Python snippets illustrating the key parts, assuming a simple NumPy-based implementation.
Initialization:
import numpy as np
import random
from collections import defaultdict
num_states = 5
num_actions = 2 # 0: left, 1: right
alpha = 0.1 # Learning rate
gamma = 0.95 # Discount factor
epsilon = 0.1 # Exploration rate
n_planning_steps = 10 # Number of planning steps per real step
# Initialize Q-table
q_table = np.zeros((num_states, num_actions))
# Model (using dictionaries for sparse storage)
# defaultdict allows easy addition of new states/actions
model_T = defaultdict(lambda: defaultdict(int))
model_R = defaultdict(lambda: defaultdict(float))
observed_sa_pairs = set()
# Simple deterministic environment function
def grid_step(state, action):
    if action == 0:  # left
        next_state = max(0, state - 1)
    else:  # right
        next_state = min(num_states - 1, state + 1)
    reward = 1.0 if next_state == num_states - 1 else 0.0
    terminal = (next_state == num_states - 1)
    return reward, next_state, terminal
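As a quick sanity check, we can call the environment function directly; the outcomes in the comments follow from the definition of grid_step above.

# Quick sanity check of the environment dynamics
print(grid_step(0, 0))  # (0.0, 0, False): moving left from state 0 stays in place
print(grid_step(0, 1))  # (0.0, 1, False): one step right, no reward yet
print(grid_step(3, 1))  # (1.0, 4, True): stepping into the goal ends the episode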
Main Loop Snippet (Inside episode loop):
# Assuming 'current_state' holds the current state
# Choose action using epsilon-greedy
if random.random() < epsilon:
    action = random.choice([0, 1])  # Explore
else:
    action = np.argmax(q_table[current_state])  # Exploit
# Take action in the real environment
reward, next_state, terminal = grid_step(current_state, action)
# 1. Direct RL Update
td_target = reward + gamma * np.max(q_table[next_state]) * (not terminal)
td_error = td_target - q_table[current_state, action]
q_table[current_state, action] += alpha * td_error
# 2. Model Learning
# Use tuples for dictionary keys if the state is more complex than an int.
# Because the environment is deterministic, we can simply overwrite the
# stored outcome with the latest observation.
model_T[current_state][action] = next_state
model_R[current_state][action] = reward
observed_sa_pairs.add((current_state, action))
# 3. Planning
if observed_sa_pairs:  # Only plan once the model contains at least one transition
    for _ in range(n_planning_steps):
        # Sample a previously observed state-action pair
        s_plan, a_plan = random.choice(list(observed_sa_pairs))
        # Query the model
        r_plan = model_R[s_plan][a_plan]
        s_prime_plan = model_T[s_plan][a_plan]
        terminal_plan = (s_prime_plan == num_states - 1)  # Infer terminal from state
        # Q-planning update (same rule as the direct update, but on simulated experience)
        td_target_plan = r_plan + gamma * np.max(q_table[s_prime_plan]) * (not terminal_plan)
        td_error_plan = td_target_plan - q_table[s_plan, a_plan]
        q_table[s_plan, a_plan] += alpha * td_error_plan

# Update state; the surrounding episode loop should end when 'terminal' is True
current_state = next_state
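To actually run experiments, this snippet needs to sit inside an outer episode loop. Below is one possible way to wire everything together, assuming the definitions from the initialization snippet are in scope; the function name train_dyna_q and the choice to return a list of steps per episode are our own conventions for this exercise, not part of any library.

def train_dyna_q(n_planning_steps, num_episodes=50, max_steps=100):
    """Train a tabular Dyna-Q agent and return the number of steps per episode."""
    q_table = np.zeros((num_states, num_actions))
    model_T = defaultdict(dict)
    model_R = defaultdict(dict)
    observed_sa_pairs = set()
    steps_per_episode = []

    for episode in range(num_episodes):
        current_state = 0  # start state
        for step in range(max_steps):
            # Epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.choice([0, 1])
            else:
                action = np.argmax(q_table[current_state])

            reward, next_state, terminal = grid_step(current_state, action)

            # 1. Direct RL update
            td_target = reward + gamma * np.max(q_table[next_state]) * (not terminal)
            q_table[current_state, action] += alpha * (td_target - q_table[current_state, action])

            # 2. Model learning (deterministic environment: store the latest outcome)
            model_T[current_state][action] = next_state
            model_R[current_state][action] = reward
            observed_sa_pairs.add((current_state, action))

            # 3. Planning on simulated transitions drawn from the model
            for _ in range(n_planning_steps):
                s_plan, a_plan = random.choice(list(observed_sa_pairs))
                r_plan = model_R[s_plan][a_plan]
                s_prime_plan = model_T[s_plan][a_plan]
                terminal_plan = (s_prime_plan == num_states - 1)
                td_target_plan = r_plan + gamma * np.max(q_table[s_prime_plan]) * (not terminal_plan)
                q_table[s_plan, a_plan] += alpha * (td_target_plan - q_table[s_plan, a_plan])

            current_state = next_state
            if terminal:
                break
        steps_per_episode.append(step + 1)
    return steps_per_episode

Calling train_dyna_q(0) gives a standard Q-learning baseline, while train_dyna_q(10) matches the Dyna-Q configuration used above.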
If you run this Dyna-Q agent and compare its learning speed (e.g., the number of steps per episode needed to reach the goal) against a standard Q-learning agent (equivalent to Dyna-Q with n_planning_steps = 0), you should observe that Dyna-Q learns significantly faster. The planning steps allow the agent to propagate value information learned from a single real step much more effectively through the state space via the model.
We can visualize this difference by plotting the number of steps taken per episode for both algorithms.
This chart shows illustrative learning curves comparing standard Q-learning and Dyna-Q on the simple grid task. Dyna-Q converges to the optimal policy (4 steps) much faster due to the planning updates.
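Here is a sketch of how such a comparison plot could be produced, assuming the train_dyna_q helper sketched above and that matplotlib is available; the exact curves will vary from run to run.

import matplotlib.pyplot as plt

q_learning_steps = train_dyna_q(n_planning_steps=0)   # standard Q-learning baseline
dyna_q_steps = train_dyna_q(n_planning_steps=10)      # Dyna-Q with 10 planning steps

plt.plot(q_learning_steps, label="Q-learning (n=0)")
plt.plot(dyna_q_steps, label="Dyna-Q (n=10)")
plt.xlabel("Episode")
plt.ylabel("Steps to reach goal")
plt.legend()
plt.show()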
This simple example demonstrates the core principle of Dyna-Q: using a learned model to supplement real experience with simulated experience, thereby accelerating learning.
Limitations & Considerations:
The agent shown here is deliberately minimal. The model is a lookup table that assumes deterministic transitions, so it does not scale to large or continuous state spaces and cannot represent stochastic dynamics without storing outcome frequencies. If the environment changes after the model has been learned, planning will keep propagating stale, incorrect predictions. Finally, planning is not free: each real step now costs n additional simulated updates, so its value depends on how expensive real experience is relative to computation.
This hands-on exercise provides a foundation for understanding more complex model-based algorithms. Techniques like trajectory sampling with learned models, integrating search algorithms like MCTS, and handling model uncertainty are extensions built upon these fundamental ideas. Experimenting with different values of n_planning_steps, or trying a slightly more complex (perhaps stochastic) environment, can provide further insights into the behavior of model-based agents.
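As a starting point for a stochastic variant, here is a minimal sketch in which the chosen action occasionally "slips" to its opposite; the noisy_grid_step name and the 10% slip probability are arbitrary choices for illustration. Note that with stochastic dynamics the single-outcome dictionary model becomes only an approximation, and a more faithful model would track empirical transition frequencies.

def noisy_grid_step(state, action, slip_prob=0.1):
    # With probability slip_prob, the opposite action is executed instead
    if random.random() < slip_prob:
        action = 1 - action
    return grid_step(state, action)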