Okay, let's translate the theory of Q-learning into practice. In the preceding sections, you learned how Q-learning operates as an off-policy, model-free, temporal-difference control algorithm. Its goal is to find the optimal action-value function, Q∗, by iteratively updating estimates based on experienced transitions. Now, we will walk through the core components needed to implement the Q-learning algorithm using Python.
We won't build a complex environment here; instead, we'll focus on the agent's learning logic. Assume we have an environment that provides the necessary interactions: given an action, it returns the next state, reward, and termination status. Many reinforcement learning libraries like Gymnasium (formerly OpenAI Gym) provide such standardized environment interfaces.
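For concreteness, here is a minimal interaction sketch using Gymnasium; FrozenLake-v1 is just one example of a small discrete environment, and the random policy exists only to exercise the interface.

import gymnasium as gym

env = gym.make("FrozenLake-v1")    # illustrative: any discrete-state, discrete-action environment works

state, info = env.reset()          # initial state (an integer index) plus an info dict
terminated = truncated = False

while not terminated and not truncated:
    action = env.action_space.sample()  # random action, just to show the step interface
    next_state, reward, terminated, truncated, info = env.step(action)
    state = next_state

env.close()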
At the heart of tabular Q-learning is the Q-table. This is simply a data structure, typically a matrix or a dictionary, that stores the estimated action-values Q(s,a) for every state-action pair. If our environment has ∣S∣ states and ∣A∣ actions, the Q-table will have dimensions ∣S∣×∣A∣.
The Q-table must be given initial values before learning starts. The most common approach is to initialize every Q-value to zero; optimistic initialization (starting with deliberately high values) is another option that encourages early exploration. In Python, using NumPy, zero initialization looks like:
import numpy as np
num_states = 10 # Example: 10 discrete states
num_actions = 4 # Example: 4 possible actions
q_table = np.zeros((num_states, num_actions))
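For comparison, here is a minimal sketch of optimistic initialization; the starting value of 1.0 is an arbitrary choice and should sit at or above the best return you expect from the environment.

# Optimistic initialization (illustrative): start every Q-value high so that
# untried actions look attractive and get explored early on.
optimistic_value = 1.0  # assumed upper bound on returns; tune for your environment
q_table_optimistic = np.full((num_states, num_actions), optimistic_value)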
The agent learns by interacting with the environment over many episodes, and each episode consists of multiple steps. Here's the flow for a single episode:

1. Reset the environment and observe the initial state s.
2. Repeat until the environment signals that the episode has ended (terminated or truncated):
   a. Choose an action a for the current state s, typically with an exploration strategy such as epsilon-greedy (described below).
   b. Take action a in the environment.
   c. Observe the reward r, the next state s′, and the termination status.
   d. Update the estimate Q(s,a) using the Q-learning update rule.
   e. Set s ← s′ and continue from step 2.a.

Let's break down step 2.d, the update rule:

Q(s,a) ← Q(s,a) + α [r + γ max_a′ Q(s′,a′) − Q(s,a)]

Here α is the learning rate, γ is the discount factor, r + γ max_a′ Q(s′,a′) is the temporal-difference (TD) target, and the bracketed quantity as a whole is the TD error: how far the current estimate is from that target.
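As a minimal sketch, this update can be packaged as a standalone helper; the function name and signature are illustrative rather than part of any library, and the full training loop later in this section inlines the same three lines.

def q_learning_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Apply one Q-learning update for the transition (state, action, reward, next_state)."""
    best_next_value = np.max(q_table[next_state, :])  # value of the greedy action in the next state
    td_target = reward + gamma * best_next_value      # bootstrapped TD target
    td_error = td_target - q_table[state, action]     # gap between the target and the current estimate
    q_table[state, action] += alpha * td_error        # move the estimate toward the target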
To ensure the agent finds the optimal policy, it needs to explore the environment sufficiently before settling on the best-known actions. A common strategy is epsilon-greedy (ϵ-greedy):
import random
epsilon = 0.1 # Exploration rate
# Assuming 'state' is the current state index
if random.uniform(0, 1) < epsilon:
    action = random.randint(0, num_actions - 1)  # Explore: choose a random action index
else:
    action = np.argmax(q_table[state, :])  # Exploit: choose the best action index based on the Q-table
Typically, ϵ starts high (e.g., 1.0) and gradually decreases (decays) over episodes, shifting the balance from exploration towards exploitation as the agent learns more.
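As a brief sketch, two common schedules are linear decay (the form used in the training loop below) and exponential decay; the constants here are illustrative.

epsilon = 1.0               # start fully exploratory
min_epsilon = 0.01          # keep a little exploration even late in training
epsilon_decay_rate = 0.001  # amount subtracted per episode (linear schedule)
decay_factor = 0.995        # multiplier per episode (exponential schedule, illustrative)

# Linear decay: subtract a fixed amount each episode, never dropping below the floor.
epsilon = max(min_epsilon, epsilon - epsilon_decay_rate)

# Exponential decay (alternative): shrink epsilon by a constant factor each episode.
epsilon = max(min_epsilon, epsilon * decay_factor)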
The performance of Q-learning depends heavily on its hyperparameters:

- Learning rate (α, learning_rate): how far each update moves the current estimate toward the TD target. Values that are too small slow learning; values that are too large can make it unstable.
- Discount factor (γ, discount_factor): how strongly future rewards are weighted relative to immediate ones. Values close to 1 make the agent far-sighted.
- Exploration rate (ϵ, epsilon), along with its decay rate and minimum value: controls the balance between exploration and exploitation over the course of training.
- Number of episodes (num_episodes): how much experience the agent collects; too few episodes may leave the Q-values far from convergence.
Here's a conceptual Python structure for the Q-learning training loop:
# Hyperparameters
learning_rate = 0.1
discount_factor = 0.99
epsilon = 1.0
epsilon_decay_rate = 0.001
min_epsilon = 0.01
num_episodes = 1000
# Assume 'env' is an initialized environment object with methods like:
# env.reset() -> returns initial state
# env.step(action) -> returns next_state, reward, terminated, truncated, info
# env.observation_space.n -> number of states
# env.action_space.n -> number of actions
num_states = env.observation_space.n
num_actions = env.action_space.n
q_table = np.zeros((num_states, num_actions))
rewards_per_episode = [] # To track learning progress
for episode in range(num_episodes):
    state, info = env.reset()  # Get initial state as an integer index
    terminated = False
    truncated = False
    total_episode_reward = 0

    while not terminated and not truncated:
        # Epsilon-greedy action selection
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore
        else:
            action = np.argmax(q_table[state, :])  # Exploit

        # Take action and observe outcome
        next_state, reward, terminated, truncated, info = env.step(action)

        # Q-learning update rule
        best_next_action_value = np.max(q_table[next_state, :])
        td_target = reward + discount_factor * best_next_action_value
        td_error = td_target - q_table[state, action]
        q_table[state, action] = q_table[state, action] + learning_rate * td_error

        # Update state and total reward
        state = next_state
        total_episode_reward += reward

    # Decay epsilon
    epsilon = max(min_epsilon, epsilon - epsilon_decay_rate)

    # Store episode reward (for plotting later)
    rewards_per_episode.append(total_episode_reward)

    if (episode + 1) % 100 == 0:
        print(f"Episode {episode + 1}: Total Reward: {total_episode_reward}, Epsilon: {epsilon:.3f}")
print("Training finished.")
# After training, the q_table contains the learned action-values.
# The optimal policy can be derived by choosing the action with the highest Q-value for each state.
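As a short follow-up sketch, the greedy policy can be read straight out of the table with np.argmax and, assuming the trained env and q_table from above are still in scope, replayed without exploration:

# Greedy policy: for each state, the action with the highest learned Q-value.
policy = np.argmax(q_table, axis=1)

# Run one evaluation episode that follows the greedy policy (no exploration).
state, info = env.reset()
terminated = truncated = False
eval_reward = 0
while not terminated and not truncated:
    action = int(policy[state])
    state, reward, terminated, truncated, info = env.step(action)
    eval_reward += reward
print(f"Greedy-policy episode reward: {eval_reward}")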
Tracking metrics like the total reward per episode is useful for understanding if the agent is learning. Plotting this can show convergence.
Total reward accumulated by the agent in each training episode. An upward trend generally indicates successful learning. (Example data shown).
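A minimal plotting sketch, assuming matplotlib is available; raw per-episode rewards are noisy, so averaging over a sliding window (100 episodes here, an arbitrary choice) makes the trend easier to see.

import matplotlib.pyplot as plt

window = 100  # smoothing window size (illustrative)
smoothed = np.convolve(rewards_per_episode, np.ones(window) / window, mode="valid")

plt.plot(smoothed)
plt.xlabel("Episode")
plt.ylabel(f"Mean reward over previous {window} episodes")
plt.title("Q-learning training progress")
plt.show()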
This hands-on section provided the structure and logic for implementing Q-learning. You initialized a Q-table, implemented the core update loop incorporating the Q-learning rule and epsilon-greedy exploration, and considered hyperparameter settings. By running this process over many episodes, the agent iteratively improves its Q-value estimates, ultimately learning a policy to maximize cumulative rewards. Remember that Q-learning is off-policy, meaning it learns the optimal Q∗ function even while potentially behaving sub-optimally due to exploration.