Let's put theory into practice by implementing the First-Visit Monte Carlo Prediction algorithm. Our goal is to estimate the state-value function Vπ for a given policy π, using only sample episodes generated by following that policy in an environment. We won't need any knowledge of the environment's transition probabilities or reward dynamics.
For this exercise, we'll use the classic Blackjack environment, available through the Gymnasium library. Recall from the previous section that MC Prediction works by averaging the returns observed after the first visit to each state across many episodes.
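As a quick refresher, if G^(i)(s) denotes the return following the first visit to state s in the i-th episode that visits s, and N(s) is the number of such episodes (notation introduced just for this formula), the first-visit estimate is simply the sample mean of those returns:

$$V_\pi(s) \approx \frac{1}{N(s)} \sum_{i=1}^{N(s)} G^{(i)}(s), \qquad G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$$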
First, ensure you have Gymnasium installed (pip install "gymnasium[toy-text]"). The Blackjack environment (Blackjack-v1) simulates the popular card game. Each observation (state) is a tuple (player_sum, dealer_showing_card, usable_ace):

- player_sum: the current sum of the player's cards (integer between 4 and 21).
- dealer_showing_card: the value of the dealer's face-up card (integer between 1 and 10, where 1 represents an Ace).
- usable_ace: whether the player has an Ace that can count as 11 without busting (boolean, 0 or 1).
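To get a feel for these observations before writing any learning code, here is a small exploratory sketch. It relies only on the standard Gymnasium reset/step API; the example values in the comments are illustrative, not exact:

import gymnasium as gym

# Create the environment and inspect its spaces
env = gym.make('Blackjack-v1', sab=True)
print(env.observation_space)  # Tuple(Discrete(32), Discrete(11), Discrete(2))
print(env.action_space)       # Discrete(2): 0 = stick, 1 = hit

# Sample an initial state and take one action (hit)
obs, info = env.reset(seed=0)
print(obs)  # e.g. something like (14, 10, 0): player sum 14, dealer shows 10, no usable ace
next_obs, reward, terminated, truncated, info = env.step(1)
print(next_obs, reward, terminated)
env.close()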
We need a policy π to generate episodes. We'll use a simple, fixed policy often used as a baseline:

Policy π: Stick if the player's current sum is 20 or 21. Otherwise, hit.
This policy is reasonable but not optimal. Our goal is not to find the best policy yet, but simply to evaluate this specific policy using MC Prediction.
Let's outline the steps and then implement them in Python.
1. Initialize a running sum of returns for each state (state_returns_sum) and the count of visits to each state (state_visit_count).
2. For each of num_episodes episodes, generate a complete episode by following policy π, storing the sequence of states visited and the rewards that follow them.
3. Loop backwards through the episode, accumulating the return G_t = R_{t+1} + gamma*R_{t+2} + ... at each time step.
4. If time step t is the first visit to state S_t in this episode, update state_visit_count[S_t] += 1 and state_returns_sum[S_t] += G_t.
5. Estimate the value as the average observed return: V(S_t) = state_returns_sum[S_t] / state_visit_count[S_t].

Now, let's translate this into Python code.
import gymnasium as gym
import numpy as np
from collections import defaultdict
# Initialize the environment
env = gym.make('Blackjack-v1', sab=True)  # sab=True follows the Blackjack rules described in Sutton & Barto
# Define the simple policy: Stick on 20 or 21, otherwise Hit.
def simple_policy(state):
    """
    Args:
        state: Tuple (player_sum, dealer_showing, usable_ace)
    Returns:
        action: 0 (stick) or 1 (hit)
    """
    player_sum, _, _ = state
    if player_sum >= 20:
        return 0  # Stick
    else:
        return 1  # Hit
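# Quick sanity check of the policy on a couple of example states
# (hypothetical hands, chosen just for illustration):
assert simple_policy((15, 10, 0)) == 1  # sum below 20 -> hit
assert simple_policy((20, 6, 1)) == 0   # sum of 20 -> stick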
# MC Prediction Algorithm (First-Visit)
def mc_prediction_first_visit(policy, env, num_episodes, gamma=1.0):
    """
    Performs First-Visit Monte Carlo Prediction.
    Args:
        policy: A function that takes a state and returns an action.
        env: The Gymnasium environment.
        num_episodes: Number of episodes to generate.
        gamma: Discount factor.
    Returns:
        V: A dictionary mapping state -> value estimate.
    """
    # Initialize dictionaries
    # defaultdict avoids checking if a key exists
    state_returns_sum = defaultdict(float)
    state_visit_count = defaultdict(int)
    V = defaultdict(float)  # Final value estimates
    for i_episode in range(num_episodes):
        # Print progress
        if (i_episode + 1) % 10000 == 0:
            print(f"\rEpisode {i_episode+1}/{num_episodes}", end="")

        # Generate an episode following the policy
        episode = []
        state, _ = env.reset()  # Get initial state
        terminated = False
        truncated = False  # Handle potential truncation by Gymnasium
        while not terminated and not truncated:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, reward))  # Store (S_t, R_{t+1})
            state = next_state
        # Process the episode for First-Visit MC
        # The episode stores (S_t, R_{t+1}) pairs. We iterate backwards so the
        # return G_t = R_{t+1} + gamma*R_{t+2} + ... accumulates in a single pass.
        G = 0.0  # Initialize return
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            G = gamma * G + reward
            # First-visit check: only update if 'state' does not appear
            # at any earlier time step of this episode
            if state not in [s for (s, _) in episode[:t]]:
                state_returns_sum[state] += G
                state_visit_count[state] += 1
                # Update value estimate as the average of first-visit returns
                V[state] = state_returns_sum[state] / state_visit_count[state]
print(f"\nFinished {num_episodes} episodes.")
return V
# Run the prediction
num_episodes = 500000
V_simple_policy = mc_prediction_first_visit(simple_policy, env, num_episodes)
# Print some example values (optional)
# Example: Value of state (18, 7, False) - Player sum 18, dealer shows 7, no usable ace
print(f"\nExample state value V(18, 7, False): {V_simple_policy.get((18, 7, False), 'Not Visited')}")
# Example: Value of state (21, 10, True) - Player sum 21, dealer shows 10, usable ace
print(f"Example state value V(21, 10, True): {V_simple_policy.get((21, 10, True), 'Not Visited')}")
env.close()
Let's walk through the main pieces of this code.

simple_policy(state): This function implements our fixed strategy. It takes the current state tuple and returns 0 (stick) if the player's sum is 20 or 21, and 1 (hit) otherwise.

mc_prediction_first_visit(...):
- Initializes state_returns_sum, state_visit_count, and V using defaultdict for convenience. defaultdict(float) initializes new keys with 0.0, and defaultdict(int) initializes them with 0.
- Loops for num_episodes. In each iteration it generates one complete episode by repeatedly calling env.step(policy(state)). We store tuples of (state, reward), where reward is R_{t+1}, the reward received after acting in state S_t.
- Once the episode terminates, it loops backwards over the stored (S_t, R_{t+1}) pairs (t = T-1, ..., 0), accumulating the return G.
- The check state not in [s for (s, _) in episode[:t]] ensures we only process the return for the first occurrence of the state in this specific episode, which is exactly the first-visit rule. (In Blackjack a state cannot recur within a single episode, so this check rarely matters here, but it keeps the implementation faithful to the algorithm.)
- On a first visit, we update the total returns sum and visit count for that state and recalculate the average return, which is our estimate V(state).
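As a side note, the running average computed in that last step can also be maintained incrementally, without storing the sum of returns at all. This is an equivalent alternative to what the code above does, not a change to it:

$$V_N(s) = V_{N-1}(s) + \frac{1}{N}\bigl(G - V_{N-1}(s)\bigr)$$

where N is the number of first visits to s observed so far and G is the newly observed return.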
Estimating values for all possible Blackjack states is useful, but visualizing them provides more insight. Since the state space is three-dimensional (player_sum, dealer_showing, usable_ace), we can create 2D plots by fixing one dimension. Let's plot the estimated state value V(s) against the player_sum, separately for cases with and without a usable ace, holding the dealer_showing_card constant (e.g., dealer shows a 7).
import matplotlib.pyplot as plt

def plot_blackjack_values(V, dealer_card=7):
    """Helper function to plot V(s) against player sum"""
    player_sums = np.arange(12, 22)  # Relevant player sums
    # Values for states without usable ace
    values_no_ace = [V.get((s, dealer_card, False), 0) for s in player_sums]
    # Values for states with usable ace
    values_usable_ace = [V.get((s, dealer_card, True), 0) for s in player_sums]

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot(player_sums, values_no_ace, label='No Usable Ace', marker='o')
    ax.plot(player_sums, values_usable_ace, label='Usable Ace', marker='x')
    ax.set_xlabel("Player Sum")
    ax.set_ylabel("State Value (Estimated V)")
    ax.set_title(f"Value Function vs Player Sum (Dealer Showing {dealer_card})")
    ax.legend()
    ax.grid(True)
    plt.show()
# Generate the plot after running the prediction
# Using V_simple_policy calculated earlier
plot_blackjack_values(V_simple_policy, dealer_card=7)
plot_blackjack_values(V_simple_policy, dealer_card=2) # Example for a different dealer card
The plots show the estimated value (expected return) for different player sums, given that the dealer shows a specific card (e.g., 7 or 2). Under this policy you'll typically see low, often negative, values for sums below 20, since the player keeps hitting and frequently busts, followed by a sharp jump at 20 and 21, where the policy finally sticks with a strong hand. Having a usable ace generally raises the value, because the ace can absorb a hit that would otherwise bust the player.
This practical exercise demonstrates how MC Prediction learns state values directly from experience, without needing a model of the environment. By running episodes and averaging the observed returns for each state, we can effectively estimate Vπ. This forms the foundation for MC Control methods, where we'll use similar ideas to estimate action-values (Qπ) and find optimal policies.