Okay, let's put theory into practice. We've discussed why function approximation is needed for large state spaces and how linear methods combined with semi-gradient descent offer a way forward. Now, we'll implement value function approximation using linear methods for a classic RL control problem: Mountain Car.
In the Mountain Car problem (from the Gymnasium library), an underpowered car is situated in a valley and must drive up the right-hand hill to reach a goal. Since gravity is stronger than the car's engine, it cannot simply drive straight up. It must learn to build momentum by driving back and forth between the hills.
The state $s$ is continuous, defined by two variables:

- the car's position $p \in [-1.2, 0.6]$
- the car's velocity $v \in [-0.07, 0.07]$
Because the state space is continuous, we cannot use a simple table to store values for every possible state. This makes it an ideal candidate for function approximation. Our goal here will be to estimate the state-value function V(s) under a fixed, simple policy using Semi-gradient TD(0). We'll approximate V(s) with a linear function:
$$\hat{v}(s, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(s) = \sum_{i=1}^{d} w_i x_i(s)$$

where $\mathbf{x}(s)$ is a feature vector derived from the state $s$, and $\mathbf{w}$ is the weight vector we need to learn.
A common and effective way to create features for linear function approximation in continuous state spaces is tile coding. Imagine overlaying several grids (tilings) onto the state space, each slightly offset from the others. For a given state, we identify which tile it falls into within each grid. The feature vector x(s) becomes a large binary vector where each component corresponds to one tile in one tiling. A component is 1 if the state falls into the corresponding tile and 0 otherwise.
This approach discretizes the space in a distributed manner. Each state activates multiple features (one per tiling), and similar states activate many of the same features, allowing for generalization. States that are close together will share more active tiles than states that are far apart.
For our Mountain Car example, we can define the number of tilings and the resolution of each grid (number of tiles per dimension). A state (p,v) will then activate one tile in each of the tilings.
import numpy as np

# Example configuration for tile coding
num_tilings = 8
num_tiles_per_dim = 8 # Creates an 8x8 grid for each tiling
# State space dimensions for normalization/scaling
pos_min, pos_max = -1.2, 0.6
vel_min, vel_max = -0.07, 0.07
# Total number of features = num_tilings * (num_tiles_per_dim * num_tiles_per_dim)
total_features = num_tilings * (num_tiles_per_dim ** 2)
# Function to get active features (tile indices) for a state.
# Note: a real implementation often uses libraries like `TileCoder`;
# this is a conceptual placeholder.
def get_feature_vector(state, num_tilings, tiles_per_dim, total_features):
    position, velocity = state
    feature_indices = []  # Indices of active tiles (features)
    # --- Placeholder for actual tile coding logic ---
    # Real tile coding would determine which tile the state falls into
    # for each of the 'num_tilings', taking each tiling's offset into
    # account, and return those tile indices directly.
    # Here we simply simulate one active feature per tiling with a hash.
    for i in range(num_tilings):
        idx = hash((round(position * 10), round(velocity * 100), i)) % total_features
        feature_indices.append(idx)
    # Build the binary feature vector: 1 at each active tile, 0 elsewhere
    x = np.zeros(total_features)
    x[np.array(feature_indices, dtype=int)] = 1.0
    return x
# --- A more concrete, simplified grid tiling example ---
# (Alternative to the conceptual placeholder above for understanding)
def get_simple_grid_features(state, num_tilings, tiles_per_dim, total_features):
position, velocity = state
pos_scale = tiles_per_dim / (pos_max - pos_min)
vel_scale = tiles_per_dim / (vel_max - vel_min)
features = np.zeros(total_features)
for i in range(num_tilings):
# Apply a simple offset for each tiling
offset_factor = i / num_tilings
pos_offset = offset_factor * (pos_max - pos_min) / tiles_per_dim
vel_offset = offset_factor * (vel_max - vel_min) / tiles_per_dim
pos_shifted = position + pos_offset
vel_shifted = velocity + vel_offset
# Find tile indices for this tiling
pos_tile = int((pos_shifted - pos_min) * pos_scale)
vel_tile = int((vel_shifted - vel_min) * vel_scale)
# Clamp indices to be within bounds
pos_tile = max(0, min(tiles_per_dim - 1, pos_tile))
vel_tile = max(0, min(tiles_per_dim - 1, vel_tile))
# Calculate the flat index for this tile in this tiling
base_index = i * (tiles_per_dim ** 2)
tile_index = base_index + vel_tile * tiles_per_dim + pos_tile
if 0 <= tile_index < total_features:
features[tile_index] = 1.0 # Activate this feature
return features
Note: The `get_simple_grid_features` function provides a basic grid tiling implementation. Proper tile coding often involves more sophisticated hashing and offset strategies for better generalization, but this gives the core idea.
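As a quick sanity check of the generalization claim above, we can compare which tiles two nearby states activate versus a distant state. The snippet below assumes the configuration and `get_simple_grid_features` defined earlier; the exact overlap counts depend on the tiling offsets, but nearby states should share most of their active tiles while distant states share few or none.
# Quick check: nearby states should share most active tiles,
# distant states should share few or none.
s_a = (-0.50, 0.000)   # a state near the valley bottom
s_b = (-0.48, 0.002)   # a nearby state
s_c = (0.40, 0.060)    # a distant state, far up the right slope
x_a = get_simple_grid_features(s_a, num_tilings, num_tiles_per_dim, total_features)
x_b = get_simple_grid_features(s_b, num_tilings, num_tiles_per_dim, total_features)
x_c = get_simple_grid_features(s_c, num_tilings, num_tiles_per_dim, total_features)
print("Active tiles per state:", int(x_a.sum()))          # one per tiling -> 8
print("Shared tiles (nearby): ", int(np.sum(x_a * x_b)))  # expected: most of the 8
print("Shared tiles (distant):", int(np.sum(x_a * x_c)))  # expected: 0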
Now we'll use the Semi-gradient TD(0) algorithm to learn the weights $\mathbf{w}$ of our linear value function approximator $\hat{v}(s, \mathbf{w})$. We'll evaluate a simple, fixed policy: always choose action 2, which accelerates the car to the right.
The update rule for the weights w at each step, given a transition from state S to S′ with reward R, is:
$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ R + \gamma \hat{v}(S', \mathbf{w}) - \hat{v}(S, \mathbf{w}) \right] \nabla \hat{v}(S, \mathbf{w})$$
Since $\hat{v}(S, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(S)$, the gradient with respect to $\mathbf{w}$ is simply the feature vector $\mathbf{x}(S)$. The update becomes:

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ R + \gamma \mathbf{w}^T \mathbf{x}(S') - \mathbf{w}^T \mathbf{x}(S) \right] \mathbf{x}(S)$$

Let's implement this learning loop.
import numpy as np
import gymnasium as gym
# import matplotlib.pyplot as plt # Optional for plotting
# --- Parameters ---
alpha = 0.1 / num_tilings # Learning rate, often scaled by number of active features
gamma = 1.0 # Discount factor (Mountain Car is episodic, gamma=1 is common)
num_episodes = 5000
# Use the simple grid tiling implementation
feature_func = get_simple_grid_features
# --- Initialization ---
weights = np.zeros(total_features)
env = gym.make('MountainCar-v0')
# Helper function for prediction
def predict_value(state, w):
features = feature_func(state, num_tilings, num_tiles_per_dim, total_features)
return np.dot(w, features)
# --- Learning Loop ---
episode_rewards = [] # Track rewards per episode
print("Starting training...")
for episode in range(num_episodes):
state, info = env.reset()
done = False
total_reward = 0
step_count = 0
while not done:
# Fixed policy: always choose action 2 (accelerate right)
action = 2
# Get features for the current state S
current_features = feature_func(state, num_tilings, num_tiles_per_dim, total_features)
current_value = np.dot(weights, current_features)
# Take action, observe next state S' and reward R
next_state, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated
total_reward += reward
# Calculate TD target
next_value = predict_value(next_state, weights) if not terminated else 0.0
td_target = reward + gamma * next_value
# Calculate TD error delta
td_error = td_target - current_value
# Update weights using Semi-gradient TD(0)
# Gradient is just the feature vector current_features
weights += alpha * td_error * current_features
# Move to the next state
state = next_state
step_count += 1
# Safety break for very long episodes if needed
if step_count > 10000:
print(f"Warning: Episode {episode + 1} exceeded 10000 steps. Breaking.")
done = True # Force break
episode_rewards.append(total_reward)
if (episode + 1) % 500 == 0:
print(f"Episode {episode + 1}/{num_episodes} finished. Total Reward: {total_reward}")
print("Training finished.")
env.close()
# --- Optional: Analyze Results ---
# You could plot episode_rewards, but since the policy is fixed it will not
# improve; the goal here is value prediction, not policy improvement.
# More interesting is to visualize the learned value function.
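Before plotting, a quick spot check can confirm that the learned approximator assigns different values to qualitatively different states. This is only a sketch of such a check; the exact numbers depend on the training run, and any tiles the fixed policy never visits keep their initial weight of zero.
# Spot-check learned values at a few hand-picked states.
# Exact magnitudes depend on the run; tiles never visited under the
# fixed policy keep their initial weight of zero.
check_states = [
    (-0.5, 0.0),    # near the bottom of the valley, at rest
    (-0.5, 0.07),   # same position, maximum rightward velocity
    (0.4, 0.05),    # far up the right slope, moving toward the goal
]
for s in check_states:
    print(f"V{s} = {predict_value(s, weights):.2f}")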
After training, the `weights` vector holds the learned parameters, and we can estimate $\hat{v}(s, \mathbf{w})$ for any state $s$. A good way to understand what has been learned is to visualize the value function over the state space. Because every step yields a reward of -1, the values are negative; higher (less negative) values indicate states the approximation considers better. We expect states near the goal position ($p > 0.5$), or states with high velocity toward the goal, to have higher values.
Let's create a grid of states (position, velocity) and compute the predicted value for each point using our learned `weights`. We can then visualize this as a heatmap or contour plot.
# --- Visualization Code (using Plotly) ---
import plotly.graph_objects as go
import numpy as np # Ensure numpy is imported
# Generate grid points for plotting
positions = np.linspace(pos_min, pos_max, 30) # Increased density for smoother plot
velocities = np.linspace(vel_min, vel_max, 30)
value_grid = np.zeros((len(velocities), len(positions)))
for i, vel in enumerate(velocities):
for j, pos in enumerate(positions):
state_eval = (pos, vel)
# Use the trained weights to predict value
value_grid[i, j] = predict_value(state_eval, weights)
# Create Plotly contour plot (Heatmap style)
plotly_fig = go.Figure(data=go.Contour(
z=value_grid,
x=positions,
y=velocities,
colorscale='Viridis', # Or 'Blues', 'RdBu', etc.
contours=dict(
coloring='heatmap', # Fill the contours like a heatmap
showlabels=False # Optional: show value labels on contours
),
colorbar=dict(title='State Value (V)')
))
plotly_fig.update_layout(
title='Learned State-Value Function (V) for Mountain Car (Fixed Policy)',
xaxis_title="Position",
yaxis_title="Velocity",
width=700,
height=550,
margin=dict(l=60, r=60, b=60, t=90)
)
# To display the plot (if running interactively)
# plotly_fig.show()
Estimated state-value function $\hat{v}(s, \mathbf{w})$ for the Mountain Car environment under a policy that always accelerates right. Higher (less negative) values indicate states considered better by the learned approximation.
This exercise demonstrates the application of linear function approximation for value prediction in an environment with a continuous state space. Key takeaways include:

- Continuous state spaces rule out tabular value storage, making function approximation necessary.
- Tile coding turns a continuous state into a sparse binary feature vector, and the overlapping, offset tilings let nearby states share features, which provides generalization.
- With a linear approximator, the gradient $\nabla \hat{v}(S, \mathbf{w})$ is simply the feature vector $\mathbf{x}(S)$, so the semi-gradient TD(0) update reduces to a cheap per-step weight adjustment.
- Practical details matter: scaling the learning rate by the number of tilings (the number of active features) helps keep updates stable.
This example focused on prediction (estimating V(s) for a fixed policy). The next natural step, which builds upon this foundation, is control: learning an optimal policy by approximating the action-value function Q(s,a) using similar techniques (e.g., Semi-gradient SARSA or Q-learning with function approximation). You would typically define features x(s,a) based on both the state and the action.
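As a rough sketch of how that might look with the tile features from this section, one simple and common option is to keep one block of state features per action, so that $\mathbf{x}(s, a)$ is the state feature vector placed in the block belonging to action $a$. The helper below is hypothetical and only illustrates this stacking scheme; it reuses `get_simple_grid_features` and the configuration defined earlier.
# Hypothetical sketch: action-value features x(s, a) built by stacking one
# block of state features per action. Q(s, a) = w . x(s, a) then uses a
# weight vector of length num_actions * total_features.
num_actions = 3  # MountainCar-v0 actions: accelerate left, do nothing, accelerate right

def get_state_action_features(state, action):
    x_s = get_simple_grid_features(state, num_tilings, num_tiles_per_dim, total_features)
    x_sa = np.zeros(num_actions * total_features)
    # Copy the state features into the block for this action; other blocks stay zero.
    x_sa[action * total_features:(action + 1) * total_features] = x_s
    return x_sa

# Example: estimated Q-values for every action in one state (weights untrained here)
q_weights = np.zeros(num_actions * total_features)
q_values = [np.dot(q_weights, get_state_action_features((-0.5, 0.0), a)) for a in range(num_actions)]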