Let's put theory into practice. We've discussed why function approximation is needed for large state spaces and how linear methods combined with semi-gradient descent offer a way forward. Now we'll implement value function approximation using linear methods for a classic RL control problem: Mountain Car.

In the Mountain Car problem (from the Gymnasium library), an underpowered car sits in a valley and must drive up the right-hand hill to reach a goal. Since gravity is stronger than the car's engine, it cannot simply drive straight up; it must learn to build momentum by driving back and forth between the hills.

The state $s$ is continuous, defined by two variables:

- Position $p \in [-1.2, 0.6]$
- Velocity $v \in [-0.07, 0.07]$

Because the state space is continuous, we cannot use a simple table to store a value for every possible state, which makes this an ideal candidate for function approximation. Our goal here is to estimate the state-value function $V(s)$ under a fixed, simple policy using semi-gradient TD(0). We'll approximate $V(s)$ with a linear function:

$$ \hat{v}(s, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(s) = \sum_{i=1}^{d} w_i x_i(s) $$

where $\mathbf{x}(s)$ is a feature vector derived from the state $s$, and $\mathbf{w}$ is the weight vector we need to learn.

## Feature Engineering: Tile Coding

A common and effective way to create features for linear function approximation in continuous state spaces is tile coding. Imagine overlaying several grids (tilings) onto the state space, each slightly offset from the others. For a given state, we identify which tile it falls into within each grid. The feature vector $\mathbf{x}(s)$ is then a large binary vector with one component per tile per tiling: a component is 1 if the state falls into the corresponding tile and 0 otherwise.

This approach discretizes the space in a distributed manner. Each state activates multiple features (one per tiling), and similar states activate many of the same features, allowing for generalization: states that are close together share more active tiles than states that are far apart.

For our Mountain Car example, we define the number of tilings and the resolution of each grid (number of tiles per dimension). A state $(p, v)$ then activates exactly one tile in each tiling.

```python
import numpy as np

# Example configuration for tile coding
num_tilings = 8
num_tiles_per_dim = 8  # Creates an 8x8 grid for each tiling

# State space bounds, used for normalization/scaling
pos_min, pos_max = -1.2, 0.6
vel_min, vel_max = -0.07, 0.07

# Total number of features = num_tilings * (num_tiles_per_dim * num_tiles_per_dim)
total_features = num_tilings * (num_tiles_per_dim ** 2)


# Function to get the binary feature vector for a state.
# Note: a real implementation often uses a dedicated tile-coding library;
# this is only a placeholder that simulates the interface.
def get_feature_vector(state, num_tilings, tiles_per_dim, total_features):
    position, velocity = state
    feature_indices = []  # Indices of active tiles (features)

    # --- Placeholder for actual tile coding logic ---
    # Real tile coding would compute which tile the state falls into for each
    # of the 'num_tilings' tilings, taking the per-tiling offsets into account,
    # e.g. feature_indices = compute_active_tiles(state, config).
    # Here we just simulate activating one feature per tiling via hashing.
    # -------------------------------------------------
    for i in range(num_tilings):
        # Simplified hash/index calculation for demonstration only
        idx = hash((round(position * 10), round(velocity * 100), i)) % total_features
        feature_indices.append(idx)

    # Create the binary feature vector with the active indices set to 1
    x = np.zeros(total_features)
    if feature_indices:
        x[np.array(feature_indices, dtype=int)] = 1.0
    return x
```
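To make the feature dimensionality concrete for this configuration (8 tilings of an 8×8 grid, an illustrative choice rather than a requirement):

$$ d = \text{num\_tilings} \times \text{num\_tiles\_per\_dim}^2 = 8 \times 8^2 = 512, \qquad \text{active components of } \mathbf{x}(s) = \text{num\_tilings} = 8. $$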
As a more concrete alternative to the placeholder above, here is a simplified grid-tiling implementation that captures the core idea:

```python
def get_simple_grid_features(state, num_tilings, tiles_per_dim, total_features):
    position, velocity = state
    pos_scale = tiles_per_dim / (pos_max - pos_min)
    vel_scale = tiles_per_dim / (vel_max - vel_min)
    features = np.zeros(total_features)

    for i in range(num_tilings):
        # Apply a simple offset for each tiling
        offset_factor = i / num_tilings
        pos_offset = offset_factor * (pos_max - pos_min) / tiles_per_dim
        vel_offset = offset_factor * (vel_max - vel_min) / tiles_per_dim

        pos_shifted = position + pos_offset
        vel_shifted = velocity + vel_offset

        # Find the tile indices for this tiling
        pos_tile = int((pos_shifted - pos_min) * pos_scale)
        vel_tile = int((vel_shifted - vel_min) * vel_scale)

        # Clamp indices so they stay within bounds
        pos_tile = max(0, min(tiles_per_dim - 1, pos_tile))
        vel_tile = max(0, min(tiles_per_dim - 1, vel_tile))

        # Flat index of this tile within this tiling
        base_index = i * (tiles_per_dim ** 2)
        tile_index = base_index + vel_tile * tiles_per_dim + pos_tile
        features[tile_index] = 1.0  # Activate this feature

    return features
```

Note: `get_simple_grid_features` provides a basic grid tiling implementation. Proper tile coding often involves more sophisticated hashing and offset strategies for better generalization, but this conveys the core idea.
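Before moving on, it's worth a quick sanity check that the features behave as described: each state should activate exactly one tile per tiling, and nearby states should share most of their active tiles. The two states below are arbitrary illustrative picks:

```python
x_a = get_simple_grid_features((-0.5, 0.0), num_tilings, num_tiles_per_dim, total_features)
x_b = get_simple_grid_features((-0.48, 0.002), num_tilings, num_tiles_per_dim, total_features)

print(int(x_a.sum()))          # 8 -> exactly one active tile per tiling
print(int((x_a * x_b).sum()))  # number of tiles shared by the two nearby states
```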
## Implementing Semi-gradient TD(0) for Prediction

Now we'll use the semi-gradient TD(0) algorithm to learn the weights $\mathbf{w}$ for our linear value function approximator $\hat{v}(s, \mathbf{w})$. We'll evaluate a simple, fixed policy: always accelerate right (action index 2 in Gymnasium's `MountainCar-v0`).

The update rule for the weights $\mathbf{w}$ at each step, given a transition from state $S$ to $S'$ with reward $R$, is:

$$ \mathbf{w} \leftarrow \mathbf{w} + \alpha \, [R + \gamma \hat{v}(S', \mathbf{w}) - \hat{v}(S, \mathbf{w})] \, \nabla \hat{v}(S, \mathbf{w}) $$

Since $\hat{v}(S, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(S)$, the gradient with respect to $\mathbf{w}$ is simply the feature vector $\mathbf{x}(S)$, so the update becomes:

$$ \mathbf{w} \leftarrow \mathbf{w} + \alpha \, [R + \gamma \mathbf{w}^T \mathbf{x}(S') - \mathbf{w}^T \mathbf{x}(S)] \, \mathbf{x}(S) $$

Because $\mathbf{x}(S)$ is a sparse binary vector with exactly `num_tilings` active components, each update changes only those few weights. Let's implement this learning loop.

```python
import numpy as np
import gymnasium as gym
# import matplotlib.pyplot as plt  # Optional, for plotting

# --- Parameters ---
alpha = 0.1 / num_tilings  # Learning rate, often scaled by the number of active features
gamma = 1.0                # Discount factor (Mountain Car is episodic; gamma = 1 is common)
num_episodes = 5000

# Use the simple grid tiling implementation
feature_func = get_simple_grid_features

# --- Initialization ---
weights = np.zeros(total_features)
env = gym.make('MountainCar-v0')


# Helper function for prediction
def predict_value(state, w):
    features = feature_func(state, num_tilings, num_tiles_per_dim, total_features)
    return np.dot(w, features)


# --- Learning Loop ---
episode_rewards = []  # Track rewards per episode

print("Starting training...")
for episode in range(num_episodes):
    state, info = env.reset()
    done = False
    total_reward = 0
    step_count = 0

    while not done:
        # Fixed policy: always choose action 2 (accelerate right)
        action = 2

        # Features and value estimate for the current state S
        current_features = feature_func(state, num_tilings, num_tiles_per_dim, total_features)
        current_value = np.dot(weights, current_features)

        # Take the action, observe the next state S' and reward R
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        total_reward += reward

        # TD target (bootstrap only from non-terminal states)
        next_value = predict_value(next_state, weights) if not terminated else 0.0
        td_target = reward + gamma * next_value

        # TD error delta
        td_error = td_target - current_value

        # Semi-gradient TD(0) update: the gradient is just current_features
        weights += alpha * td_error * current_features

        # Move to the next state
        state = next_state
        step_count += 1

        # Safety break (MountainCar-v0 truncates episodes on its own, so this rarely triggers)
        if step_count > 10000:
            print(f"Warning: Episode {episode + 1} exceeded 10000 steps. Breaking.")
            done = True  # Force the loop to end

    episode_rewards.append(total_reward)
    if (episode + 1) % 500 == 0:
        print(f"Episode {episode + 1}/{num_episodes} finished. Total Reward: {total_reward}")

print("Training finished.")
env.close()

# --- Optional: Analyze Results ---
# Plotting episode_rewards will not show improvement: the policy is fixed, and
# the goal here is value prediction, not policy improvement. The interesting
# output is the learned value function itself, visualized below.
```
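With training done, we can query the approximator at any state. The states below are arbitrary illustrative picks, and the printed numbers will vary from run to run:

```python
# Spot-check the learned value estimates at a few hand-picked states.
# Values are negative because every step under this policy yields a reward of -1.
for s in [(-0.5, 0.0),     # resting near the bottom of the valley
          (0.5, 0.05),     # close to the goal, moving right
          (-1.0, -0.05)]:  # far left, moving left (gathering momentum)
    print(f"v_hat{s} = {predict_value(s, weights):.1f}")
```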
## Visualizing the Learned Value Function

After training, the `weights` vector holds the learned parameters, so we can estimate $\hat{v}(s, \mathbf{w})$ for any state $s$. A good way to understand what has been learned is to plot the value function over the state space. Since every reward is $-1$ per step, the values are negative; higher (less negative) values mean a state is "better", i.e. closer to reaching the goal. We expect states near the goal position ($p > 0.5$), or states with high velocity towards the goal, to have higher values.

Let's create a grid of states (position, velocity), compute the predicted value at each point using our learned weights, and visualize the result as a heatmap-style contour plot.

```python
# --- Visualization Code (using Plotly) ---
import numpy as np
import plotly.graph_objects as go

# Generate grid points for plotting (30x30 for a reasonably smooth plot)
positions = np.linspace(pos_min, pos_max, 30)
velocities = np.linspace(vel_min, vel_max, 30)

value_grid = np.zeros((len(velocities), len(positions)))
for i, vel in enumerate(velocities):
    for j, pos in enumerate(positions):
        state_eval = (pos, vel)
        # Use the trained weights to predict the value
        value_grid[i, j] = predict_value(state_eval, weights)

# Create a Plotly contour plot (heatmap style)
plotly_fig = go.Figure(data=go.Contour(
    z=value_grid,
    x=positions,
    y=velocities,
    colorscale='Viridis',  # Or 'Blues', 'RdBu', etc.
    contours=dict(
        coloring='heatmap',  # Fill the contours like a heatmap
        showlabels=False     # Set to True to print value labels on the contours
    ),
    colorbar=dict(title='State Value (V)')
))

plotly_fig.update_layout(
    title='Learned State-Value Function (V) for Mountain Car (Fixed Policy)',
    xaxis_title="Position",
    yaxis_title="Velocity",
    width=700, height=550,
    margin=dict(l=60, r=60, b=60, t=90)
)

plotly_fig.show()
# plotly_fig.to_json() can be used instead to export the figure for embedding.
```

*Estimated state-value function $\hat{v}(s, \mathbf{w})$ for the Mountain Car environment under a policy that always accelerates right. Higher (less negative) values indicate states considered better by the learned approximation.*
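If you prefer the classic 3-D view of this problem, the same grid can be rendered as a surface of the cost-to-go, $-\hat{v}(s, \mathbf{w})$. A minimal sketch, reusing `value_grid`, `positions`, and `velocities` from above:

```python
import plotly.graph_objects as go

# Render the negated value grid as a 3-D surface: high "cost-to-go" far from the goal.
surface_fig = go.Figure(data=go.Surface(
    z=-value_grid,
    x=positions,
    y=velocities,
    colorscale='Viridis',
    colorbar=dict(title='Cost-to-go (-V)')
))
surface_fig.update_layout(
    title='Cost-to-go (-V) for Mountain Car (Fixed Policy)',
    scene=dict(xaxis_title='Position', yaxis_title='Velocity', zaxis_title='-V'),
    width=700, height=550
)
surface_fig.show()
```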
## Discussion

This exercise demonstrates linear function approximation for value prediction in an environment with a continuous state space. Important takeaways include:

- **Necessity:** Tabular methods are infeasible here. Function approximation allows us to handle continuous or very large state spaces by learning a parameterized function.
- **Features matter:** The choice of features (here, tile coding) is significant. Good features help the linear approximator capture the important variations in the true value function. Tile coding provides a good balance of localization and generalization.
- **Semi-gradient:** We are bootstrapping, using $\hat{v}(S', \mathbf{w})$ to update $\hat{v}(S, \mathbf{w})$, so the TD target itself depends on $\mathbf{w}$; we treat it as a constant when taking the gradient, which is what makes the method "semi"-gradient. The resulting gradient $\nabla \hat{v}(S, \mathbf{w}) = \mathbf{x}(S)$ makes the updates computationally cheap.
- **Generalization:** The learned weights let us estimate the value of any state, including states never visited during training, by combining the activations of its features. The smooth value map in the visualization is evidence of this generalization.

This example focused on prediction: estimating $V(s)$ for a fixed policy. The natural next step, which builds on this foundation, is control: learning a better policy by approximating the action-value function $Q(s, a)$ with similar techniques (e.g., semi-gradient SARSA or Q-learning with function approximation). You would typically define features $\mathbf{x}(s, a)$ based on both the state and the action; a sketch of one way to set this up is shown below.
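To make that concrete, here is a minimal, untuned sketch of semi-gradient SARSA for Mountain Car control. It assumes the tile-coding helpers and constants defined earlier (`get_simple_grid_features`, `num_tilings`, `num_tiles_per_dim`, `total_features`) and keeps one weight vector per action, which is equivalent to stacking the state features into $\mathbf{x}(s, a)$; the hyperparameters are illustrative guesses.

```python
import numpy as np
import gymnasium as gym

env = gym.make('MountainCar-v0')
n_actions = env.action_space.n               # 3 discrete actions: left, no push, right
W = np.zeros((n_actions, total_features))    # one weight vector per action ~ stacked x(s, a)
alpha_q = 0.1 / num_tilings                  # illustrative learning rate
gamma_q = 1.0
epsilon = 0.1                                # illustrative exploration rate
num_control_episodes = 500                   # illustrative episode budget


def action_values(state):
    """Return Q(state, .) for all actions and the shared state feature vector."""
    x = get_simple_grid_features(state, num_tilings, num_tiles_per_dim, total_features)
    return W @ x, x


def epsilon_greedy(state):
    q, x = action_values(state)
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions), q, x
    return int(np.argmax(q)), q, x


for episode in range(num_control_episodes):
    state, info = env.reset()
    action, q, x = epsilon_greedy(state)
    done = False
    while not done:
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        if terminated:
            td_target = reward  # no bootstrap from a terminal state
        else:
            next_action, next_q, next_x = epsilon_greedy(next_state)
            td_target = reward + gamma_q * next_q[next_action]

        # Semi-gradient SARSA update: the gradient of Q(s, a) w.r.t. W[a] is x(s)
        W[action] += alpha_q * (td_target - q[action]) * x

        if not terminated:
            state, action, q, x = next_state, next_action, next_q, next_x

env.close()
```

With exploration from ε-greedy action selection and enough episodes, a setup along these lines typically learns to reach the goal, which the fixed always-accelerate-right policy cannot do.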