Storing values for every state or state-action pair in a table becomes infeasible when dealing with large or continuous state spaces. Imagine trying to create a Q-table for a self-driving car where the state includes sensor readings like camera images and Lidar data; the state space is practically infinite! Tabular methods simply don't scale.
The solution is to move from explicit storage to estimation. Instead of learning the exact value $v(s)$ or $q(s, a)$ for every state or state-action pair, we learn a parameterized function that approximates these values. This technique is known as Value Function Approximation (VFA).
We introduce a function, let's call it $\hat{v}$, which takes a state $s$ and a parameter vector $\mathbf{w}$ as input, and outputs an estimated state value:

$$\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$$
Similarly, we can approximate the action-value function with a function $\hat{q}$ that takes the state $s$, action $a$, and parameters $\mathbf{w}$:

$$\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$$
Here, $\mathbf{w}$ is a vector of weights or parameters (e.g., coefficients in a linear model or weights in a neural network). The main idea is that the number of parameters in $\mathbf{w}$ is significantly smaller than the total number of states $|\mathcal{S}|$ or state-action pairs $|\mathcal{S}| \times |\mathcal{A}|$. For instance, we might have millions of states but only need a few hundred or thousand parameters in $\mathbf{w}$.
The primary advantage of using function approximation is generalization. Because the function approximator learns a relationship based on the parameters $\mathbf{w}$, it can estimate values even for states it hasn't encountered before, or hasn't encountered very often. If two states $s_1$ and $s_2$ are considered "similar" (often determined by how we represent them, which we'll discuss soon), the function approximator will likely produce similar value estimates $\hat{v}(s_1, \mathbf{w})$ and $\hat{v}(s_2, \mathbf{w})$. This allows the agent to leverage experience gained in one part of the state space to make better decisions in other, similar parts. Tabular methods, in contrast, treat each state independently; learning about $s_1$ tells you nothing about $s_2$.
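The sketch below, in Python with NumPy, illustrates both ideas: a value estimate $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$ defined by a small parameter vector over a conceptually huge state space, and the way an update driven by one state shifts the estimates of states with similar features. The feature mapping, step size, and target value are invented for this illustration.

```python
import numpy as np

NUM_FEATURES = 8  # only 8 learnable parameters (an arbitrary choice for this sketch)

def state_features(state):
    """Map a raw state (here just an integer id) to a small feature vector.
    This hand-rolled featurization exists purely for illustration."""
    x = np.zeros(NUM_FEATURES)
    x[state % (NUM_FEATURES - 1)] = 1.0  # crude "which region is this state in" feature
    x[-1] = 1.0                          # constant bias feature
    return x

def v_hat(state, w):
    """Linear value estimate: v_hat(s, w) = w . x(s)."""
    return w @ state_features(state)

# The state space could hold millions of states, but the approximator
# only ever stores NUM_FEATURES numbers.
w = np.zeros(NUM_FEATURES)

# Nudge w toward a made-up target of 5.0 for state 3 only.
alpha, target = 0.1, 5.0
w += alpha * (target - v_hat(3, w)) * state_features(3)

print(v_hat(3, w))   # moved toward 5.0
print(v_hat(10, w))  # also moved: state 10 shares state 3's features (10 % 7 == 3)
print(v_hat(4, w))   # moved much less, only through the shared bias feature
```

A table with one entry per state would have needed a separate update for state 10; here the shared features do that work automatically.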
We can use various types of functions for $\hat{v}$ and $\hat{q}$. Common choices include linear combinations of state features and nonlinear approximators such as neural networks.
In this course, we will primarily focus on linear methods and introduce the concepts behind using neural networks.
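For contrast with the linear case, here is a minimal sketch of what a nonlinear approximator might look like: a one-hidden-layer neural network whose weights collectively play the role of $\mathbf{w}$. The layer sizes, feature dimension, and initialization are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters of a tiny one-hidden-layer network; together they form w.
W1 = rng.normal(scale=0.1, size=(16, 4))  # hidden weights: 16 units, 4 input features
b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=16)       # output layer weights
b2 = 0.0

def v_hat_nn(x, params):
    """Nonlinear value estimate for a 4-dimensional feature vector x."""
    W1, b1, W2, b2 = params
    hidden = np.tanh(W1 @ x + b1)
    return W2 @ hidden + b2

params = (W1, b1, W2, b2)
print(v_hat_nn(np.array([0.1, -0.3, 0.7, 0.0]), params))
```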
Our objective when using VFA is to find the parameter vector $\mathbf{w}$ that makes our approximation $\hat{v}(s, \mathbf{w})$ or $\hat{q}(s, a, \mathbf{w})$ as close as possible to the true value function $v_\pi(s)$ or $q_\pi(s, a)$ (or the optimal $v_*(s)$ or $q_*(s, a)$). This is typically framed as minimizing an error objective, such as the Mean Squared Value Error (MSVE) over the distribution of states encountered:

$$\overline{VE}(\mathbf{w}) = \sum_{s \in \mathcal{S}} \mu(s) \left[ v_\pi(s) - \hat{v}(s, \mathbf{w}) \right]^2$$
where $\mu(s)$ is some weighting indicating how much we care about the error in state $s$, typically the fraction of time the agent spends in $s$ under its policy.
While minimizing the MSVE directly is often the goal, the algorithms we'll use (like adaptations of TD learning) actually optimize slightly different objectives due to the nature of RL updates.
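As a small numerical illustration of the objective itself, the sketch below evaluates the MSVE for a five-state problem. The true values, the weighting $\mu(s)$, and the features are all made up here, since in practice $v_\pi$ is exactly what we do not have.

```python
import numpy as np

true_v = np.array([1.0, 2.0, 0.5, -1.0, 3.0])    # v_pi(s); unknown in a real problem
mu     = np.array([0.4, 0.3, 0.1, 0.1, 0.1])     # weighting over states, sums to 1
X      = np.array([[1.0, s] for s in range(5)])  # features x(s) = [1, s]

def msve(w):
    """MSVE(w) = sum over s of mu(s) * (v_pi(s) - v_hat(s, w))^2, with v_hat = X @ w."""
    errors = true_v - X @ w
    return np.sum(mu * errors ** 2)

print(msve(np.zeros(2)))           # error of an untrained approximator
print(msve(np.array([1.0, 0.2])))  # lower error for a hand-picked, better w
```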
You might notice that finding the parameters $\mathbf{w}$ sounds similar to supervised learning. We have inputs (states $s$, or state-action pairs $(s, a)$) and we want to predict target outputs (the true values $v_\pi(s)$ or $q_\pi(s, a)$). Indeed, we will use techniques like gradient descent, familiar from supervised learning, to update $\mathbf{w}$.
However, there's a significant difference: in RL, we usually don't know the true target values $v_\pi(s)$ or $q_\pi(s, a)$. Instead, we use estimates of these values derived from interaction with the environment (e.g., observed rewards and subsequent estimated values, as in TD learning). This means our target values are often noisy, biased, and non-stationary (they change as the policy and value estimates improve), which presents unique challenges compared to standard supervised learning.
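To make that difference concrete, here is a hedged sketch of semi-gradient TD(0) with a linear approximator: a gradient-descent-style update, but with a bootstrapped target $r + \gamma \hat{v}(s', \mathbf{w})$ standing in for the unknown $v_\pi(s)$. The toy random-walk environment, features, and hyperparameters are all invented for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
NUM_STATES, GAMMA, ALPHA = 10, 0.95, 0.05

def features(s):
    """Two simple features: a constant bias and the scaled state index."""
    return np.array([1.0, s / NUM_STATES])

def step(s):
    """Toy random walk: move left or right; reward +1 only at the right edge."""
    s_next = min(max(s + rng.choice([-1, 1]), 0), NUM_STATES - 1)
    reward = 1.0 if s_next == NUM_STATES - 1 else 0.0
    done = s_next in (0, NUM_STATES - 1)
    return s_next, reward, done

w = np.zeros(2)
for _ in range(2000):                            # episodes
    s, done = NUM_STATES // 2, False
    while not done:
        s_next, r, done = step(s)
        # Bootstrapped target: uses our own, still-changing estimate of v(s').
        target = r + (0.0 if done else GAMMA * w @ features(s_next))
        td_error = target - w @ features(s)
        w += ALPHA * td_error * features(s)      # semi-gradient update
        s = s_next

print(w)                                         # learned parameters
print(w @ features(2), w @ features(8))          # estimates should rise toward the right edge
```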
In the following sections, we'll look at how to represent states using features and how to apply gradient-based methods to learn the parameters for both linear and non-linear function approximators within the RL framework.