As we discussed in the chapter introduction, representing value functions using tables becomes infeasible when dealing with problems that have a very large number of states or continuous state spaces. Imagine trying to create a table entry for every possible pixel configuration on a screen or every possible joint angle combination for a complex robot! Function approximation provides a way forward by estimating the value function using a more compact representation. A fundamental step in this process is defining how we represent the state s itself as input to our approximator. Instead of using the raw state description, we often work with a feature vector, denoted as x(s).
A feature vector x(s) is typically a column vector of real-valued numbers that captures important properties of the state s:
$$ x(s) = \begin{pmatrix} x_1(s) \\ x_2(s) \\ \vdots \\ x_d(s) \end{pmatrix} $$

Here, $d$ is the number of features, and each $x_i(s)$ is the value of the $i$-th feature for state $s$. The crucial idea is that the number of features $d$ is usually much, much smaller than the total number of states $|S|$. This compressed representation allows our learning algorithm to generalize. If two states $s$ and $s'$ have similar feature vectors, $x(s) \approx x(s')$, the function approximator will likely produce similar value estimates for them, even if one of the states has never been visited.
The goal is to select features that are informative about the state's value. Good features should distinguish states that have meaningfully different values, assign similar feature vectors to states with similar values so the approximator generalizes well, and keep the dimension d small enough for learning to remain tractable.
The process of designing these features is often called feature engineering. It can range from simple techniques to quite sophisticated ones, sometimes requiring significant domain expertise.
Let's look at some ways to construct feature vectors:
For some problems, the state representation might already be a relatively low-dimensional vector of numerical values. In the classic CartPole environment, for instance, the state is often represented by four numbers: cart position, cart velocity, pole angle, and pole angular velocity.
$$ s = (\text{position}, \text{velocity}, \text{angle}, \text{angular velocity}) $$

In such cases, we might simply use the state representation directly as the feature vector:

$$ x(s) = \begin{pmatrix} \text{position} \\ \text{velocity} \\ \text{angle} \\ \text{angular velocity} \end{pmatrix} $$

Even here, normalization or scaling of these raw values is often beneficial.
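As a minimal sketch of this idea, the snippet below rescales each component of the raw CartPole state to [0, 1]. The bounds used here are illustrative assumptions (the velocity terms are technically unbounded), not values taken from any particular environment implementation.

```python
import numpy as np

# Illustrative bounds for the four CartPole observations (assumed values;
# the velocity components are unbounded, so we pick clipping limits).
STATE_LOW = np.array([-2.4, -3.0, -0.21, -3.5])
STATE_HIGH = np.array([2.4, 3.0, 0.21, 3.5])

def cartpole_features(state):
    """Use the raw CartPole state as the feature vector, rescaled to [0, 1]."""
    s = np.clip(state, STATE_LOW, STATE_HIGH)
    return (s - STATE_LOW) / (STATE_HIGH - STATE_LOW)

# Example: cart slightly right of center, pole nearly upright.
x = cartpole_features(np.array([0.1, 0.5, 0.02, -0.3]))
```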
When the raw state is complex (like an image or a configuration in a board game), we can use our understanding of the problem domain to manually define relevant features.
Creating good hand-crafted features often requires iteration and experimentation.
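As a hedged illustration, suppose we were hand-crafting features for a small gridworld navigation task. The task, the specific feature choices, and the helper function below are hypothetical and exist only to show the flavor of the process.

```python
import numpy as np

def gridworld_features(agent_pos, goal_pos, walls):
    """Hand-crafted features for a hypothetical gridworld navigation task."""
    ax, ay = agent_pos
    gx, gy = goal_pos
    distance_to_goal = abs(ax - gx) + abs(ay - gy)      # Manhattan distance
    blocked_neighbors = sum(
        (ax + dx, ay + dy) in walls                     # adjacent walls
        for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]
    )
    return np.array([1.0,                 # constant bias feature
                     distance_to_goal,    # how far the agent is from the goal
                     blocked_neighbors])  # how constrained the agent currently is

x = gridworld_features(agent_pos=(2, 3), goal_pos=(5, 5), walls={(2, 4), (3, 3)})
```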
If the underlying value function is expected to be a non-linear function of some basic state variables, we can create polynomial features. For a state $s$ with basic features $f_1(s)$ and $f_2(s)$, we could create a feature vector including polynomial terms:
$$ x(s) = \left(1, f_1(s), f_2(s), f_1(s)^2, f_2(s)^2, f_1(s) f_2(s), \ldots \right)^T $$

The constant feature 1 allows the approximator to learn a bias term. This approach can capture some non-linearities, but the number of features grows quickly with the polynomial degree and the number of basic variables.
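A minimal sketch of this construction for two basic features, following the degree-two pattern above:

```python
import numpy as np

def polynomial_features(f1, f2):
    """Degree-2 polynomial features built from two basic state variables."""
    return np.array([1.0, f1, f2, f1 ** 2, f2 ** 2, f1 * f2])

x = polynomial_features(0.5, -1.2)   # 6-dimensional feature vector
```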
Tile coding is a popular technique, especially for continuous state spaces. It discretizes the space by overlaying multiple grids (tilings), each offset from the others. A state activates one tile in each tiling. The feature vector x(s) is then typically a large binary vector where each component corresponds to one tile across all tilings. A component is 1 if the corresponding tile is active for state s, and 0 otherwise.
A conceptual illustration of tile coding in 2D. A state (red dot) falls into one tile in each offset grid (Tiling 1 and Tiling 2). The active tiles (e.g., b2 and y3) correspond to '1' entries in the binary feature vector.
The key benefit of tile coding is its ability to provide good generalization. States close to each other will share some active tiles, leading to similar feature vectors and thus similar value estimates. The degree of generalization is controlled by the width of the tiles and the number/offset of the tilings.
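The sketch below shows one simple way to build such features over a 2D box: uniform grids shifted by a fraction of a tile width, flattened into one long binary vector. The grid sizes, the offset scheme, and the indexing are illustrative assumptions, not the only way to implement tile coding.

```python
import numpy as np

def tile_coding_features(state, low, high, n_tilings=4, tiles_per_dim=8):
    """Binary tile-coding features for a 2D state inside the box [low, high].

    Each tiling is a uniform grid shifted by a different fraction of a tile
    width; the state activates exactly one tile per tiling.
    """
    dims = len(low)
    tile_width = (high - low) / tiles_per_dim
    n_tiles = tiles_per_dim ** dims
    features = np.zeros(n_tilings * n_tiles)

    for t in range(n_tilings):
        offset = (t / n_tilings) * tile_width          # shift each tiling
        idx = np.floor((state - low + offset) / tile_width).astype(int)
        idx = np.clip(idx, 0, tiles_per_dim - 1)
        flat = idx[0] * tiles_per_dim + idx[1]         # tile index within tiling
        features[t * n_tiles + flat] = 1.0             # one active tile per tiling
    return features

x = tile_coding_features(np.array([0.3, -0.7]),
                         low=np.array([-1.0, -1.0]),
                         high=np.array([1.0, 1.0]))
```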
Another approach for continuous spaces involves using Radial Basis Functions (RBFs). You define a set of prototype points (centers) $c_i$ in the state space. The features are then calculated from the distance between the current state $s$ and these centers, typically using a Gaussian kernel:
$$ x_i(s) = \exp\left( -\frac{\lVert s - c_i \rVert^2}{2\sigma_i^2} \right) $$

Each feature $x_i(s)$ measures the similarity of state $s$ to the center $c_i$. States near a particular center will have a high value for the corresponding feature. Like tile coding, RBFs provide smooth generalization.
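A brief sketch of Gaussian RBF features, simplified to use a single shared width instead of a per-center $\sigma_i$; the grid of centers here is an arbitrary illustrative choice.

```python
import numpy as np

def rbf_features(state, centers, sigma=0.5):
    """Gaussian RBF features: similarity of `state` to each prototype center."""
    squared_dists = np.sum((centers - state) ** 2, axis=1)
    return np.exp(-squared_dists / (2.0 * sigma ** 2))

# Example: a 3x3 grid of centers covering the unit square.
grid = np.linspace(0.0, 1.0, 3)
centers = np.array([[cx, cy] for cx in grid for cy in grid])
x = rbf_features(np.array([0.2, 0.8]), centers)   # nine features in (0, 1]
```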
Instead of manually designing features, we can use powerful function approximators like deep neural networks to learn the relevant features directly from raw, high-dimensional input (like pixels from a screen or sensor readings). This is the core idea behind Deep Reinforcement Learning, which we will introduce in Chapter 7 (Deep Q-Networks). The network's hidden layers transform the raw input into progressively more abstract and useful representations (features) that are then used by the output layer to estimate the value function.
So far, we've focused on features x(s) for state-value functions V(s). When approximating action-value functions Q(s,a), the features usually need to depend on both the state and the action. We denote this feature vector as x(s,a).
A common approach is to first compute state features $x(s)$ and then combine them with the action $a$. For example, if there are $|A|$ discrete actions, one might construct $x(s, a)$ by placing the state feature vector $x(s)$ into the block of a larger vector that corresponds to action $a$, with zeros everywhere else.
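A minimal sketch of this block construction, assuming discrete actions indexed from 0 to |A| - 1:

```python
import numpy as np

def state_action_features(x_s, action, n_actions):
    """Place the state features in the block corresponding to `action`."""
    d = len(x_s)
    x_sa = np.zeros(d * n_actions)
    x_sa[action * d:(action + 1) * d] = x_s   # zeros in every other block
    return x_sa

x_sa = state_action_features(np.array([0.5, -0.2, 1.0]), action=1, n_actions=3)
# -> [0.0, 0.0, 0.0, 0.5, -0.2, 1.0, 0.0, 0.0, 0.0]
```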
Once we have a feature vector x(s) (or x(s,a)), it serves as the input to our chosen function approximator. For example, in linear function approximation (which we'll cover next), the value function is estimated as a weighted sum of the features:
$$ \hat{v}(s, w) \approx w^T x(s) = \sum_{i=1}^{d} w_i x_i(s) $$

or for action-values:

$$ \hat{q}(s, a, w) \approx w^T x(s, a) = \sum_{i=1}^{d} w_i x_i(s, a) $$

Here, $w = (w_1, w_2, \ldots, w_d)^T$ is the vector of weights (parameters) that the RL algorithm learns. The goal of the learning process is to find weights $w$ such that the approximation $\hat{v}$ or $\hat{q}$ is close to the true value function $v_\pi$ or $q_\pi$.
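As a small sketch of this linear form (how the weights themselves are learned is covered in the following sections):

```python
import numpy as np

def v_hat(x_s, w):
    """Linear state-value estimate: a weighted sum of the features."""
    return np.dot(w, x_s)

def q_hat(x_sa, w):
    """Linear action-value estimate using state-action features."""
    return np.dot(w, x_sa)

# Any of the feature constructions above can feed this approximator.
x = np.array([1.0, 0.5, -1.2, 0.25, 1.44, -0.6])   # e.g. polynomial features
w = np.zeros_like(x)                               # weights the agent will learn
value = v_hat(x, w)                                # 0.0 before any learning
```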
Choosing or designing appropriate features is often critical for the success of function approximation methods in reinforcement learning. Good features make the learning task easier and enable effective generalization across the state space. In the following sections, we will explore how to learn the weights w using these feature representations.