The methods reviewed so far, such as Q-Learning, SARSA, and basic Policy Gradients, often rely on a tabular representation of value functions (like V(s) or Q(s,a)) or policies (π(a∣s)). This means we maintain an explicit value or probability for every single state, or state-action pair. While conceptually simple and theoretically well-understood in small, discrete environments (like tic-tac-toe or small grid worlds), this approach quickly becomes impractical as the complexity of the problem grows.
Consider environments with very large or continuous state spaces: the game of Go has on the order of $10^{170}$ legal board positions, an Atari screen of raw pixels admits astronomically many distinct observations, and a robot arm described by continuous joint angles has infinitely many states. Enumerating a table entry for each of these is impossible.
This explosion in the number of states or state-action pairs is often called the curse of dimensionality. Tabular methods suffer from it in two main ways:
Memory: storing a separate value for every state or state-action pair quickly exceeds any realistic storage budget.
Sample complexity: each entry is learned independently, so every state (or state-action pair) must be visited many times before its estimate becomes accurate, which demands an infeasible amount of experience.
Furthermore, tabular methods cannot generalize. Learning about one state provides no information about similar, unvisited states. If you learn that a particular chess position is bad, a tabular method has no way to infer that a very slightly different, but strategically equivalent, position is also bad without visiting it explicitly.
To overcome these limitations, we introduce function approximation. Instead of storing explicit values for each state or state-action pair in a table, we use a parameterized function, with a parameter vector $\theta$, to approximate the true value function or policy:
$$\hat{V}(s;\theta) \approx V_\pi(s), \quad \hat{Q}(s,a;\theta) \approx Q_\pi(s,a), \quad \pi_\theta(a \mid s) \;\text{(stochastic policy)}, \quad \mu_\theta(s) \;\text{(deterministic policy)}$$
Here, $\hat{V}$, $\hat{Q}$, $\pi_\theta$, and $\mu_\theta$ represent functions (linear functions, neural networks, decision trees, and so on) whose behavior is determined by the parameter vector $\theta$. The goal of the learning algorithm becomes finding the parameters $\theta$ that make the approximation as accurate as possible.
Figure: Contrasting tabular methods with function approximation for representing value functions or policies.
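To make this contrast concrete, here is a minimal sketch (plain Python/NumPy, with made-up dimensions) of the two representations side by side: a tabular dictionary needs one stored entry per state-action pair it has actually seen, while a parameterized approximator defines a value for every state through a fixed-size parameter vector $\theta$.

```python
import numpy as np

# Tabular representation: one stored value per (state, action) pair.
# Memory grows with every distinct pair the agent encounters.
q_table = {}  # maps (state_tuple, action) -> value

def q_tabular(state, action):
    return q_table.get((tuple(state), action), 0.0)

# Parameterized representation: a fixed-size parameter vector theta defines
# Q_hat(s, a; theta) for every state, visited or not. Here the function is
# linear (one weight vector per action), but any differentiable model works.
state_dim, num_actions = 4, 2
theta = np.zeros((num_actions, state_dim))  # memory is fixed, independent of |S|

def q_approx(state, action):
    return theta[action] @ np.asarray(state, dtype=float)

s = [0.1, -0.3, 0.7, 0.0]
print(q_tabular(s, 1))  # 0.0 until this exact pair has been stored explicitly
print(q_approx(s, 1))   # defined for any state; similar states give similar values
```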
Various function approximators can be used in RL:
Linear Function Approximation: This is conceptually the simplest form. We first define a feature vector ϕ(s) that represents the state s (or ϕ(s,a) for state-action pairs). The approximation is then a linear combination of these features:
$$\hat{V}(s;\theta) = \theta^\top \phi(s) = \sum_{i=1}^{d} \theta_i \phi_i(s), \qquad \hat{Q}(s,a;\theta) = \theta^\top \phi(s,a) = \sum_{i=1}^{d} \theta_i \phi_i(s,a)$$
Here, $d$ is the number of features. The quality of the approximation depends heavily on the quality of the hand-crafted features $\phi$. While simple and computationally efficient, linear methods may lack the capacity to represent complex, non-linear value functions or policies. (A simple linear approximator appears in the sketch above; a neural-network counterpart follows after this list.)
Non-linear Function Approximation (Neural Networks): Deep Neural Networks (DNNs) have become the dominant choice for function approximation in modern RL, leading to the field of Deep Reinforcement Learning (DRL). A neural network can learn complex, non-linear relationships directly from raw inputs (like pixels from a screen or joint angles) without requiring manual feature engineering.
Common architectures include Multi-Layer Perceptrons (MLPs) for vector inputs, Convolutional Neural Networks (CNNs) for image inputs, and Recurrent Neural Networks (RNNs, like LSTMs or GRUs) for sequential inputs or partially observable environments. The power of DNNs lies in their ability to learn hierarchical features automatically and approximate highly complex functions.
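As a minimal sketch of the non-linear case (PyTorch is assumed here purely for illustration, and the class name, layer sizes, and hidden width are arbitrary choices rather than a prescribed architecture), a small MLP Q-network maps a raw state vector directly to one Q-value per action:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """A small MLP: raw state vector in, one Q-value per action out."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

q_net = QNetwork(state_dim=4, num_actions=2)
s = torch.randn(1, 4)    # a dummy state vector
print(q_net(s))          # approximate Q-values for both actions
```

For image observations, the fully connected stack would typically be replaced with convolutional layers, and for partially observable problems with a recurrent cell.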
RL algorithms are adapted to work with function approximators, typically by using gradient-based optimization methods. Instead of updating table entries, we update the parameters θ.
Value-Based Methods (e.g., Q-Learning): In tabular Q-learning, the update rule adjusts $Q(s_t, a_t)$ towards a target value $y_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a')$. With function approximation $\hat{Q}(s,a;\theta)$, we treat this as a supervised learning problem. The target $y_t$ (often computed using a separate, slowly updated target network with parameters $\theta^-$ for stability, as we'll see in Chapter 2) becomes the "label" for the input $(s_t, a_t)$. We aim to minimize the error between the predicted Q-value $\hat{Q}(s_t, a_t; \theta)$ and the target $y_t$. A common loss function is the Mean Squared Bellman Error (MSBE):
$$L(\theta) = \mathbb{E}\left[\left(y_t - \hat{Q}(s_t, a_t; \theta)\right)^2\right], \quad \text{where } y_t = r_{t+1} + \gamma \max_{a'} \hat{Q}(s_{t+1}, a'; \theta^-)$$
The parameters $\theta$ are then updated using gradient descent on this loss: $\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta)$.
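The sketch below performs one such gradient step on a batch of transitions. It is a simplified illustration assuming PyTorch; the batch tensors, network sizes, and hyperparameters are placeholders rather than recommended settings.

```python
import torch
import torch.nn as nn

state_dim, num_actions, gamma = 4, 2, 0.99

def make_q_net():
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                         nn.Linear(64, num_actions))

q_net = make_q_net()                            # online network, parameters theta
target_net = make_q_net()                       # target network, parameters theta^-
target_net.load_state_dict(q_net.state_dict())  # initialize theta^- = theta

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def q_learning_step(states, actions, rewards, next_states, dones):
    """One semi-gradient step on the mean squared Bellman error."""
    # Predicted Q(s_t, a_t; theta) for the actions actually taken.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target y_t = r_{t+1} + gamma * max_a' Q(s_{t+1}, a'; theta^-);
    # no gradient is propagated through the target network.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones) * next_q

    loss = nn.functional.mse_loss(q_pred, y)    # empirical MSBE over the batch
    optimizer.zero_grad()
    loss.backward()                             # compute grad_theta L(theta)
    optimizer.step()                            # theta <- theta - alpha * grad
    return loss.item()

# Example call with a dummy batch of 32 transitions.
B = 32
q_learning_step(torch.randn(B, state_dim),
                torch.randint(num_actions, (B,)),
                torch.randn(B), torch.randn(B, state_dim),
                torch.zeros(B))
```

How often $\theta^-$ is refreshed from $\theta$, and how the batch of transitions is collected and sampled, are exactly the stability questions taken up in Chapter 2.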
Policy-Based Methods (e.g., REINFORCE): As reviewed earlier, the policy gradient theorem provides an expression for the gradient of the expected return J(θ):
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$
Here, $\pi_\theta(a \mid s)$ is our parameterized policy (e.g., a neural network). We estimate this gradient using sampled trajectories $\tau$ generated by running the current policy $\pi_\theta$ in the environment. The parameters are then updated using gradient ascent: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.
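The fragment below is a minimal sketch of this update for a single completed episode, again assuming PyTorch; `policy_net` and the trajectory tensors (`states`, `actions`, `rewards`) are illustrative placeholders.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

state_dim, num_actions, gamma = 4, 2, 0.99
policy_net = nn.Sequential(                 # pi_theta(a|s): state -> action logits
    nn.Linear(state_dim, 64), nn.ReLU(),
    nn.Linear(64, num_actions),
)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def reinforce_update(states, actions, rewards):
    """One REINFORCE step. states: (T, state_dim) float tensor,
    actions: (T,) long tensor, rewards: list of T floats."""
    # Discounted returns G_t, computed backwards along the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns, dtype=torch.float32)

    # log pi_theta(a_t | s_t) for every visited state-action pair.
    dist = Categorical(logits=policy_net(states))
    log_probs = dist.log_prob(actions)

    # Minimizing -sum_t G_t log pi_theta(a_t|s_t) is gradient *ascent* on J(theta).
    loss = -(returns * log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with a dummy 10-step episode.
T = 10
reinforce_update(torch.randn(T, state_dim),
                 torch.randint(num_actions, (T,)),
                 [1.0] * T)
```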
Actor-Critic Methods: These methods, which we will explore in detail in Chapter 3, use function approximation for both the policy (actor) and the value function (critic). The critic helps provide lower-variance estimates for the policy update, combining ideas from both value-based and policy-based approaches.
Using function approximation offers significant advantages:
Generalization: knowledge gained in one state transfers to similar, unvisited states, because nearby inputs produce similar outputs.
Scalability: memory grows with the number of parameters $\theta$, not with the number of states, so very large or continuous state spaces become tractable.
Continuous inputs: real-valued state descriptions (and, with suitable parameterizations, continuous actions) can be handled directly, without discretization.
However, it also introduces new challenges:
Loss of convergence guarantees: combining function approximation with bootstrapping and off-policy learning (the so-called "deadly triad") can make value estimates unstable or even divergent.
Non-stationary targets: in value-based methods the learning target depends on the current parameters $\theta$, which is one reason stabilization techniques such as target networks (Chapter 2) are needed.
Design and tuning burden: the choice of features or network architecture, learning rates, and other hyperparameters strongly affects performance and can be difficult to get right.
Function approximation is indispensable for applying RL to complex, real-world problems. Understanding how to integrate it effectively with core RL algorithms, and being aware of the associated challenges, is fundamental to mastering advanced reinforcement learning techniques. The subsequent chapters will build extensively on these ideas, particularly using deep neural networks as the primary function approximators.