Okay, we've established that function approximation lets us represent value functions like $\hat{v}(s, \mathbf{w})$ or $\hat{q}(s, a, \mathbf{w})$ using a set of parameters or weights, denoted by the vector $\mathbf{w}$. For linear function approximation, this looks like $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$, where $\mathbf{x}(s)$ is the feature vector for state $s$. Now, the central question is: how do we find the best values for these weights $\mathbf{w}$?
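To make the linear form concrete, here is a minimal NumPy sketch of computing $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$. The `linear_value` helper and the feature and weight values are made up for illustration.

```python
import numpy as np

def linear_value(state_features: np.ndarray, w: np.ndarray) -> float:
    """Linear value estimate v_hat(s, w) = w^T x(s)."""
    return float(w @ state_features)

# Hypothetical 4-dimensional feature vector and matching weight vector.
x_s = np.array([1.0, 0.5, 0.0, 2.0])   # features x(s) for some state s
w = np.array([0.1, -0.3, 0.8, 0.05])   # current weights
print(linear_value(x_s, w))            # 0.1*1.0 - 0.3*0.5 + 0.05*2.0 = 0.05
```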
The answer lies in borrowing a fundamental technique from supervised machine learning: gradient descent. The idea is to adjust the weights iteratively to minimize some measure of error between our approximation and the true value function we're trying to learn.
Our goal is to make our approximation $\hat{v}(S_t, \mathbf{w})$ as close as possible to the true value $v_\pi(S_t)$ for all states $S_t$ encountered under the policy $\pi$. A standard way to quantify this is the Mean Squared Value Error (MSVE), which we aim to minimize:
$$J(\mathbf{w}) = \mathbb{E}_\pi\left[\left(v_\pi(S_t) - \hat{v}(S_t, \mathbf{w})\right)^2\right]$$

This objective measures the expected squared difference between the true value and our approximation, averaged over the states visited while following policy $\pi$.
In typical supervised learning, you'd have a dataset of input-output pairs (like state features $\mathbf{x}(s)$ and their corresponding true values $v_\pi(s)$). You could then directly apply gradient descent to minimize the MSVE on this dataset.
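As a point of reference, here is a small sketch of that supervised setup with invented data: a handful of feature vectors with pretend-known true values $v_\pi(s)$, and full-batch gradient descent on the MSVE. All names and numbers below are illustrative.

```python
import numpy as np

# Hypothetical supervised setup: a feature matrix X (one row per state) and
# pretend-known true values v_pi. In RL we will not have v_pi; this is only
# the supervised baseline that the rest of the section adapts.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])           # feature vectors x(s) for three states
v_pi = np.array([1.0, 2.0, 3.5])     # "true" values, known only in this toy example

def msve(w, X, v_pi):
    """Mean squared value error over the dataset."""
    errors = v_pi - X @ w
    return np.mean(errors ** 2)

def batch_gradient_step(w, X, v_pi, alpha=0.1):
    """One full-batch gradient descent step on the MSVE."""
    errors = v_pi - X @ w                    # (v_pi(s) - v_hat(s, w)) for each state
    grad = -2.0 * X.T @ errors / len(v_pi)   # gradient of the MSVE w.r.t. w
    return w - alpha * grad

w = np.zeros(2)
for _ in range(500):
    w = batch_gradient_step(w, X, v_pi)
print(w, msve(w, X, v_pi))   # weights approach the least-squares fit
```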
However, in reinforcement learning, we almost never know the true value function $v_\pi(s)$! Instead, we have estimates or targets derived from experience. These targets are based on the rewards received and subsequent state values (either observed until the end of an episode or bootstrapped). Examples include:

- The Monte Carlo target: the complete return observed after visiting the state, $U_t = G_t$.
- The TD(0) target: the bootstrapped estimate $U_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$.
Since we collect experience step-by-step (state, action, reward, next state), we can't compute the full expectation in the MSVE objective easily. Instead, we use Stochastic Gradient Descent (SGD).
SGD updates the weights incrementally using one sample (or a small batch) of experience at a time. For each interaction step $t$ yielding a state $S_t$ and a target value $U_t$, we update the weight vector $\mathbf{w}$ in the direction that reduces the error for that specific sample.
The general idea of gradient descent is to update the parameters in the direction opposite to the gradient of the error function. For a single sample $(S_t, U_t)$, the squared error is $\frac{1}{2}\left(U_t - \hat{v}(S_t, \mathbf{w})\right)^2$. (We include the $\frac{1}{2}$ for mathematical convenience, as it cancels out during differentiation.)
The gradient of this single-sample error with respect to the weights $\mathbf{w}$ is:

$$\nabla_{\mathbf{w}}\left[\tfrac{1}{2}\left(U_t - \hat{v}(S_t, \mathbf{w})\right)^2\right] = \left(U_t - \hat{v}(S_t, \mathbf{w})\right)\left(-\nabla_{\mathbf{w}}\hat{v}(S_t, \mathbf{w})\right) = -\left(U_t - \hat{v}(S_t, \mathbf{w})\right)\nabla_{\mathbf{w}}\hat{v}(S_t, \mathbf{w})$$

The SGD update rule moves the weights in the opposite direction of this gradient, scaled by a learning rate $\alpha$:
$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \alpha\left[-\left(U_t - \hat{v}(S_t, \mathbf{w}_t)\right)\nabla_{\mathbf{w}}\hat{v}(S_t, \mathbf{w}_t)\right]$$

$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \alpha\left(U_t - \hat{v}(S_t, \mathbf{w}_t)\right)\nabla_{\mathbf{w}}\hat{v}(S_t, \mathbf{w}_t)$$

Here:

- $\mathbf{w}_t$ is the weight vector before the update.
- $\alpha$ is the learning rate, which controls the step size.
- $U_t$ is the target value for state $S_t$.
- $\left(U_t - \hat{v}(S_t, \mathbf{w}_t)\right)$ is the prediction error for this sample.
- $\nabla_{\mathbf{w}}\hat{v}(S_t, \mathbf{w}_t)$ is the gradient of the approximation with respect to the weights.
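As a sketch, this general update can be written as a single function that takes the approximator and its gradient as callables. The names `sgd_update`, `value_fn`, and `grad_fn` are my own placeholders, not a fixed API.

```python
import numpy as np

def sgd_update(w, target, x_s, value_fn, grad_fn, alpha):
    """One-sample update: w <- w + alpha * (U_t - v_hat(S_t, w)) * grad_w v_hat(S_t, w).

    value_fn(x, w) returns v_hat(S_t, w) and grad_fn(x, w) returns its gradient
    with respect to w, so the same update applies to any differentiable approximator.
    """
    error = target - value_fn(x_s, w)            # prediction error (U_t - v_hat)
    return w + alpha * error * grad_fn(x_s, w)   # step along the (semi-)gradient
```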
Things become quite concrete when we use linear function approximation, $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s) = \sum_{i=1}^{d} w_i x_i(s)$. The gradient of this approximation with respect to the weight vector $\mathbf{w}$ is simply the feature vector $\mathbf{x}(s)$ itself:
$$\nabla_{\mathbf{w}}\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)$$

Substituting this into the general SGD update rule gives the update rule for linear VFA:
$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \alpha\left(U_t - \hat{v}(S_t, \mathbf{w}_t)\right)\mathbf{x}(S_t)$$

This update has an intuitive interpretation: the error $\left(U_t - \hat{v}(S_t, \mathbf{w}_t)\right)$ sets the direction and magnitude of the adjustment, and each weight $w_i$ moves in proportion to its feature value $x_i(S_t)$, so the features that contributed most to the estimate receive the largest corrections.
*Figure: Flow of a single Stochastic Gradient Descent update step for linear value function approximation.*
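Here is a minimal sketch of that single linear update step with made-up numbers; `linear_sgd_update` and the values are illustrative only.

```python
import numpy as np

def linear_sgd_update(w, target, x_s, alpha):
    """Linear VFA update: w <- w + alpha * (U_t - w^T x(S_t)) * x(S_t)."""
    error = target - w @ x_s          # prediction error for this sample
    return w + alpha * error * x_s    # each weight moves in proportion to its feature

w = np.zeros(3)                       # start with all-zero weights
x_s = np.array([1.0, 0.0, 2.0])       # features of the visited state S_t
U_t = 1.5                             # target value (a return or a TD target)
w = linear_sgd_update(w, U_t, x_s, alpha=0.1)
print(w)                              # [0.15, 0.0, 0.3]: only active features change
```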
There's a subtle but important point when using TD targets like $U_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t)$. Notice that the target $U_t$ itself depends on the current weights $\mathbf{w}_t$ through the term $\hat{v}(S_{t+1}, \mathbf{w}_t)$.
When we calculated the gradient of the single-sample squared error $\frac{1}{2}\left(U_t - \hat{v}(S_t, \mathbf{w}_t)\right)^2$, we treated the target $U_t$ as if it were a fixed, constant value, even though it depends on $\mathbf{w}_t$. A true gradient calculation would need to account for the derivative of the target with respect to $\mathbf{w}_t$, which is significantly more complex.
By ignoring the dependence of the target on the weights, we are not following the true gradient of the error we are minimizing when the TD target is used. Instead, we are following what's called a semi-gradient. Methods that combine bootstrapping (like TD learning) with function approximation are therefore called semi-gradient methods.
Why do we do this? Mainly for simplicity and because it works well in practice: treating the target as a constant keeps the update cheap to compute, bootstrapped updates are available at every step rather than only at the end of an episode, and with linear function approximation semi-gradient TD(0) still converges reliably under the usual step-size conditions (to a point known as the TD fixed point, which lies close to the minimum of the MSVE).
In contrast, if you use the Monte Carlo target $U_t = G_t$, the target does not depend on the current weights $\mathbf{w}_t$. Therefore, SGD updates using the MC target are true gradient methods and are guaranteed to converge to a local minimum of the MSVE objective (and, with linear VFA and appropriate step-size conditions, to its global minimum).
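To make the distinction concrete, here is a sketch of the two targets alongside the shared linear update; the TD target is computed from the current weights but treated as a constant in the update. The function names are my own, not an established API.

```python
import numpy as np

def td0_target(reward, x_next, w, gamma):
    """TD(0) target: R_{t+1} + gamma * v_hat(S_{t+1}, w).

    It is computed from the current weights, but the update below never
    differentiates through it -- which is what makes the method semi-gradient.
    """
    return reward + gamma * (w @ x_next)

def mc_target(rewards, gamma):
    """Monte Carlo target G_t: the discounted return, independent of w."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def update(w, target, x_s, alpha):
    """The same linear update is used with either target."""
    return w + alpha * (target - w @ x_s) * x_s
```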
The learning rate $\alpha$ is a hyperparameter that determines how large each update step is. If $\alpha$ is too large, the weights can oscillate or diverge; if it is too small, learning becomes very slow.

In practice, $\alpha$ is often started at a moderate value and gradually decreased over time. This allows for larger adjustments early in training when the weights are likely far from optimal, and finer tuning later on.
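For example, a simple (hypothetical) schedule divides an initial rate by a slowly growing factor; the constants here are illustrative.

```python
def decayed_alpha(alpha_0, t, decay=1e-3):
    """A simple 1/(1 + decay*t) schedule; alpha_0 and decay are illustrative."""
    return alpha_0 / (1.0 + decay * t)

print(decayed_alpha(0.5, 0))       # 0.5 at the start of training
print(decayed_alpha(0.5, 10_000))  # ~0.045 after 10,000 updates
```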
By combining feature representations with gradient descent methods, we can now train RL agents on problems with state spaces far too large for the tabular methods we discussed previously. The next section introduces how non-linear function approximators, particularly neural networks, can be used to handle even more complex relationships between states and values.