Function approximation allows us to represent value functions, such as $\hat{v}(s, \mathbf{w})$ or $\hat{q}(s, a, \mathbf{w})$, using a set of parameters or weights, denoted by the vector $\mathbf{w}$. For linear function approximation, this takes the form $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$, where $\mathbf{x}(s)$ is the feature vector for state $s$. Determining the best values for these weights $\mathbf{w}$ is a primary objective.
The answer lies in borrowing a fundamental technique from supervised machine learning: gradient descent. The idea is to adjust the weights iteratively to minimize some measure of error between our approximation and the true value function we're trying to learn.
Defining the Objective: Minimizing Error
Our goal is to make our approximation $\hat{v}(S_t, \mathbf{w})$ as close as possible to the true value $v_\pi(S_t)$ for all states $S_t$ encountered under the policy $\pi$. A standard way to quantify this is the Mean Squared Value Error (MSVE), which we aim to minimize:
$$J(\mathbf{w}) = \mathbb{E}_\pi\!\left[\left(v_\pi(S_t) - \hat{v}(S_t, \mathbf{w})\right)^2\right]$$
This objective measures the expected squared difference between the true value and our approximation, averaged over the states visited while following policy $\pi$.
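To make the objective concrete, here is a minimal NumPy sketch (the name `empirical_msve` and its signature are illustrative, not from the text) that estimates $J(\mathbf{w})$ from a batch of sampled states, assuming for the moment that the true values $v_\pi(S_t)$ are available:

```python
import numpy as np

def empirical_msve(true_values, features, w):
    """Sample-based estimate of J(w) = E[(v_pi(S_t) - v_hat(S_t, w))^2].

    true_values: shape (n,)   -- v_pi(S_t) for each sampled state (assumed known here)
    features:    shape (n, d) -- feature vectors x(S_t)
    w:           shape (d,)   -- weight vector of the linear approximator
    """
    predictions = features @ w                      # v_hat(S_t, w) = w^T x(S_t)
    return np.mean((true_values - predictions) ** 2)
```

Of course, this sketch assumes access to $v_\pi$, which is exactly what we usually lack, as discussed next.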
The Challenge in Reinforcement Learning: Unknown Targets
In typical supervised learning, you'd have a dataset of input-output pairs (like state features $\mathbf{x}(s)$ and their corresponding true values $v_\pi(s)$). You could then directly apply gradient descent to minimize the MSVE on this dataset.
However, in reinforcement learning, we almost never know the true value function $v_\pi(s)$! Instead, we have estimates or targets derived from experience. These targets are based on the rewards received and subsequent state values (either observed until the end of an episode or bootstrapped). Examples include:
- Monte Carlo (MC) Target: $U_t = G_t$. This is the full return observed starting from state $S_t$ until the end of the episode. It's an unbiased estimate of $v_\pi(S_t)$.
- Temporal Difference (TD) Target: $U_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$. This is the TD(0) target, using the immediate reward $R_{t+1}$ and the current estimate of the next state's value, $\hat{v}(S_{t+1}, \mathbf{w})$. This target is biased because it relies on our current (possibly inaccurate) estimate. (Both targets are sketched in code below.)
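As a rough sketch (function names and signatures are my own, not from the text), the two targets can be computed like this, given a recorded list of rewards and a linear approximator $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$:

```python
import numpy as np

def mc_target(rewards, t, gamma):
    """Monte Carlo target U_t = G_t: the discounted return from step t to episode end.

    Here rewards[t] is taken to be R_{t+1}, the reward received after leaving S_t.
    """
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

def td_target(reward, next_features, w, gamma, terminal=False):
    """TD(0) target U_t = R_{t+1} + gamma * v_hat(S_{t+1}, w); the bootstrap term is 0 at termination."""
    bootstrap = 0.0 if terminal else float(w @ next_features)
    return reward + gamma * bootstrap
```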
Since we collect experience step-by-step (state, action, reward, next state), we can't compute the full expectation in the MSVE objective easily. Instead, we use Stochastic Gradient Descent (SGD).
Stochastic Gradient Descent (SGD) for Value Function Approximation
SGD updates the weights incrementally, using one sample (or a small batch) of experience at a time. For each interaction step $t$ yielding a state $S_t$ and a target value $U_t$, we update the weight vector $\mathbf{w}$ in the direction that reduces the error for that specific sample.
The general idea of gradient descent is to update the parameters in the direction opposite to the gradient of the error function. For a single sample $(S_t, U_t)$, the squared error is $\tfrac{1}{2}\left(U_t - \hat{v}(S_t, \mathbf{w})\right)^2$. (We include the factor of $\tfrac{1}{2}$ for mathematical convenience, as it cancels out during differentiation.)
The gradient of this single-sample error with respect to the weights $\mathbf{w}$ is:

$$\nabla_{\mathbf{w}} \left[ \tfrac{1}{2}\left(U_t - \hat{v}(S_t, \mathbf{w})\right)^2 \right] = \left(U_t - \hat{v}(S_t, \mathbf{w})\right)\left(-\nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})\right) = -\left(U_t - \hat{v}(S_t, \mathbf{w})\right)\nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})$$
The SGD update rule moves the weights in the opposite direction of this gradient, scaled by a learning rate $\alpha$:

$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \alpha\left[-\left(U_t - \hat{v}(S_t, \mathbf{w}_t)\right)\nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}_t)\right]$$

$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \alpha\left(U_t - \hat{v}(S_t, \mathbf{w}_t)\right)\nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}_t)$$
Here:
- $\mathbf{w}_t$ is the weight vector before the update.
- $\alpha$ is the learning rate (a small positive number controlling the step size).
- $\left(U_t - \hat{v}(S_t, \mathbf{w}_t)\right)$ is the prediction error for the current sample.
- $\nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}_t)$ is the gradient of the value function approximation with respect to the weights, evaluated at state $S_t$. This term tells us how changing each weight affects the value estimate for the current state.
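Putting these pieces together, a single generic SGD step might look like the following sketch, where `value_fn` and `grad_fn` are placeholders (my own naming) for whatever approximator and gradient computation you are using:

```python
def sgd_value_update(w, state, target, alpha, value_fn, grad_fn):
    """One SGD step on the single-sample error 1/2 * (U_t - v_hat(S_t, w))^2.

    value_fn(state, w) -> scalar estimate v_hat(S_t, w)
    grad_fn(state, w)  -> gradient of v_hat(S_t, w) with respect to w (same shape as w)
    """
    error = target - value_fn(state, w)           # prediction error U_t - v_hat(S_t, w_t)
    return w + alpha * error * grad_fn(state, w)  # move the weights to reduce that error
```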
Applying SGD to Linear Function Approximation
Things become quite concrete when we use linear function approximation, $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s) = \sum_{i=1}^{d} w_i x_i(s)$. The gradient of this approximation with respect to the weight vector $\mathbf{w}$ is simply the feature vector $\mathbf{x}(s)$ itself:

$$\nabla_{\mathbf{w}} \hat{v}(s, \mathbf{w}) = \mathbf{x}(s)$$
Substituting this into the general SGD update rule gives the update rule for linear VFA:

$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \alpha\left(U_t - \hat{v}(S_t, \mathbf{w}_t)\right)\mathbf{x}(S_t)$$
This update has an intuitive interpretation:
- Calculate the error: $\delta_t = U_t - \hat{v}(S_t, \mathbf{w}_t)$. This is how far off our current prediction was from the target.
- Scale the error by the learning rate: $\alpha \delta_t$.
- Adjust the weights based on the features: multiply the scaled error by the feature vector $\mathbf{x}(S_t)$. Features that were active ($x_i(S_t) \neq 0$) for the state $S_t$ will have their corresponding weights $w_i$ adjusted. If the error $\delta_t$ is positive (underestimation), weights corresponding to active features are increased. If the error is negative (overestimation), they are decreased.
*Figure: Flow of a single stochastic gradient descent update step for linear value function approximation.*
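Specializing the generic sketch above to the linear case, where the gradient is just the feature vector, gives something like the following (again, the names and the toy numbers are illustrative only):

```python
import numpy as np

def linear_sgd_update(w, x, target, alpha):
    """Linear VFA update: w <- w + alpha * (U_t - w^T x) * x."""
    delta = target - w @ x        # prediction error delta_t
    return w + alpha * delta * x  # only weights of active features change

# Toy example: two of three binary features are active and the target exceeds
# the current estimate (underestimation), so those two weights are nudged upward.
w = np.zeros(3)
x = np.array([1.0, 0.0, 1.0])
w = linear_sgd_update(w, x, target=1.0, alpha=0.1)
print(w)  # [0.1 0.  0.1]
```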
The Semi-gradient Distinction
There's a subtle but important point when using TD targets like $U_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t)$. Notice that the target $U_t$ itself depends on the current weights $\mathbf{w}_t$ through the term $\hat{v}(S_{t+1}, \mathbf{w}_t)$.
When we calculated the gradient of the single-sample squared error $\tfrac{1}{2}\left(U_t - \hat{v}(S_t, \mathbf{w}_t)\right)^2$, we treated the target $U_t$ as if it were a fixed, constant value, even though it depends on $\mathbf{w}_t$. A true gradient calculation would need to account for the derivative of the target with respect to $\mathbf{w}_t$, which is significantly more complex.
By ignoring the dependence of the target on the weights, we are not following the true gradient of the MSVE (using the TD target). Instead, we are following what's called a semi-gradient. Methods that use bootstrapping (like TD learning) with function approximation are therefore called semi-gradient methods.
Why do we do this?
- Simplicity: Semi-gradient updates are much simpler to compute and implement.
- Effectiveness: Empirically, semi-gradient TD methods work very well, and with linear function approximation and on-policy sampling they converge reliably under standard conditions (such as appropriately decaying learning rates). They do, however, converge to a fixed point that is generally slightly different from the minimum-MSVE solution.
In contrast, if you use the Monte Carlo target $U_t = G_t$, the target does not depend on the current weights $\mathbf{w}_t$. Therefore, SGD updates using the MC target are true gradient methods and are guaranteed to find a local minimum of the MSVE objective (a global minimum in the linear case, under appropriate step-size conditions).
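To show where the semi-gradient update sits inside a full prediction loop, here is a hedged sketch of semi-gradient TD(0) with linear features. It assumes a classic Gym-style environment (`reset()` returning a state, `step()` returning `(state, reward, done, info)`), a feature map `phi`, and a fixed `policy`; all of these names and interfaces are assumptions, not part of the text:

```python
import numpy as np

def semi_gradient_td0(env, policy, phi, d, num_episodes, alpha=0.05, gamma=0.99):
    """Semi-gradient TD(0) for linear value prediction (sketch under the assumptions above).

    The bootstrapped target R_{t+1} + gamma * w^T phi(S_{t+1}) is treated as a constant
    with respect to w -- that is exactly the semi-gradient simplification.
    """
    w = np.zeros(d)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done, _ = env.step(policy(s))
            target = r + (0.0 if done else gamma * (w @ phi(s_next)))  # target held fixed
            w += alpha * (target - w @ phi(s)) * phi(s)                # semi-gradient step
            s = s_next
    return w
```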
The Learning Rate
The learning rate α is a hyperparameter that determines how large each update step is.
- If α is too large, the updates might overshoot the optimal weights, leading to oscillations or divergence.
- If α is too small, learning will be very slow.
In practice, α is often started at a moderate value and gradually decreased over time. This allows for larger adjustments early in training when the weights are likely far from optimal, and finer tuning later on.
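As one simple illustration (many other schedules are used in practice, and the constants here are arbitrary), an inverse-time decay looks like this:

```python
def decayed_alpha(alpha0, t, decay=1e-3):
    """Inverse-time decay: larger steps early in training, finer steps as t grows."""
    return alpha0 / (1.0 + decay * t)

# With alpha0 = 0.5: roughly 0.5 at t = 0, 0.25 at t = 1,000, 0.05 at t = 9,000.
```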
By combining feature representations with gradient descent methods, we can now train RL agents on problems with state spaces far too large for the tabular methods we discussed previously. The next section introduces how non-linear function approximators, particularly neural networks, can be used to handle even more complex relationships between states and values.