Different reinforcement learning algorithms approach policy updates in different ways. SARSA, for instance, is an on-policy method that learns from the actual next action A′ taken according to the current policy π. Q-learning, by contrast, is an off-policy method that learns from the greedy next action, $\max_{a'} Q(S', a')$. A related on-policy algorithm, Expected SARSA, is a small variation on SARSA with potentially beneficial properties.
Recall the SARSA update rule:
$$Q(S,A) \leftarrow Q(S,A) + \alpha \left[ R + \gamma Q(S',A') - Q(S,A) \right]$$
The term Q(S′,A′) depends on the specific action A′ that was sampled from the policy π in state S′. This sampling introduces randomness (variance) into the update target. If the policy is highly stochastic, or if certain actions have significantly different Q-values, this variance can slow down or destabilize learning.
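To make the role of the sampled next action concrete, here is a minimal sketch of how the SARSA target is computed from a tabular Q function; the NumPy Q table, the ε-greedy helper, and all names here are illustrative assumptions rather than anything prescribed above.

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, rng):
    """Sample an action from an epsilon-greedy policy over Q[state]."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: uniform random action
    return int(np.argmax(Q[state]))           # exploit: greedy action

def sarsa_target(Q, reward, next_state, next_action, gamma):
    """SARSA target: depends on the single sampled next action A'."""
    return reward + gamma * Q[next_state, next_action]
```

Because `next_action` is a random draw from π, two updates from the same (S, A, R, S′) transition can see quite different targets; that sampling noise is exactly the variance Expected SARSA aims to remove.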
Expected SARSA aims to reduce this variance by replacing the single sampled next action-value Q(S′,A′) with the expected value of the next action-value, averaged over all possible next actions a′ according to the current policy π(a′∣S′).
The update rule for Expected SARSA is:
$$Q(S,A) \leftarrow Q(S,A) + \alpha \left[ R + \gamma \sum_{a'} \pi(a' \mid S')\, Q(S',a') - Q(S,A) \right]$$
Let's break down the target value $R + \gamma \sum_{a'} \pi(a' \mid S')\, Q(S',a')$:
- R: The immediate reward received after taking action A in state S.
- γ: The discount factor.
- $\sum_{a'} \pi(a' \mid S')\, Q(S',a')$: The expected Q-value for the next state S′, calculated by summing the Q-value of each possible next action a′ weighted by the probability of taking that action under the current policy π.
Essentially, instead of waiting to see which specific action A′ the policy chooses next and using its Q-value, Expected SARSA considers all possible next actions and their probabilities according to the current policy to compute a smoother, averaged target.
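Continuing the same illustrative setup, here is a sketch of the Expected SARSA target for an ε-greedy policy, with the action probabilities π(a′∣S′) written out explicitly (helper names and signatures are assumptions for illustration):

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon):
    """Action probabilities pi(a|s) of an epsilon-greedy policy over one state's Q-values."""
    n_actions = len(q_row)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_row)] += 1.0 - epsilon
    return probs

def expected_sarsa_target(Q, reward, next_state, gamma, epsilon):
    """Expected SARSA target: expectation of Q(S', a') over a' ~ pi(.|S')."""
    probs = epsilon_greedy_probs(Q[next_state], epsilon)
    expected_q = float(np.dot(probs, Q[next_state]))   # sum_a' pi(a'|S') Q(S', a')
    return reward + gamma * expected_q
```

Note that this target is a deterministic function of S′ (given Q and π), so no sampling noise from the choice of A′ enters the update.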
Comparison with SARSA and Q-Learning
Expected SARSA shares similarities with both SARSA and Q-learning but has distinct characteristics:
- vs. SARSA: Both are on-policy methods (the update uses values derived from the policy π being followed). However, Expected SARSA computes the expectation over next actions, while SARSA uses a single sample A′. This generally gives Expected SARSA lower variance, which can lead to more stable performance, especially in environments with high stochasticity or near "cliffs" where a single wrong sample can be misleading. The computational cost per update is slightly higher for Expected SARSA if the action space is large, as it requires iterating through all possible next actions a′.
- vs. Q-Learning: The update rules look similar, but Q-learning uses a max operator in its target ($R + \gamma \max_{a'} Q(S',a')$), making it off-policy: it updates towards the greedy action regardless of the policy being followed. Expected SARSA instead takes an expectation under the current policy π, keeping it on-policy. Interestingly, if π is greedy with respect to the current Q-values (i.e., $\pi(a' \mid S') = 1$ for $a' = \arg\max_{a''} Q(S',a'')$ and 0 otherwise), the Expected SARSA update becomes identical to the Q-learning update; a quick numerical check of this equivalence is sketched below.
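Under the same illustrative toy setup as before: with a fully greedy policy (ε = 0) the probability vector is one-hot on the argmax action, so the expected target collapses to the Q-learning target.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 3))            # toy Q table: 5 states, 3 actions (made-up values)
reward, next_state, gamma = 1.0, 2, 0.9

# Greedy policy: all probability mass on the argmax action in S'.
greedy_probs = np.zeros(Q.shape[1])
greedy_probs[np.argmax(Q[next_state])] = 1.0

expected_sarsa_tgt = reward + gamma * np.dot(greedy_probs, Q[next_state])
q_learning_tgt     = reward + gamma * np.max(Q[next_state])
assert np.isclose(expected_sarsa_tgt, q_learning_tgt)   # identical targets
```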
Here's a quick comparison:
| Feature | SARSA | Q-Learning | Expected SARSA |
| --- | --- | --- | --- |
| Update target | $R + \gamma Q(S',A')$ | $R + \gamma \max_{a'} Q(S',a')$ | $R + \gamma \sum_{a'} \pi(a' \mid S')\, Q(S',a')$ |
| Policy type | On-policy | Off-policy | On-policy |
| Next-action term | Action A′ actually taken under π | Action yielding the maximum Q-value in S′ | Expectation over all actions a′ under π in S′ |
| Target variance | Higher (uses a sampled A′) | Lower (deterministic max) | Generally lower than SARSA (uses an expectation) |
Expected SARSA often provides a good balance, retaining the on-policy nature of SARSA while achieving variance reduction similar to Q-learning, sometimes leading to improved performance stability. It's a valuable alternative to consider within the family of TD control algorithms.
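As a closing sketch, here is a complete tabular Expected SARSA training loop, assuming a Gymnasium-style environment with discrete states and actions; the environment id, hyperparameters, and helper functions are placeholders, not something prescribed by the text above.

```python
import numpy as np
import gymnasium as gym

def epsilon_greedy_probs(q_row, epsilon):
    """Epsilon-greedy action probabilities for one state's Q-values."""
    n_actions = len(q_row)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_row)] += 1.0 - epsilon
    return probs

def train_expected_sarsa(env, episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    """Tabular Expected SARSA on an environment with discrete observation/action spaces."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Behave with the epsilon-greedy policy pi derived from Q.
            probs = epsilon_greedy_probs(Q[state], epsilon)
            action = int(rng.choice(len(probs), p=probs))
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Expected SARSA target: expectation over next actions under pi.
            if terminated:
                target = reward
            else:
                next_probs = epsilon_greedy_probs(Q[next_state], epsilon)
                target = reward + gamma * np.dot(next_probs, Q[next_state])
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q

# Example usage (environment id is an assumption):
# Q = train_expected_sarsa(gym.make("CliffWalking-v0"))
```

Because the expectation in the target is taken under the same ε-greedy policy used to select actions, the algorithm remains on-policy, matching the discussion above.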