While SARSA learns based on the actual next action taken ($A'$) according to the current policy $\pi$, and Q-learning learns based on the greedy next action ($\max_{a'} Q(S', a')$), there's another related on-policy algorithm called Expected SARSA that offers a slight variation with potentially beneficial properties.
Recall the SARSA update rule:
$$Q(S,A) \leftarrow Q(S,A) + \alpha \left[ R + \gamma Q(S', A') - Q(S,A) \right]$$

The term $Q(S', A')$ depends on the specific action $A'$ that was sampled from the policy $\pi$ in state $S'$. This sampling introduces randomness (variance) into the update target. If the policy is highly stochastic, or if certain actions have significantly different Q-values, this variance can slow down or destabilize learning.
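As a point of reference, here is a minimal sketch of a single tabular SARSA update. It assumes `Q` is a NumPy array indexed by `(state, action)` and that `alpha` and `gamma` are the step size and discount factor; these names and the array layout are illustrative assumptions, not a fixed API.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One tabular SARSA update (sketch).

    Q      : 2D array of shape (n_states, n_actions), assumed layout
    a_next : the action A' actually sampled from the current policy in s_next,
             so the target R + gamma * Q[s_next, a_next] varies with that sample
    """
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```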
Expected SARSA aims to reduce this variance by replacing the single sampled next action-value $Q(S', A')$ with the expected value of the next action-value, averaged over all possible next actions $a'$ according to the current policy $\pi(a' \mid S')$.
The update rule for Expected SARSA is:
$$Q(S,A) \leftarrow Q(S,A) + \alpha \left[ R + \gamma \sum_{a'} \pi(a' \mid S') \, Q(S', a') - Q(S,A) \right]$$

Let's break down the target value $R + \gamma \sum_{a'} \pi(a' \mid S') Q(S', a')$:

- $R$: the immediate reward received after taking action $A$ in state $S$.
- $\gamma$: the discount factor applied to future value.
- $\sum_{a'} \pi(a' \mid S') Q(S', a')$: the expected action-value of the next state $S'$ under the current policy $\pi$, averaging over every possible next action $a'$ rather than relying on a single sampled one.
Essentially, instead of waiting to see which specific action $A'$ the policy chooses next and using its Q-value, Expected SARSA considers all possible next actions and their probabilities according to the current policy to compute a smoother, averaged target.
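A sketch of this update in the same tabular setting might look as follows. The `epsilon_greedy_probs` helper and the array-based `Q` are illustrative assumptions; any policy that yields a probability vector over next actions would work.

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon=0.1):
    """Action probabilities of an (assumed) epsilon-greedy policy for one state."""
    n_actions = len(q_row)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_row)] += 1.0 - epsilon
    return probs

def expected_sarsa_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, epsilon=0.1):
    """One tabular Expected SARSA update (sketch): the target averages
    Q(S', a') over all actions a', weighted by pi(a' | S')."""
    pi_next = epsilon_greedy_probs(Q[s_next], epsilon)   # pi(a' | S')
    expected_q = np.dot(pi_next, Q[s_next])              # sum_a' pi(a'|S') Q(S', a')
    td_target = r + gamma * expected_q
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```

Note that no sampled next action appears in the update itself; the policy distribution over $S'$ replaces it.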
Expected SARSA shares similarities with both SARSA and Q-learning but has distinct characteristics:
Here's a quick comparison:
| Feature | SARSA | Q-Learning | Expected SARSA |
|---|---|---|---|
| Update Target | $R + \gamma Q(S', A')$ | $R + \gamma \max_{a'} Q(S', a')$ | $R + \gamma \sum_{a'} \pi(a' \mid S') Q(S', a')$ |
| Policy Type | On-policy | Off-policy | On-policy |
| Basis for Next Value | Action $A'$ actually taken under $\pi$ | Action yielding maximum Q-value in $S'$ | Expected value over all actions $a'$ under $\pi$ in $S'$ |
| Target Variance | Higher (uses sampled $A'$) | Lower (uses max) | Generally lower than SARSA (uses expectation) |
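To make the table concrete, the sketch below computes all three update targets for a single observed transition, again assuming a tabular `Q` array and a known probability vector `pi_next` over next actions (both illustrative assumptions).

```python
import numpy as np

def td_targets(Q, r, s_next, a_next, pi_next, gamma=0.99):
    """Compute the three targets from the table for one transition (sketch).

    pi_next : probability vector pi(a' | s_next) over actions in s_next
    a_next  : the action actually sampled in s_next (used only by SARSA)
    """
    sarsa_target = r + gamma * Q[s_next, a_next]             # sampled A'
    q_learning_target = r + gamma * np.max(Q[s_next])        # greedy max over a'
    expected_sarsa_target = r + gamma * pi_next @ Q[s_next]  # expectation under pi
    return sarsa_target, q_learning_target, expected_sarsa_target
```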
Expected SARSA often provides a good balance, retaining the on-policy nature of SARSA while achieving variance reduction similar to Q-learning, sometimes leading to improved performance stability. It's a valuable alternative to consider within the family of TD control algorithms.