Different reinforcement learning algorithms construct their update targets in different ways. SARSA, for instance, is an on-policy method that learns from the actual next action $A'$ taken according to the current policy $\pi$. Q-learning, by contrast, is an off-policy method that learns from the greedy next action, $\max_{a'} Q(S', a')$. A related on-policy algorithm, Expected SARSA, offers a slight variation on these methods with potentially beneficial properties.
Recall the SARSA update rule:
$$Q(S,A) \leftarrow Q(S,A) + \alpha \left[ R + \gamma Q(S', A') - Q(S,A) \right]$$

The term $Q(S', A')$ depends on the specific action $A'$ that was sampled from the policy $\pi$ in state $S'$. This sampling introduces randomness (variance) into the update target. If the policy is highly stochastic, or if certain actions have significantly different Q-values, this variance can slow down or destabilize learning.
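As a point of reference, here is a minimal sketch of this tabular SARSA update in Python. The array `Q`, the hyperparameters `alpha` and `gamma`, and the `epsilon_greedy` helper are illustrative assumptions rather than anything prescribed above; the point to notice is that the target depends on the single sampled action `a_next`.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Sample an action epsilon-greedily from the current Q estimates (assumed behavior policy)."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: uniformly random action
    return int(np.argmax(Q[state]))            # exploit: current greedy action

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """One tabular SARSA step: the target uses the sampled next action a_next."""
    target = r + gamma * Q[s_next, a_next]     # R + gamma * Q(S', A')
    Q[s, a] += alpha * (target - Q[s, a])
```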
Expected SARSA aims to reduce this variance by replacing the single sampled next action-value $Q(S', A')$ with the expected value of the next action-value, averaged over all possible next actions $a'$ according to the current policy $\pi(a' \mid S')$.
The update rule for Expected SARSA is:
$$Q(S,A) \leftarrow Q(S,A) + \alpha \left[ R + \gamma \sum_{a'} \pi(a' \mid S')\, Q(S', a') - Q(S,A) \right]$$

Let's break down the target value $R + \gamma \sum_{a'} \pi(a' \mid S') Q(S', a')$:

- $R$ is the immediate reward received after taking action $A$ in state $S$.
- $\gamma$ is the discount factor applied to future value.
- $\sum_{a'} \pi(a' \mid S') Q(S', a')$ is the expected action-value in the next state $S'$: each $Q(S', a')$ is weighted by the probability the current policy assigns to choosing $a'$.
Essentially, instead of waiting to see which specific action $A'$ the policy chooses next and using its Q-value, Expected SARSA considers all possible next actions and their probabilities according to the current policy to compute a smoother, averaged target.
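A minimal sketch of this update, assuming a tabular `Q` array and an explicit policy table `pi` where `pi[s]` holds the action probabilities in state `s` (for example, derived from an epsilon-greedy rule), could look like this; the names are illustrative, not fixed by the text:

```python
import numpy as np

def expected_sarsa_update(Q, pi, s, a, r, s_next, alpha, gamma):
    """One tabular Expected SARSA step: the target averages Q(S', .) under pi."""
    expected_q = np.dot(pi[s_next], Q[s_next])   # sum_a' pi(a'|S') * Q(S', a')
    target = r + gamma * expected_q              # R + gamma * expected next value
    Q[s, a] += alpha * (target - Q[s, a])
```

Unlike the SARSA sketch earlier, no next action needs to be sampled before performing the update; the policy's probabilities take the place of the sampled $A'$.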
Expected SARSA shares similarities with both SARSA and Q-learning but has distinct characteristics.
Here's a quick comparison:
| Feature | SARSA | Q-Learning | Expected SARSA |
|---|---|---|---|
| Update Target | $R + \gamma Q(S', A')$ | $R + \gamma \max_{a'} Q(S', a')$ | $R + \gamma \sum_{a'} \pi(a' \mid S') Q(S', a')$ |
| Policy Type | On-policy | Off-policy | On-policy |
| Basis for Next-Action Value | The action $A'$ actually sampled from $\pi$ in $S'$ | The action with the maximum Q-value in $S'$ | Expectation over all actions $a'$ under $\pi$ in $S'$ |
| Target Variance | Higher (uses the sampled $A'$) | Lower (deterministic max over $a'$) | Generally lower than SARSA (uses the expectation) |
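To make the contrast concrete, the sketch below computes all three targets for one hypothetical transition under an assumed $\epsilon$-greedy policy; the Q-values, reward, and hyperparameters are made-up illustration values.

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon):
    """Action probabilities of an epsilon-greedy policy for a single state."""
    n = len(q_row)
    probs = np.full(n, epsilon / n)          # epsilon mass spread uniformly
    probs[np.argmax(q_row)] += 1.0 - epsilon # remaining mass on the greedy action
    return probs

q_next = np.array([1.0, 3.0, 2.0])   # assumed Q(S', .) estimates
r, gamma, epsilon = 0.5, 0.9, 0.1    # assumed reward and hyperparameters
a_sampled = 2                        # suppose the policy happened to sample a' = 2

pi_next = epsilon_greedy_probs(q_next, epsilon)

sarsa_target = r + gamma * q_next[a_sampled]                 # uses the sampled A'
q_learning_target = r + gamma * q_next.max()                 # uses the max over a'
expected_sarsa_target = r + gamma * np.dot(pi_next, q_next)  # expectation under pi

print(sarsa_target, q_learning_target, expected_sarsa_target)
```

Note how the Expected SARSA target weights each $Q(S', a')$ by its policy probability rather than relying on whichever action happened to be sampled.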
Expected SARSA often provides a good balance, retaining the on-policy nature of SARSA while achieving variance reduction similar to Q-learning, sometimes leading to improved performance stability. It's a valuable alternative to consider within the family of TD control algorithms.