We've seen how Temporal-Difference (TD) methods like TD(0) can estimate the state-value function Vπ for a given policy π by bootstrapping. That is, they update the value estimate for a state based on the observed reward and the estimated value of the next state. This allows learning from each step, unlike Monte Carlo methods which wait until the end of an episode.
However, estimating Vπ isn't enough if our goal is control, that is, finding an optimal policy. For control, especially when we don't have a model of the environment, we typically need to learn action-values, Qπ(s,a): the expected return for taking action a in state s and then following policy π thereafter. Knowing the Q-values allows the agent to select the best action in each state simply by choosing the action with the highest Q-value.
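As a small illustration, suppose the action-value estimates are stored in a table with one row per state and one column per action. Picking the greedy action is then just an argmax over the actions available in the current state. The array sizes below are hypothetical, not taken from the text:

```python
import numpy as np

# A hypothetical tabular Q-value estimate: one row per state, one column per action.
n_states, n_actions = 5, 2           # example sizes, chosen for illustration
Q = np.zeros((n_states, n_actions))

def greedy_action(Q, state):
    """Pick the action with the highest current Q-value estimate in `state`."""
    return int(np.argmax(Q[state]))

print(greedy_action(Q, state=0))     # with all-zero estimates, ties break toward action 0
```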
SARSA is a TD method designed for control. It directly estimates the action-value function Qπ. The name SARSA highlights the sequence of events involved in its update calculation: the agent starts in a State (St), chooses an Action (At), receives a Reward (Rt+1) and transitions to the next State (St+1), and then selects the next Action (At+1) according to its current policy. This quintuple (St,At,Rt+1,St+1,At+1) forms the basis for the SARSA update.
Like TD(0), SARSA performs an update after each time step. The update rule for the action-value function Q(St,At) is:
$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]
$$

Let's break this down:

- α is the step size (learning rate), controlling how far the estimate moves toward the target.
- γ is the discount factor, weighting future rewards relative to immediate ones.
- Rt+1 + γQ(St+1,At+1) is the TD target: an updated estimate of the return from taking At in St.
- The bracketed term, the difference between the TD target and the current estimate Q(St,At), is the TD error.
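A direct translation of this rule into code might look like the following sketch. Here `Q` is assumed to be a tabular estimate indexable as `Q[state][action]`, and `alpha`, `gamma`, and the `done` flag are illustrative names rather than anything fixed by the text:

```python
def sarsa_update(Q, s, a, reward, s_next, a_next, done, alpha=0.1, gamma=0.99):
    """Apply one SARSA update to a tabular estimate Q (indexable as Q[state][action])."""
    # The value of a terminal state is defined to be zero.
    next_value = 0.0 if done else Q[s_next][a_next]
    td_target = reward + gamma * next_value        # R_{t+1} + gamma * Q(S_{t+1}, A_{t+1})
    td_error = td_target - Q[s][a]                 # how far the current estimate is off
    Q[s][a] += alpha * td_error                    # nudge the estimate toward the target
    return Q
```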
The update essentially nudges the current estimate Q(St,At) towards the TD target. The important detail is that the target includes Q(St+1,At+1), the estimated value of the action At+1 actually chosen by the agent's current policy in state St+1.
SARSA uses this update rule within a general control loop that aims to improve the policy over time. Since we need to learn action-values for all actions to find the best one, the policy followed must explore. A common approach is to use an ϵ-greedy policy based on the current Q-value estimates. This means the agent usually chooses the action with the highest Q-value (exploits) but occasionally chooses a random action (explores) with probability ϵ.
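One plausible way to implement such an ϵ-greedy selection over a tabular Q estimate is sketched below; the function and parameter names are chosen here for illustration:

```python
import random

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    """With probability epsilon take a random action (explore);
    otherwise take the action with the highest Q-value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[state][a])
```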
Here's the typical structure of the SARSA algorithm:
Initialization: Initialize Q(s,a) arbitrarily for all states s and actions a (e.g., to zero), with Q(terminal, ·) = 0. Choose a step size α and an exploration parameter ϵ.
Episode Loop: Repeat for each episode:
- Initialize the starting state St.
- Choose At from St using the policy derived from Q (e.g., ϵ-greedy).
- For each step of the episode, until St is terminal:
  - Take action At; observe the reward Rt+1 and next state St+1.
  - Choose At+1 from St+1 using the policy derived from Q (e.g., ϵ-greedy).
  - Apply the SARSA update to Q(St,At) using the quintuple (St,At,Rt+1,St+1,At+1).
  - Set St←St+1 and At←At+1.
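Putting these pieces together, a minimal SARSA training loop might look like the sketch below. It assumes a simplified environment interface in which `reset()` returns the initial state and `step(action)` returns `(next_state, reward, done)`; the function names and hyperparameter values are illustrative and would need to be adapted to your environment's actual API.

```python
import random
from collections import defaultdict

def sarsa(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA control with an epsilon-greedy behavior policy."""
    Q = defaultdict(lambda: [0.0] * n_actions)   # Q[state][action], initialized to zero

    def policy(state):
        # epsilon-greedy over the current Q estimates
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[state][a])

    for _ in range(episodes):
        state = env.reset()                       # S_t
        action = policy(state)                    # A_t, chosen by the current policy
        done = False
        while not done:
            next_state, reward, done = env.step(action)   # R_{t+1}, S_{t+1}
            next_action = policy(next_state)              # A_{t+1}, also from the current policy
            # SARSA update: the target uses the action actually selected next
            target = reward + (0.0 if done else gamma * Q[next_state][next_action])
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action
    return Q
```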
SARSA is classified as an on-policy algorithm. This means it learns the value function Qπ for the policy π that the agent is currently following. The policy being followed includes the exploration steps (e.g., the random actions taken by ϵ-greedy).
The on-policy nature comes directly from the update rule: Q(St+1,At+1) uses the action At+1 that was actually selected by the current policy in state St+1. The algorithm estimates the value of taking actions based on the agent's actual behavior, including its exploration strategy.
This has practical implications. If the agent explores using ϵ-greedy, SARSA learns the values associated with this ϵ-greedy policy. The resulting Q-values reflect the expected return considering that the agent might take a random, suboptimal action with probability ϵ. This often leads to finding "safer" paths in environments where exploration mistakes can be costly (e.g., falling off a cliff). The agent learns to account for its own potential missteps during exploration.
Contrast this with Monte Carlo control methods, which also learn based on the policy being followed, but only update after the episode ends. SARSA provides the benefit of TD learning (updating at each step) while still learning about the policy in use.
In the next section, we will explore Q-learning, a closely related TD control algorithm that takes a different approach: it learns the optimal policy directly, regardless of the exploration policy being followed. This makes Q-learning an off-policy method.