Having revisited the core components of MDPs and the significance of value functions, we now turn to algorithms that learn these value functions directly from experience, without needing a predefined model of the environment's dynamics (transitions and rewards). These are known as model-free methods. Among the most foundational are Q-Learning and SARSA, both belonging to the family of Temporal Difference (TD) learning algorithms. They update value estimates based on other learned estimates, bootstrapping from the current value function. These methods operate effectively when the state and action spaces are small enough to be represented in a table, hence the term "tabular methods".
Q-Learning is a popular off-policy TD control algorithm. Its goal is to directly learn the optimal action-value function, Q∗(s,a), which represents the expected return starting from state s, taking action a, and thereafter following the optimal policy π∗. Being "off-policy" means that Q-learning learns the optimal value function even if the actions taken during learning are determined by a different, potentially exploratory policy (like epsilon-greedy).
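For reference, $Q^*$ satisfies the Bellman optimality equation (standard, though not written out in this section), which the sampled Q-learning update below approximates:

$$Q^*(s, a) = \mathbb{E}\left[\, R_{t+1} + \gamma \max_{a'} Q^*(S_{t+1}, a') \;\middle|\; S_t = s,\ A_t = a \,\right]$$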
The agent interacts with the environment, observing transitions $(S_t, A_t, R_{t+1}, S_{t+1})$. After each transition, it updates its estimate of $Q(S_t, A_t)$ using the following rule:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right]$$

Let's break down this update:
Here $\alpha$ is the step-size (learning rate) and $\gamma$ is the discount factor. The term $R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')$ is called the TD target. The difference between the TD target and the current estimate $Q(S_t, A_t)$ is the TD error, which drives the learning update.
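Written out explicitly (using the conventional symbol $\delta_t$, which is not introduced above), the TD error is

$$\delta_t = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t),$$

and each update moves $Q(S_t, A_t)$ by $\alpha\,\delta_t$.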
To ensure sufficient exploration, Q-learning is often paired with a policy like epsilon-greedy: the agent chooses a random action with probability $\epsilon$ and the action with the highest current Q-value, $\arg\max_a Q(S_t, a)$, with probability $1 - \epsilon$. Importantly, the update rule uses the $\max$ operator, so it learns about the greedy (optimal) action even if an exploratory action was actually taken.
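As a small illustration (not code from the text), epsilon-greedy selection over a NumPy table indexed as Q[state, action] might look like this:

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon=0.1):
    """With probability epsilon take a random action; otherwise act greedily."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])   # explore: uniform random action
    return int(np.argmax(Q[state]))            # exploit: argmax_a Q(state, a)
```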
Diagram illustrating the flow of information for a single Q-learning update step.
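The update step itself can be sketched as follows; the table size, hyperparameter values, and the sample transition at the end are illustrative assumptions, not something the text prescribes:

```python
import numpy as np

n_states, n_actions = 16, 4            # hypothetical problem size
Q = np.zeros((n_states, n_actions))    # table of Q(s, a) estimates

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One off-policy Q-learning update for the transition (s, a, r, s_next)."""
    td_target = r + gamma * np.max(Q[s_next])   # bootstrap from the greedy next action
    td_error = td_target - Q[s, a]              # how far the estimate is from the target
    Q[s, a] += alpha * td_error                 # move the estimate toward the target
    return td_error

# Action selection (e.g., the epsilon_greedy helper above) is up to the caller;
# this fabricated transition just shows the call shape.
q_learning_update(Q, s=3, a=1, r=-1.0, s_next=7)
```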
SARSA (State-Action-Reward-State-Action) is an alternative TD control algorithm that is on-policy. Unlike Q-learning, SARSA learns the action-value function Q(s,a) corresponding to the current policy being followed, including its exploration behavior.
The update process requires knowing the next action $A_{t+1}$ that the current policy chooses in the next state $S_{t+1}$. The quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ gives the algorithm its name. The update rule is:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$

Notice the key difference from Q-learning: instead of taking the maximum Q-value over all possible next actions ($\max_{a'} Q(S_{t+1}, a')$), SARSA uses the Q-value of the specific next action $A_{t+1}$ that the agent actually will take according to its current policy (e.g., the action chosen by the epsilon-greedy strategy for state $S_{t+1}$).
This makes SARSA sensitive to the policy being executed. If the policy is exploratory (high $\epsilon$), SARSA learns Q-values that account for the possibility of taking suboptimal exploratory actions. This can lead to a more "conservative" agent during learning than Q-learning, which always assumes the optimal action will be taken in the next step when forming its update target. In the classic cliff-walking gridworld, for example, SARSA tends to learn a path that stays away from the cliff edge, while Q-learning learns the shorter but riskier path right along it.
Diagram illustrating the flow of information for a single SARSA update step, requiring the next action $A_{t+1}$.
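A matching sketch of the SARSA update, under the same illustrative assumptions as the Q-learning snippet (NumPy, a table indexed as Q[state, action], made-up sizes and hyperparameters); the only change is that the caller supplies the next action actually selected by the behavior policy:

```python
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One on-policy SARSA update for the quintuple (s, a, r, s_next, a_next)."""
    td_target = r + gamma * Q[s_next, a_next]   # value of the action the policy will actually take
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# a_next must come from the same (e.g., epsilon-greedy) policy that produced a.
sarsa_update(Q, s=3, a=1, r=-1.0, s_next=7, a_next=2)
```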
Both Q-learning and SARSA form the bedrock of many advanced RL techniques. They work by iteratively updating value estimates stored in a table, indexed by state-action pairs. However, their reliance on this explicit table becomes a major bottleneck when dealing with problems where the number of states or actions is very large or infinite (e.g., continuous spaces), which is the topic of the next section.