While SARSA learns the value of the policy the agent is currently following (including its exploration steps), Q-Learning takes a different, often more direct, route. Developed by Chris Watkins in 1989, Q-Learning is a foundational off-policy Temporal Difference (TD) control algorithm.
What does "off-policy" mean in this context? It means the policy being learned about (the target policy) is different from the policy used to generate the behavior (the behavior policy). Q-Learning aims directly for the optimal action-value function, Q∗(S,A), regardless of the policy the agent is actually using to explore the environment. The target policy in Q-Learning is the greedy policy with respect to the current action-value estimates, while the behavior policy can be something more exploratory, like epsilon-greedy.
The core of Q-Learning lies in its update rule. After taking action At in state St and observing the reward Rt+1 and the next state St+1, Q-Learning updates its estimate Q(St,At) using the following rule:
Q(St, At) ← Q(St, At) + α [ Rt+1 + γ max_a Q(St+1, a) − Q(St, At) ]

Let's break this down:
- α is the step-size (learning rate), controlling how far the current estimate moves toward the target.
- Rt+1 is the reward received after taking action At in state St.
- γ is the discount factor, and max_a Q(St+1, a) is the current estimate of the best value attainable from the next state St+1.
- Rt+1 + γ max_a Q(St+1, a) is the TD target, and the whole bracketed term is the TD error.

The use of the maximum over next actions allows Q-Learning to learn about the greedy policy (which would always choose the action a maximizing Q(St+1, a)) even if the agent is behaving differently (e.g., occasionally taking random actions for exploration via an epsilon-greedy policy).
Comparison of information used in the TD target calculation for SARSA and Q-Learning. SARSA uses the Q-value of the actual next state-action pair (St+1,At+1), while Q-Learning uses the maximum Q-value over all possible actions a in the next state St+1.
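The following sketch makes this difference concrete (the transition values s, a, r, s_next, a_next and the hyperparameters here are hypothetical placeholders, not taken from any particular environment):

import numpy as np

# A small hypothetical setup: 4 states, 2 actions, one observed transition.
Q = np.zeros((4, 2))
alpha, gamma = 0.1, 0.9
s, a, r, s_next, a_next = 0, 1, 1.0, 2, 0

# SARSA bootstraps from the action actually taken next (on-policy).
sarsa_target = r + gamma * Q[s_next, a_next]

# Q-Learning bootstraps from the best available action in S' (off-policy).
q_learning_target = r + gamma * np.max(Q[s_next])

# Q-Learning update applied to the visited pair (S, A).
Q[s, a] += alpha * (q_learning_target - Q[s, a])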
Here's the pseudocode for the Q-Learning algorithm:
Initialize Q(s, a) arbitrarily for all s in S, a in A(s) (e.g., Q(s, a) = 0)
Initialize Q(terminal_state, .) = 0

Loop for each episode:
    Initialize S (starting state for the episode)
    Loop for each step of the episode:
        Choose A from S using policy derived from Q (e.g., epsilon-greedy)
        Take action A, observe R, S'
        # Core Q-Learning Update
        Q(S, A) <- Q(S, A) + alpha * [R + gamma * max_a Q(S', a) - Q(S, A)]
        S <- S'  # Move to the next state
        If S is terminal, break inner loop (end episode)
    End Loop (steps)
End Loop (episodes)
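The pseudocode translates almost line for line into a tabular implementation. The sketch below is one minimal version, assuming a hypothetical environment object env with num_states and num_actions attributes, a reset() method returning the initial state, and a step(action) method returning (next_state, reward, done); it is an illustration rather than a definitive implementation:

import numpy as np

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize Q(s, a) = 0 for all pairs; terminal rows simply stay at 0.
    Q = np.zeros((env.num_states, env.num_actions))

    for _ in range(num_episodes):
        state = env.reset()                      # Initialize S
        done = False
        while not done:
            # Choose A from S using an epsilon-greedy policy derived from Q.
            if rng.random() < epsilon:
                action = int(rng.integers(env.num_actions))
            else:
                action = int(np.argmax(Q[state]))

            # Take action A, observe R and S'.
            next_state, reward, done = env.step(action)

            # Core Q-Learning update: bootstrap from the greedy value of S'.
            td_target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (td_target - Q[state, action])

            state = next_state                   # Move to the next state

    return Q

Once training finishes, a greedy policy can be read off the learned table with np.argmax(Q, axis=1).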
Algorithm Steps:
1. Initialize Q(s, a) arbitrarily for all state-action pairs, setting Q(terminal, ·) = 0.
2. At the start of each episode, observe the initial state S.
3. Choose action A from S using a policy derived from Q (e.g., epsilon-greedy).
4. Take action A and observe the reward R and next state S'.
5. Update Q(S, A) toward the TD target R + γ max_a Q(S', a) using the step size α.
6. Set S ← S'. If S is terminal, end the episode; otherwise repeat from step 3.
Q-Learning is proven to converge to the optimal action-value function Q∗ under standard stochastic approximation conditions, provided that all state-action pairs continue to be visited. Using an epsilon-greedy behavior policy where ϵ decays over time ensures sufficient exploration initially and eventual convergence towards exploiting the learned optimal policy.
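One common way to implement such a schedule (the constants here are illustrative, not prescriptive) is a multiplicative decay clipped at a small floor:

def decayed_epsilon(episode, start=1.0, floor=0.01, decay=0.995):
    # Epsilon shrinks each episode but never drops below the floor,
    # so some exploration is retained throughout training.
    return max(floor, start * decay ** episode)

# e.g., epsilon = decayed_epsilon(t) before selecting actions in episode t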
Because Q-Learning directly learns the optimal Q-function, it's often considered more sample-efficient for finding the optimal policy compared to on-policy methods like SARSA, especially when exploration is extensive. However, this direct pursuit of optimality means it doesn't account for the costs or risks associated with the exploration policy itself, which can be a consideration in safety-critical applications where SARSA might be preferred during the learning phase.