While policy constraint methods like BCQ directly restrict the learned policy to actions resembling those in the offline dataset, Conservative Q-Learning (CQL) takes a different approach. Instead of modifying the policy search space, CQL modifies the Q-function learning objective itself to combat the overestimation of values for out-of-distribution (OOD) actions, a primary consequence of distributional shift.
The fundamental problem arises when standard Q-learning updates, like those in DQN or SAC, use the maximum Q-value over actions in the Bellman target: $y = r + \gamma \max_{a'} Q_{\bar{\theta}}(s', a')$. If the Q-function $Q_{\bar{\theta}}$ incorrectly assigns high values to OOD actions $a'$ (actions unlikely or unseen under the behavior policy $\pi_b$ that generated the dataset $\mathcal{D}$), these errors propagate through the learning process, leading to a suboptimal final policy.
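To make this concrete, the snippet below sketches how a DQN-style target is formed for a batch of transitions; the network `q_target_net` and the tensor names are illustrative placeholders rather than a specific library API.

```python
import torch

def td_target(rewards, next_states, dones, q_target_net, gamma=0.99):
    """Standard Bellman target y = r + gamma * max_a' Q_target(s', a') (sketch)."""
    with torch.no_grad():
        next_q = q_target_net(next_states)        # (batch, num_actions)
        # The max ranges over every action, including ones the behavior policy
        # never took at s'; any overestimation of those Q-values enters the target.
        max_next_q = next_q.max(dim=1).values     # (batch,)
        return rewards + gamma * (1.0 - dones) * max_next_q
```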
CQL introduces a regularization term into the standard Q-learning loss function. This regularizer encourages the Q-function $Q_\theta(s, a)$ to assign low values to OOD actions while ensuring that the Q-values for actions actually present in the dataset $\mathcal{D}$ remain accurate or are pushed higher by the Bellman updates. This results in a "conservative" estimate of the Q-values, preventing the agent from exploiting potentially overestimated values of unseen actions.
The CQL objective typically augments a standard TD-based loss, $\mathcal{L}_{\text{TD}}(\theta)$, with a state-action value regularizer, $\mathcal{R}_{\text{CQL}}(\theta)$. The combined objective is:
$$\mathcal{L}_{\text{CQL}}(\theta) = \alpha \, \mathcal{R}_{\text{CQL}}(\theta) + \mathcal{L}_{\text{TD}}(\theta)$$
Here, $\alpha \ge 0$ is a hyperparameter that weights the conservatism penalty. $\mathcal{L}_{\text{TD}}(\theta)$ could be the mean squared Bellman error from DQN or the soft Bellman residual from SAC.
The core component is the regularizer $\mathcal{R}_{\text{CQL}}(\theta)$. A common form aims to minimize the Q-values under some proposal distribution $\mu(a \mid s)$ while maximizing the Q-values for actions sampled from the dataset $\mathcal{D}$:
$$\mathcal{R}_{\text{CQL}}(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\Big[\mathbb{E}_{a \sim \mu(a \mid s)}\big[Q_\theta(s, a)\big] - \mathbb{E}_{a \sim \pi_b(a \mid s)}\big[Q_\theta(s, a)\big]\Big]$$
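For continuous actions, the inner expectation under $\mu(a \mid s)$ cannot be enumerated, so it is usually estimated from sampled actions. Here is a minimal sketch with a uniform proposal over bounded actions; the critic signature `q_net(states, actions)` and all parameter names are assumptions for illustration.

```python
import torch

def cql_regularizer_mc(q_net, states, data_actions,
                       num_samples=10, action_low=-1.0, action_high=1.0):
    """Monte-Carlo estimate of E_{a~mu}[Q(s,a)] - Q(s, a_data) with a uniform mu (sketch)."""
    batch_size, action_dim = data_actions.shape
    # Draw `num_samples` proposal actions per state, uniform within the action bounds.
    random_actions = (torch.rand(batch_size, num_samples, action_dim)
                      * (action_high - action_low) + action_low)
    # Repeat each state for every sampled action so the critic sees matched pairs.
    states_rep = states.unsqueeze(1).expand(-1, num_samples, -1)
    q_random = q_net(states_rep.reshape(batch_size * num_samples, -1),
                     random_actions.reshape(batch_size * num_samples, -1))
    q_random = q_random.reshape(batch_size, num_samples)
    # Q-values of the actions actually stored in the dataset (the pi_b term).
    q_data = q_net(states, data_actions).reshape(batch_size)
    return (q_random.mean(dim=1) - q_data).mean()
```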
Alternatively, using a log-sum-exp formulation (which implicitly covers all actions), the regularizer can be written as:
$$\mathcal{R}_{\text{CQL}}(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\Big[\log \sum_{a} \exp\big(Q_\theta(s, a)\big) - \mathbb{E}_{a \sim \pi_b(a \mid s)}\big[Q_\theta(s, a)\big]\Big]$$
In practice, $\mathbb{E}_{a \sim \pi_b(a \mid s)}[Q_\theta(s, a)]$ is approximated using the specific action $a$ from the transition $(s, a, r, s')$ sampled from the dataset $\mathcal{D}$. The $\mathbb{E}_{a \sim \mu(a \mid s)}$ or log-sum-exp term requires evaluating Q-values for multiple actions at each state $s$: in discrete action spaces the sum can be computed exactly over all actions, while in continuous spaces the actions are typically sampled uniformly or from the agent's current policy.
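For a discrete action space, the log-sum-exp term can be computed exactly from the critic's output and combined with a standard TD loss as in the objective above. The sketch below assumes a Q-network `q_net(states)` that returns one value per action; the function and variable names, and the DQN-style TD loss, are illustrative choices.

```python
import torch
import torch.nn.functional as F

def cql_loss_discrete(q_net, q_target_net, batch, alpha=1.0, gamma=0.99):
    """Conservative Q-learning loss for discrete actions (sketch)."""
    states, actions, rewards, next_states, dones = batch          # actions: int64, shape (batch,)

    q_values = q_net(states)                                      # (batch, num_actions)
    q_data = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)  # Q(s, a) for dataset actions

    # Standard TD loss against a frozen target network (DQN-style).
    with torch.no_grad():
        next_max = q_target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_max
    td_loss = F.mse_loss(q_data, target)

    # CQL regularizer: log-sum-exp over all actions minus the dataset-action Q-value.
    cql_reg = (torch.logsumexp(q_values, dim=1) - q_data).mean()

    return alpha * cql_reg + td_loss
```

Because the log-sum-exp is dominated by the largest Q-values, minimizing this penalty pushes down whichever actions currently look best, including spuriously overestimated OOD ones, while the subtracted dataset term keeps in-distribution values from being driven down as well.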
Let's break down the regularizer's effect:
Minimizing the $\mathcal{R}_{\text{CQL}}(\theta)$ term pushes the Q-values of actions drawn from $\mu(a \mid s)$, which include OOD actions, below the Q-values of actions observed in the dataset, discouraging the learned $Q_\theta(s, a)$ from ranking unseen actions above those the data actually supports. The TD loss $\mathcal{L}_{\text{TD}}(\theta)$ simultaneously ensures these in-distribution Q-values remain consistent with the Bellman equation.
Standard Q-learning might overestimate values for out-of-distribution (OOD) actions. CQL adds a penalty that pushes down the Q-values of OOD actions while ensuring the values for actions seen in the dataset remain grounded by the TD updates.
Compared to policy constraint methods, CQL offers a different philosophy. It allows the policy to potentially represent any action but relies on the learned conservative Q-values to guide the policy towards actions supported by the data. This value-based regularization provides a powerful alternative for tackling the challenges inherent in learning from fixed datasets.