Distributional Reinforcement Learning Concepts

Standard Deep Q-Networks (DQN) and their variants estimate the expected discounted future return, the Q-value $Q(s, a) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R_{t+1} | S_0=s, A_0=a]$. This expectation collapses every possible outcome of taking action $a$ in state $s$ into a single scalar, which discards information about the variability and shape of the return distribution.

For example, suppose an agent chooses between two paths. Path A reliably yields a moderate reward. Path B offers a chance at a very high reward but also carries a significant risk of a large penalty. Both paths might have the same expected return, in which case a standard DQN agent is indifferent between them, even though their risk profiles are drastically different. Distributional Reinforcement Learning addresses this by directly modeling the probability distribution of the random return $Z(s, a)$, rather than just its expectation $\mathbb{E}[Z(s, a)]$.
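To make the two-path example concrete, here is a minimal sketch with made-up numbers (the returns and probabilities are assumptions chosen for illustration, not values from any benchmark). Both paths have the same mean, so an expectation-only learner cannot tell them apart, while the standard deviation exposes the difference.

```python
import numpy as np

# Hypothetical return distributions; all numbers are illustrative assumptions.
path_a_returns = np.array([5.0])          # Path A: always a moderate reward
path_a_probs = np.array([1.0])
path_b_returns = np.array([-10.0, 20.0])  # Path B: large penalty or large reward
path_b_probs = np.array([0.5, 0.5])

def mean_and_std(returns, probs):
    """Mean and standard deviation of a discrete return distribution."""
    mean = float(np.sum(probs * returns))
    var = float(np.sum(probs * (returns - mean) ** 2))
    return mean, var ** 0.5

mean_a, std_a = mean_and_std(path_a_returns, path_a_probs)
mean_b, std_b = mean_and_std(path_b_returns, path_b_probs)

print(f"Path A: mean={mean_a}, std={std_a}")  # mean 5.0, std 0.0
print(f"Path B: mean={mean_b}, std={std_b}")  # mean 5.0, std 15.0
```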

The Distributional Bellman Equation

The core idea extends the Bellman equation to distributions. Let $Z(s, a)$ be the random variable representing the return obtained by starting in state $s$, taking action $a$, and acting optimally thereafter, and let $s'$ denote the (random) next state. The standard Bellman optimality equation relates the expected values:

$$Q^*(s, a) = \mathbb{E}[R(s, a) + \gamma \max_{a'} Q^*(s', a')]$$

The distributional version relates the distributions themselves:

$$Z(s, a) \stackrel{D}{=} R(s, a) + \gamma Z(s', a'^*)$$

Here, $\stackrel{D}{=}$ signifies equality in distribution. The random return $Z(s, a)$ has the same distribution as the sum of the immediate (potentially stochastic) reward $R(s, a)$ and the discounted random return $Z(s', a'^*)$ associated with taking the optimal action $a'^*$ in the next state $s'$. The optimal next action $a'^*$ is typically chosen by maximizing the expected value of the next state's return distribution: $a'^* = \arg\max_{a'} \mathbb{E}[Z(s', a')]$. This equation provides a recursive definition for the return distribution, forming the foundation for learning algorithms.
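As a consistency check, taking expectations of both sides of the distributional equation should recover the expected-value form. The short derivation below assumes $Q^*(s, a) = \mathbb{E}[Z(s, a)]$ and uses the greedy choice of $a'^*$, so that $\mathbb{E}[Z(s', a'^*) \mid s'] = \max_{a'} Q^*(s', a')$:

$$
\mathbb{E}[Z(s, a)] = \mathbb{E}[R(s, a)] + \gamma \, \mathbb{E}\big[\mathbb{E}[Z(s', a'^*) \mid s']\big] = \mathbb{E}\big[R(s, a) + \gamma \max_{a'} Q^*(s', a')\big]
$$

The distributional equation therefore carries strictly more information: its mean satisfies the usual Bellman optimality equation, while its higher moments and shape are retained rather than averaged away.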

Representing Return Distributions

Representing and learning a potentially continuous probability distribution is challenging. Practical algorithms use approximations:

Categorical DQN (C51)

Proposed by Bellemare et al. (2017), the C51 algorithm approximates the return distribution $Z(s, a)$ using a discrete distribution supported on a fixed set of $N$ "atoms". These atoms $z_1, z_2, \dots, z_N$ are typically chosen to be equally spaced points within a plausible range of returns $[V_{MIN}, V_{MAX}]$, so that adjacent atoms are separated by $\Delta z = (V_{MAX} - V_{MIN}) / (N - 1)$.

The deep neural network, instead of outputting a single Q-value per action, outputs a probability distribution over these $N$ atoms for each action. For a given state $s$, the network outputs $N \times |\mathcal{A}|$ values, and a softmax is applied separately over the $N$ values belonging to each action $a$ to produce probabilities $p_i(s, a)$:

$$p_i(s, a) \approx \mathbb{P}(Z(s, a) = z_i) \quad \text{such that} \quad \sum_{i=1}^N p_i(s, a) = 1$$

When needed, the expected Q-value is recovered as the probability-weighted sum of the atoms: $Q(s, a) = \sum_{i=1}^N z_i p_i(s, a)$.
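The following NumPy sketch illustrates the categorical representation for a single state. The hyperparameters, the action count, and the random logits standing in for a network output are all assumptions made for illustration; only the atom grid, the per-action softmax, and the weighted sum follow the description above.

```python
import numpy as np

# Assumed hyperparameters for illustration.
N_ATOMS, V_MIN, V_MAX = 51, -10.0, 10.0
N_ACTIONS = 3  # hypothetical number of actions

# Fixed, equally spaced support z_1, ..., z_N.
atoms = np.linspace(V_MIN, V_MAX, N_ATOMS)

def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

# Random logits stand in for the network output for one state s.
rng = np.random.default_rng(0)
logits = rng.normal(size=(N_ACTIONS, N_ATOMS))

probs = softmax(logits, axis=-1)  # p_i(s, a), one distribution per action
q_values = probs @ atoms          # Q(s, a) = sum_i z_i * p_i(s, a)
greedy_action = int(np.argmax(q_values))

print(probs.shape, q_values.shape, greedy_action)
```

Because each row of `probs` is a convex combination weight vector, every recovered Q-value necessarily lies inside $[V_{MIN}, V_{MAX}]$, which is one reason the range has to be chosen to cover the plausible returns.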

Learning Update: The learning process involves applying a distributional Bellman update. For a transition $(s, a, r, s')$, the target distribution is constructed as follows:

Compute the optimal next action $a'^* = \arg\max_{a'} \sum_{j=1}^N z_j p_j(s', a')$ using the target network's output for state $s'$.
For each atom $z_j$ in the target network's distribution for $(s', a'^*)$, compute the Bellman target $\hat{\mathcal{T}} z_j = r + \gamma z_j$, clipped to $[V_{MIN}, V_{MAX}]$ so that it stays within the support. This represents a possible discounted future return shifted by the immediate reward $r$.
Since $\hat{\mathcal{T}} z_j$ may not align with the fixed atom locations $\{z_i\}$, its probability mass $p_j(s', a'^*)$ is projected onto the neighboring support atoms. Specifically, if $\hat{\mathcal{T}} z_j$ falls between $z_k$ and $z_{k+1}$, the mass $p_j(s', a'^*)$ is split linearly between $z_k$ and $z_{k+1}$, with the closer atom receiving the larger share.
The final target distribution $d'_{target}$ is obtained by summing, at each support atom, the projected contributions from all $j=1, \dots, N$.
The network is trained by minimizing the Kullback-Leibler (KL) divergence $D_{KL}(d'_{target} \,\|\, d(s, a))$ between the computed target distribution $d'_{target}$ and the predicted distribution $d(s, a) = \{p_i(s, a)\}_{i=1}^N$. Because the target is held fixed, this is equivalent to minimizing the cross-entropy between the two, which serves as the loss function.
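The projection step is the least obvious part of the update, so a small NumPy sketch may help. It handles a single transition and assumes the target-network probabilities for $(s', a'^*)$ are already available; the function names `project_target` and `cross_entropy` are hypothetical, and terminal-state handling and batching are omitted. The shifted atoms are clipped to the support range before projection, as in the C51 algorithm.

```python
import numpy as np

def project_target(next_probs, reward, gamma, atoms):
    """Project the distribution of r + gamma * z_j back onto the fixed atoms."""
    n_atoms = atoms.shape[0]
    v_min, v_max = atoms[0], atoms[-1]
    delta_z = (v_max - v_min) / (n_atoms - 1)

    # Bellman target for every atom, clipped to the support range.
    tz = np.clip(reward + gamma * atoms, v_min, v_max)

    # Fractional grid index of each shifted atom (guarded against rounding).
    b = np.clip((tz - v_min) / delta_z, 0, n_atoms - 1)
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)

    target = np.zeros(n_atoms)
    # Split each mass linearly between the two neighbouring atoms.
    # np.add.at accumulates correctly when several j map to the same index.
    np.add.at(target, lower, next_probs * (upper - b))
    np.add.at(target, upper, next_probs * (b - lower))
    # If b is an exact integer, lower == upper and both weights are zero,
    # so the full mass is assigned to that atom here.
    exact = lower == upper
    np.add.at(target, lower[exact], next_probs[exact])
    return target

def cross_entropy(target, pred_probs, eps=1e-8):
    """Cross-entropy loss; differs from KL(target || pred) by a constant."""
    return float(-np.sum(target * np.log(pred_probs + eps)))

# Tiny worked example: 5 atoms at 0, 1, 2, 3, 4 with all mass on z = 2.
atoms = np.linspace(0.0, 4.0, 5)
next_probs = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
target = project_target(next_probs, reward=0.5, gamma=0.5, atoms=atoms)
print(target)  # [0.  0.5 0.5 0.  0. ]
```

In the worked example the shifted atom lands at $0.5 + 0.5 \cdot 2 = 1.5$, halfway between the atoms at 1 and 2, so each receives half the mass and the projected mean (1.5) matches the unprojected target.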

Example probability distributions over return atoms for two different actions. Although they might have the same mean (expected Q-value), their shapes reveal different risk characteristics. Action A has higher potential returns but also higher potential losses compared to the more concentrated distribution of Action B.

Quantile Regression DQN (QR-DQN)

Proposed by Dabney et al. (2017), QR-DQN takes a different approach by modeling the quantile function (the inverse CDF) of the return distribution. Instead of fixing the return values (atoms) and learning probabilities, QR-DQN fixes cumulative probabilities $\tau_i$ and learns the corresponding return values (quantiles) $\theta_i(s, a)$.

The network outputs $N$ quantile values $\theta_1(s, a), \dots, \theta_N(s, a)$ for each action $a$. These correspond to a fixed set of $N$ quantile levels, typically the midpoints of $N$ equal-probability intervals: $\hat{\tau}_i = \frac{i - 0.5}{N}$ for $i=1, \dots, N$. Each $\theta_i(s, a)$ represents the predicted return value $z$ such that $P(Z(s, a) \le z) \approx \hat{\tau}_i$.
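As a small illustration of the quantile parameterization, the sketch below computes the midpoint levels $\hat{\tau}_i$ for a hypothetical $N = 5$ and uses `np.quantile` on samples from a standard normal (a stand-in for an unknown return distribution, chosen purely for illustration) to show the values that the outputs $\theta_i$ would ideally approximate.

```python
import numpy as np

N = 5  # hypothetical, kept small for readability

# Midpoint quantile levels: 0.1, 0.3, 0.5, 0.7, 0.9 for N = 5.
tau_hat = (np.arange(1, N + 1) - 0.5) / N

# Stand-in return samples; a real agent never sees these directly.
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Empirical quantiles at the fixed levels: the ideal values of theta_i.
theta_ref = np.quantile(samples, tau_hat)

print(tau_hat)
print(theta_ref)
print(theta_ref.mean())  # the mean of the quantile values estimates Q(s, a)
```

Averaging the $N$ quantile values with equal weight gives the Q-value estimate, since each $\theta_i$ represents an equal share $1/N$ of the probability mass.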

Learning Update: QR-DQN uses a quantile regression loss. The target values for a transition $(s, a, r, s')$ are $r + \gamma \theta_j(s', a'^*)$, where $\theta_j(s', a'^*)$ are the quantile values predicted by the target network for the optimal next action $a'^* = \arg\max_{a'} \frac{1}{N} \sum_{k=1}^N \theta_k(s', a')$. The loss penalizes the discrepancy between each predicted quantile $\theta_i(s, a)$ and the target values asymmetrically, weighting over- and under-estimation differently according to the level $\hat{\tau}_i$. In practice a smoothed variant, the Quantile Huber loss, is commonly used.
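A minimal NumPy sketch of the Quantile Huber loss for a single $(s, a)$ pair follows. The function name `quantile_huber_loss` and the default threshold `kappa=1.0` are assumptions for illustration. In a real training loop the targets would be treated as constants and gradients would flow only through the predicted quantiles; this sketch just evaluates the loss value.

```python
import numpy as np

def quantile_huber_loss(theta, target, tau_hat, kappa=1.0):
    """Quantile Huber loss for one (s, a) pair.

    theta:   (N,) predicted quantile values theta_i(s, a)
    target:  (M,) target values r + gamma * theta_j(s', a'*)
    tau_hat: (N,) quantile levels matching theta
    """
    # Pairwise errors u[i, j] = target_j - theta_i.
    u = target[None, :] - theta[:, None]
    abs_u = np.abs(u)

    # Huber loss: quadratic near zero, linear beyond kappa.
    huber = np.where(abs_u <= kappa, 0.5 * u ** 2, kappa * (abs_u - 0.5 * kappa))

    # Asymmetric weight |tau_i - 1{u < 0}|: under-estimates (u > 0) are
    # weighted by tau_i, over-estimates (u < 0) by 1 - tau_i.
    weight = np.abs(tau_hat[:, None] - (u < 0).astype(float))

    # Average over target samples j, sum over predicted quantiles i.
    return float((weight * huber / kappa).mean(axis=1).sum())

# The asymmetry for a high quantile level (tau = 0.9):
tau = np.array([0.9])
under = quantile_huber_loss(np.array([0.0]), np.array([2.0]), tau)
over = quantile_huber_loss(np.array([0.0]), np.array([-2.0]), tau)
print(under, over)  # under-estimating is penalized more heavily than over-estimating
```

For $\hat{\tau} = 0.9$, predicting too low costs nine times as much as predicting too high by the same amount, which pushes $\theta_i$ toward the value that 90% of the targets fall below.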

Later work such as Implicit Quantile Networks (IQN) learns a function that can generate quantile values for any input probability $\tau \in [0, 1]$, offering a more continuous representation of the distribution.

Advantages of the Distributional Perspective

Learning the full distribution of returns offers several benefits:

Richer Learning Signal: The distribution provides more detailed information than a single expected value, potentially leading to more stable and effective learning, especially in environments with stochastic rewards or transitions. It helps disambiguate actions with similar means but different risk profiles.
State-of-the-Art Performance: Distributional RL algorithms, particularly C51 and QR-DQN, were shown to significantly improve on DQN on challenging benchmarks like the Atari suite at the time of their publication. C51 is also one of the components of the Rainbow agent, which combined multiple DQN improvements.
Risk Sensitivity: Having the return distribution allows for explicit risk-aware decision-making. Instead of just maximizing the mean $\mathbb{E}[Z(s, a)]$, an agent could optimize for other statistics like a specific quantile (e.g., maximize the 10th percentile for risk-averse behavior) or Conditional Value-at-Risk (CVaR).
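The risk-sensitivity point can be sketched with quantile outputs. The example below uses made-up quantile values for two actions and a deliberately simple CVaR approximation (the mean of the lowest $\lceil \alpha N \rceil$ quantile values); both the numbers and the helper `cvar_from_quantiles` are assumptions for illustration, not a prescribed method.

```python
import numpy as np

def cvar_from_quantiles(theta, alpha):
    """Approximate lower-tail CVaR_alpha as the mean of the lowest
    ceil(alpha * N) quantile values along the last axis."""
    sorted_theta = np.sort(theta, axis=-1)
    k = max(1, int(np.ceil(alpha * theta.shape[-1])))
    return sorted_theta[..., :k].mean(axis=-1)

# Hypothetical quantile estimates, one row per action (N = 4).
theta = np.array([
    [4.0, 5.0, 5.0, 6.0],       # action 0: concentrated around 5
    [-10.0, 0.0, 12.0, 20.0],   # action 1: higher mean, heavy lower tail
])

mean_action = int(np.argmax(theta.mean(axis=-1)))
cvar_action = int(np.argmax(cvar_from_quantiles(theta, alpha=0.25)))

print(mean_action)  # 1: the risk-neutral agent prefers the higher mean (5.5 vs 5.0)
print(cvar_action)  # 0: the risk-averse agent avoids the -10 tail outcome
```

The two selection rules disagree on the same distributions, which is exactly the kind of choice an expectation-only agent cannot express.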

Implementing distributional RL requires modifying the network's output head to predict distributional parameters (probabilities for atoms or quantile values) and adapting the loss function (KL divergence or quantile regression loss) and the Bellman update mechanism accordingly. While adding complexity, the empirical gains and the ability to handle risk make it a significant development in deep reinforcement learning.

References

A Distributional Perspective on Reinforcement Learning, Marc G. Bellemare, Will Dabney, Rémi Munos, 2017 Proceedings of the 34th International Conference on Machine Learning, Vol. 70 (PMLR) - Introduces the categorical approach to distributional reinforcement learning and the C51 algorithm.
Implicit Quantile Networks for Distributional Reinforcement Learning, Will Dabney, Georg Ostrovski, David Silver, Rémi Munos, 2018 Proceedings of the 35th International Conference on Machine Learning, Vol. 80 (PMLR) - Extends QR-DQN by learning a continuous quantile function, allowing estimation of any quantile.
Rainbow: Combining Improvements in Deep Reinforcement Learning, Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Gheshlaghi Azar, David Silver, 2018 Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, Vol. 32 (AAAI Press) DOI: 10.1609/aaai.v32i1.11792 - Demonstrates strong empirical performance by combining several Deep Q-Network enhancements, including C51.