In cooperative multi-agent settings, a common goal is to maximize a shared team reward. The Centralized Training with Decentralized Execution (CTDE) paradigm offers a practical approach: agents learn together using potentially global information but must act independently at execution time based only on their local observations. A significant challenge within CTDE is coordinating actions effectively. If we learn a centralized joint action-value function, Qtot(s,a), where s is the global state (or some representation available during training) and a is the joint action (a1,...,aN), how do we extract decentralized policies πi(ai∣oi) that agents can use based on their local observation oi?
Performing a global argmax over the joint action a for Qtot is often computationally intractable, because the joint action space grows exponentially with the number of agents (and with each agent's action count). Value Decomposition Methods address this by learning individual agent value functions Qi(oi, ai) and then combining them in a structured way to approximate Qtot, ensuring that maximizing Qtot can be achieved by each agent maximizing its own Qi.
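To make the tractability gap concrete, here is a small toy sketch (assuming, purely for illustration, that Qtot happens to be additive): a brute-force joint argmax must enumerate |A|^N joint actions, while the factored approach needs only N independent per-agent argmax operations. The random utilities, sizes, and variable names are illustrative only.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_actions = 4, 5

# Illustrative per-agent utilities Q_i(o_i, a_i) for one fixed set of
# observations; Qtot is assumed additive here so that both searches agree.
q_i = rng.normal(size=(n_agents, n_actions))

# Naive centralized argmax: enumerate every joint action, |A|^N = 5^4 = 625
# evaluations here, but this count grows exponentially with the agent count.
best_joint, best_val = None, -np.inf
for joint in itertools.product(range(n_actions), repeat=n_agents):
    val = sum(q_i[i, a] for i, a in enumerate(joint))
    if val > best_val:
        best_joint, best_val = joint, val

# Factored argmax: each agent maximizes its own Q_i independently, N * |A| work.
factored = tuple(int(np.argmax(q_i[i])) for i in range(n_agents))

assert factored == best_joint   # identical because Qtot decomposes per agent
print(best_joint)
```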
The simplest approach to value decomposition is the Value Decomposition Network (VDN). VDN assumes that the joint action-value function can be additively decomposed into individual agent value functions:
$$Q_{tot}(s, \mathbf{a}) = \sum_{i=1}^{N} Q_i(o_i, a_i)$$

Here, N is the number of agents. Each Qi is typically represented by a neural network that takes agent i's local observation oi and action ai as input (often, the network outputs Q-values for all possible actions ai given oi). The global Qtot is simply the sum of these individual Q-values.
How it enables CTDE: Because Qtot is a plain sum, the joint action that maximizes Qtot is found by each agent independently maximizing its own Qi based on its local observation. During centralized training, the summed Qtot is trained with a standard TD loss against the shared team reward, and gradients flow back through the sum into every agent's network; during decentralized execution, no summation is needed and each agent simply acts greedily with respect to its own Qi.
VDN architecture. Each agent has an independent Q-network. During training, their outputs are summed to form Q_tot for loss calculation. During execution, each agent acts greedily based on its own Q_i.
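To make the training/execution split concrete, here is a minimal PyTorch sketch of the VDN computation. It assumes simple feed-forward per-agent networks (the original VDN agents were recurrent), and the names AgentQNet, vdn_q_tot, and greedy_actions, as well as all layer sizes, are illustrative choices rather than the reference implementation.

```python
import torch
import torch.nn as nn

class AgentQNet(nn.Module):
    """Per-agent Q-network: maps a local observation to Q-values for all actions."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):                 # obs: (batch, obs_dim)
        return self.net(obs)                # (batch, n_actions)

def vdn_q_tot(agent_nets, obs, actions):
    """Sum the chosen per-agent Q-values to form Q_tot (training-time only)."""
    # obs: (batch, n_agents, obs_dim); actions: (batch, n_agents), long
    q_chosen = [
        net(obs[:, i]).gather(1, actions[:, i:i + 1])   # (batch, 1)
        for i, net in enumerate(agent_nets)
    ]
    return torch.cat(q_chosen, dim=1).sum(dim=1)        # (batch,)

def greedy_actions(agent_nets, obs):
    """Decentralized execution: each agent acts greedily on its own Q_i."""
    return torch.stack([net(obs[:, i]).argmax(dim=1)
                        for i, net in enumerate(agent_nets)], dim=1)
```

In a full training loop, the value returned by vdn_q_tot would be regressed toward a TD target built from the shared team reward (using target networks and the greedy next joint action); at execution time only greedy_actions is needed, and each agent uses nothing but its own observation.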
Limitations: The main limitation of VDN is its restrictive additivity assumption. It can only represent joint action-value functions in which each agent's contribution is independent of the other agents' actions, which prevents it from modeling coordination scenarios where the value of one agent's action depends strongly on what the other agents are doing. For example, a payoff that is high only when two agents choose matching actions cannot be expressed exactly as a sum of per-agent terms.
QMIX (Q-Mixing) addresses the representational limitations of VDN while maintaining the convenient decentralized execution property. Instead of a simple summation, QMIX uses a mixing network to combine the individual Qi values into Qtot.
$$Q_{tot}(s, \mathbf{a}) = f\big(\{Q_i(o_i, a_i)\}_{i=1}^{N},\, s\big)$$

The function f, represented by the mixing network, is designed to satisfy a crucial monotonicity constraint:
$$\frac{\partial Q_{tot}}{\partial Q_i} \ge 0 \quad \forall i$$

This constraint ensures that if an agent increases its individual Qi value, the global Qtot value will either increase or stay the same, but never decrease.
Why Monotonicity Matters:
The monotonicity constraint is sufficient to guarantee that a global argmax over Qtot yields the same result as performing an individual argmax on each Qi:

$$\arg\max_{\mathbf{a}} Q_{tot}(s, \mathbf{a}) = \begin{pmatrix} \arg\max_{a_1} Q_1(o_1, a_1) \\ \vdots \\ \arg\max_{a_N} Q_N(o_N, a_N) \end{pmatrix}$$
This means QMIX retains the ease of decentralized execution found in VDN, where each agent simply picks the action maximizing its own learned Qi.
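As a quick numerical illustration (not part of the algorithm itself), the toy check below builds a hand-crafted mixer with non-negative weights and an increasing nonlinearity, then verifies by brute force that the joint action maximizing Qtot coincides with the per-agent greedy actions. All values and sizes are arbitrary.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n_agents, n_actions = 3, 4

q_i = rng.normal(size=(n_agents, n_actions))   # per-agent Q_i(o_i, a_i)
w = np.abs(rng.normal(size=n_agents))          # non-negative mixing weights
b = rng.normal()

def q_tot(joint):
    # Monotonic toy mixer: non-negative weighted sum followed by an increasing
    # (ELU-like) nonlinearity, so dQtot/dQi >= 0 for every agent.
    z = sum(w[i] * q_i[i, a] for i, a in enumerate(joint)) + b
    return z if z > 0 else np.expm1(z)

joint_best = max(itertools.product(range(n_actions), repeat=n_agents), key=q_tot)
per_agent_best = tuple(int(np.argmax(q_i[i])) for i in range(n_agents))

assert joint_best == per_agent_best   # decentralized argmax recovers the global one
```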
Mixing Network Architecture: The mixing network enforces the monotonicity constraint by restricting its weights to be non-negative. It takes the outputs of the individual Qi networks as input. Crucially, the weights and biases of the mixing network itself are generated by separate hypernetworks that receive the global state s as input. This allows the way the individual Qi values are combined to depend on the overall context provided by the state, making QMIX far more expressive than VDN. Non-negativity of the weights is usually enforced by applying an absolute-value (or ReLU) activation to the outputs of the weight-generating hypernetworks.
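A simplified PyTorch sketch of such a mixing network is shown below. The two-layer mixer, the single-linear-layer hypernetworks for the weights, and the embedding size are simplifying assumptions that differ in detail from the published QMIX architecture, but the monotonicity mechanism (absolute-value weights produced by state-conditioned hypernetworks) is the same idea.

```python
import torch
import torch.nn as nn

class QMixer(nn.Module):
    """Monotonic mixing network (illustrative sketch).

    Per-agent Q-values are mixed into Q_tot using weights produced by
    hypernetworks that take the global state s as input; torch.abs keeps
    the mixing weights non-negative so that dQ_tot/dQ_i >= 0.
    """
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # Hypernetworks: map the global state s to the mixer's weights and biases.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        bs = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)  # (batch, 1, embed)
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2                              # (batch, 1, 1)
        return q_tot.view(bs)
```

Because torch.abs makes every mixing weight non-negative and ReLU is non-decreasing, the path from each agent's Q-value to Qtot is monotonic, so ∂Qtot/∂Qi ≥ 0 holds by construction; the biases do not affect monotonicity and may therefore depend on the state without restriction.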
QMIX architecture. Individual agent Q-networks (Q1 to QN) operate in a decentralized manner. Their outputs feed into a central mixing network. The mixing network's parameters (weights and biases) are generated by hypernetworks conditioned on the global state s, enforcing monotonicity (∂Qtot/∂Qi ≥ 0) and allowing complex, state-dependent combinations while preserving tractable decentralized execution.
Advantages over VDN: QMIX can represent a much richer class of cooperative MARL problems than VDN because the mixing network can learn complex non-linear combinations of the individual agent values, conditioned on the global state. The only structural constraint is monotonicity, which is significantly less restrictive than pure additivity.
Summary: VDN and QMIX are prominent value-based methods within the CTDE framework for cooperative MARL. Both learn individual agent Q-functions Qi that support decentralized execution: VDN combines them by simple summation, while QMIX combines them through a state-conditioned, monotonic mixing network.
These methods provide effective ways to coordinate teams of agents by factorizing the team's value function, forming a foundation for many advanced techniques in cooperative multi-agent reinforcement learning.