Maximizing a shared team reward is a common goal in cooperative multi-agent settings. The Centralized Training with Decentralized Execution (CTDE) approach offers a practical method: agents learn together using potentially global information but must act independently at execution time based only on their local observations. A primary challenge within CTDE is coordinating actions effectively. When a centralized joint action-value function, $Q_{tot}(s, \mathbf{u})$, is learned (where $s$ is the global state or some representation available during training, and $\mathbf{u} = (u_1, \dots, u_n)$ is the joint action), a central question is how to extract decentralized policies that each agent can use based only on its local observation $o_i$.
Performing a global argmax over $Q_{tot}$ with respect to the joint action $\mathbf{u}$ is often computationally intractable, especially as the number of agents and actions grows. Value Decomposition Methods address this by learning individual agent value functions $Q_i(o_i, u_i)$ and then combining them in a structured way to approximate $Q_{tot}$, ensuring that maximizing $Q_{tot}$ can be achieved by each agent maximizing its own $Q_i$.
The simplest approach to value decomposition is the Value Decomposition Network (VDN). VDN assumes that the joint action-value function can be additively decomposed into individual agent value functions:

$$Q_{tot}(s, \mathbf{u}) = \sum_{i=1}^{n} Q_i(o_i, u_i)$$
Here, $n$ is the number of agents. Each $Q_i$ is typically represented by a neural network that takes agent $i$'s local observation $o_i$ and action $u_i$ as input (often, the network outputs Q-values for all possible actions given $o_i$). The global $Q_{tot}$ is simply the sum of these individual Q-values.
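To make the decomposition concrete, here is a minimal sketch in PyTorch. The class name, network sizes, and the `vdn_q_tot` helper are illustrative assumptions, not part of the original formulation:

```python
import torch
import torch.nn as nn

class AgentQNetwork(nn.Module):
    """Per-agent utility network: maps a local observation to Q-values for every action."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):                       # obs: (batch, obs_dim)
        return self.net(obs)                      # (batch, n_actions)

def vdn_q_tot(agent_nets, observations, actions):
    """Q_tot(s, u) = sum_i Q_i(o_i, u_i) for a batch of transitions.

    observations: list of per-agent tensors, each (batch, obs_dim)
    actions:      (batch, n_agents) long tensor of chosen action indices
    """
    chosen_qs = []
    for i, net in enumerate(agent_nets):
        q_all = net(observations[i])                              # Q_i(o_i, .) for all actions
        q_i = q_all.gather(1, actions[:, i:i + 1]).squeeze(1)     # Q_i(o_i, u_i)
        chosen_qs.append(q_i)
    return torch.stack(chosen_qs, dim=0).sum(dim=0)               # (batch,)
```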
How it enables CTDE: Because $Q_{tot}$ is a sum of terms that each depend on only one agent's action, maximizing $Q_{tot}$ is equivalent to each agent independently maximizing its own $Q_i$. Training can therefore use the joint value, while execution requires only each agent's local network, as illustrated below.
VDN architecture. Each agent has an independent Q-network. During training, their outputs are summed to form $Q_{tot}$ for loss calculation. During execution, each agent acts greedily based on its own $Q_i$.
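The following hedged example, reusing `AgentQNetwork` and `vdn_q_tot` from the sketch above, shows how the two phases differ: the TD loss is computed centrally on $Q_{tot}$, while action selection needs only local information. The batch layout, target networks, and epsilon-greedy policy are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def vdn_td_loss(agent_nets, target_nets, batch, gamma=0.99):
    """One TD step on Q_tot. `batch` is assumed to hold per-agent obs/next_obs lists,
    joint actions, a shared team reward (float), and a done flag (float)."""
    # Centralized training: evaluate Q_tot for the joint action actually taken.
    q_tot = vdn_q_tot(agent_nets, batch["obs"], batch["actions"])          # (batch,)

    # Target: each agent maximizes its own Q_i independently, then the maxima are summed.
    with torch.no_grad():
        next_max = [net(o).max(dim=1).values
                    for net, o in zip(target_nets, batch["next_obs"])]
        q_tot_next = torch.stack(next_max, dim=0).sum(dim=0)               # (batch,)
        target = batch["reward"] + gamma * (1 - batch["done"]) * q_tot_next

    return F.mse_loss(q_tot, target)

def decentralized_act(agent_net, obs_i, epsilon=0.05):
    """Execution: agent i needs only its own observation and its own network."""
    if torch.rand(1).item() < epsilon:
        return torch.randint(agent_net.net[-1].out_features, (1,)).item()
    return agent_net(obs_i.unsqueeze(0)).argmax(dim=1).item()
```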
Limitations: The main limitation of VDN is its restrictive assumption of additivity. It can only represent joint action-value functions where the contribution of each agent is independent of the others' actions. This prevents it from modeling more complex coordination scenarios where the value of one agent's action heavily depends on what other agents are doing.
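For instance, consider an illustrative two-agent, two-action "matching" game (a hypothetical payoff, not taken from the original text):

$$Q(A,A) = Q(B,B) = 2, \qquad Q(A,B) = Q(B,A) = 0$$

An additive form $Q_1(u_1) + Q_2(u_2)$ would need $Q_2(A) - Q_2(B) = 2$ to fit the first agent's $A$ row and $Q_2(B) - Q_2(A) = 2$ to fit its $B$ row, which is impossible, so VDN cannot represent this payoff exactly.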
QMIX (Q-Mixing) addresses the representational limitations of VDN while maintaining the convenient decentralized execution property. Instead of a simple summation, QMIX uses a mixing network to combine the individual $Q_i$ values into $Q_{tot}$.
The mixing function, represented by the mixing network, is designed to satisfy an important monotonicity constraint:

$$\frac{\partial Q_{tot}}{\partial Q_i} \geq 0, \quad \forall i \in \{1, \dots, n\}$$

This constraint ensures that if an agent increases its individual value $Q_i$, the global value $Q_{tot}$ will either increase or stay the same, but never decrease.
Why Monotonicity Matters:
The monotonicity constraint is sufficient to guarantee that a global argmax over $Q_{tot}$ yields the same result as performing individual argmax operations on each $Q_i$:

$$\arg\max_{\mathbf{u}} Q_{tot}(s, \mathbf{u}) = \begin{pmatrix} \arg\max_{u_1} Q_1(o_1, u_1) \\ \vdots \\ \arg\max_{u_n} Q_n(o_n, u_n) \end{pmatrix}$$
This means QMIX retains the ease of decentralized execution found in VDN: each agent simply picks the action maximizing its own learned $Q_i$.
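This equality can be checked numerically on a toy example. The snippet below uses a hypothetical two-agent, three-action setting with a hand-picked monotone mixer (not a learned QMIX mixer) and verifies that brute-force joint maximization agrees with independent per-agent maximization:

```python
import numpy as np

rng = np.random.default_rng(0)
q1 = rng.normal(size=3)            # Q_1(o_1, .) over 3 actions
q2 = rng.normal(size=3)            # Q_2(o_2, .) over 3 actions

# Any mixer that is monotonically increasing in each Q_i will do for this check.
def monotone_mix(a, b):
    return 2.0 * a + np.exp(0.5 * b)

# Brute-force joint argmax over all 9 joint actions...
joint = np.array([[monotone_mix(q1[u1], q2[u2]) for u2 in range(3)] for u1 in range(3)])
joint_best = np.unravel_index(joint.argmax(), joint.shape)

# ...matches the decentralized per-agent argmaxes.
assert joint_best == (q1.argmax(), q2.argmax())
```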
Mixing Network Architecture: The mixing network enforces the monotonicity constraint by having non-negative weights. It takes the outputs of the individual $Q_i$ networks as input. Crucially, the weights and biases of the mixing network itself are generated by separate hypernetworks that receive the global state $s$ as input. This allows the way the individual values are combined to depend on the overall context provided by the state, making QMIX much more expressive than VDN. Non-negativity of the weights is typically enforced by applying an absolute value (or ReLU) operation to the outputs of the weight-generating hypernetworks.
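A compact sketch of such a mixing network in PyTorch follows. The single hidden layer, ELU activation, embedding size, and use of `torch.abs` on the hypernetwork outputs follow the general recipe described above; the exact dimensions and names are assumptions, not a definitive implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QMixer(nn.Module):
    """Mixes per-agent Q-values into Q_tot with weights generated from the global state.

    Non-negative mixing weights (via torch.abs) enforce dQ_tot/dQ_i >= 0.
    """
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # Hypernetworks: map the global state to the mixing network's weights and biases.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        # Final bias comes from a small state-conditioned network (no sign constraint needed).
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        bs = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)

        # Monotone mixing: non-negative weights and monotone ELU activation.
        hidden = F.elu(torch.bmm(agent_qs.view(bs, 1, self.n_agents), w1) + b1)
        q_tot = torch.bmm(hidden, w2) + b2            # (batch, 1, 1)
        return q_tot.view(bs)                         # (batch,)
```

In a full training loop, this mixer would replace the plain summation in the VDN sketch: the per-agent chosen Q-values and the global state are fed to `QMixer`, and the TD loss is computed on the resulting $Q_{tot}$.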
QMIX architecture. Individual agent Q-networks ($Q_1$ to $Q_n$) operate decentrally. Their outputs feed into a central mixing network. The mixing network's parameters (weights and biases) are generated by hypernetworks conditioned on the global state $s$, ensuring monotonicity ($\partial Q_{tot} / \partial Q_i \geq 0$) and allowing complex, state-dependent combinations while preserving tractable decentralized execution.
Advantages over VDN: QMIX can represent a much richer class of cooperative MARL problems than VDN because the mixing network can learn complex non-linear combinations of the individual agent values, conditioned on the global state. The only structural constraint is monotonicity, which is significantly less restrictive than pure additivity.
Summary: VDN and QMIX are prominent value-based methods within the CTDE framework for cooperative MARL. Both learn individual agent Q-functions that permit decentralized execution; VDN combines them by simple summation, while QMIX mixes them with a monotonic, state-conditioned network.
These methods provide effective ways to coordinate teams of agents by factorizing the team's value function, forming a foundation for many advanced techniques in cooperative multi-agent reinforcement learning.