Independent learning approaches treat other agents as part of the environment, which leads to non-stationarity, while value decomposition methods often require specific cooperative structures. We need a more general approach that can handle mixed cooperative-competitive scenarios and continuous action spaces. Enter the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm.
MADDPG builds upon the single-agent Deep Deterministic Policy Gradient (DDPG) algorithm and cleverly adapts the Centralized Training with Decentralized Execution (CTDE) paradigm. The core idea is to learn a centralized critic for each agent that considers the observations and actions of all agents, while each agent maintains its own decentralized actor that only uses local observations for execution.
Imagine $N$ agents interacting in an environment. In MADDPG:
Decentralized Actors: Each agent $i$ possesses its own actor network $\mu_i$, parameterized by $\theta_i$. This actor takes the agent's local observation $o_i$ as input and outputs a deterministic action $a_i = \mu_i(o_i; \theta_i)$. This is identical to the actor in DDPG and allows for decentralized execution, as each agent only needs its own observation to decide its action.
Centralized Critics: Each agent $i$ also has a corresponding critic network $Q_i$, parameterized by $\phi_i$. Unlike the DDPG critic, which only takes the agent's own state and action, the MADDPG critic $Q_i$ takes as input the joint observations (or state) of all agents $x = (o_1, o_2, \ldots, o_N)$ and the joint actions taken by all agents $a = (a_1, a_2, \ldots, a_N)$. It outputs the estimated Q-value for agent $i$: $Q_i(x, a; \phi_i)$. (A minimal network sketch of both components appears below.)
This centralized critic is the key to addressing non-stationarity. Since $Q_i$ observes everyone's actions, the learning target for the critic remains stable even if other agents' policies $\mu_j$ ($j \neq i$) are changing during training. The critic learns the value of agent $i$'s action in the context of what all other agents are doing.
Overview of the MADDPG architecture: each agent $i$ uses its local observation $o_i$ to generate an action $a_i$ via its actor $\mu_i$. During training, experiences are stored in a replay buffer. Batches are sampled, and the joint information (all observations $x$, all actions $a$) is fed into each agent's centralized critic $Q_i$. The critic $Q_i$ is used to train the actor $\mu_i$ via policy gradients.
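To make the actor/critic asymmetry concrete, here is a minimal PyTorch sketch of the two networks. The class names, layer sizes, and the assumption of flat observation and action vectors are illustrative choices, not part of any reference implementation.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized actor mu_i: maps agent i's local observation to a continuous action."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # assume actions bounded in [-1, 1]
        )

    def forward(self, obs_i):
        return self.net(obs_i)


class CentralizedCritic(nn.Module):
    """Centralized critic Q_i: scores agent i's return given ALL observations and actions."""
    def __init__(self, joint_obs_dim, joint_act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_obs_dim + joint_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, joint_obs, joint_act):
        # joint_obs = concat(o_1, ..., o_N), joint_act = concat(a_1, ..., a_N)
        return self.net(torch.cat([joint_obs, joint_act], dim=-1))
```

For example, with three agents, each having an 8-dimensional observation and a 2-dimensional action, each critic would see a 24 + 6 = 30-dimensional input, which illustrates why the critic's input grows with the number of agents.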
Training follows the actor-critic template, extending DDPG's update rules to the multi-agent scenario using the centralized critics. We maintain target networks for both actors ($\mu_i'$) and critics ($Q_i'$) for stability, updating them slowly using Polyak averaging, just like in DDPG.
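The soft (Polyak) update itself is only a few lines. The sketch below assumes PyTorch modules and a small rate such as tau = 0.01; the exact value is a tuning choice, not something prescribed by the algorithm.

```python
import torch

@torch.no_grad()
def polyak_update(target_net, online_net, tau=0.01):
    # theta_target <- tau * theta_online + (1 - tau) * theta_target
    for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p_online)
```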
Each critic $Q_i$ is updated by minimizing a standard Mean Squared Bellman Error (MSBE) loss. We sample a minibatch of transitions $(x, a, r, x')$ from the shared replay buffer $\mathcal{D}$, where $x'$ represents the joint next observations and $r = (r_1, \ldots, r_N)$ contains the rewards for each agent. The target value $y_i$ for agent $i$'s critic is calculated using the target networks:
$$y_i = r_i + \gamma \, Q_i'(x', a_1', \ldots, a_N'; \phi_i')\big|_{a_j' = \mu_j'(o_j')}$$

Here, the next actions $a_j'$ are computed using the target actors $\mu_j'$ based on the next local observations $o_j'$ contained within $x'$. The loss function for critic $i$ is then:
$$L(\phi_i) = \mathbb{E}_{(x, a, r, x') \sim \mathcal{D}}\left[\left(Q_i(x, a; \phi_i) - y_i\right)^2\right]$$

This loss is minimized using gradient descent. The crucial part is that $Q_i'$ uses the actions from all target actors $\mu_j'$, ensuring the target calculation is consistent with the joint behavior anticipated by the target policies.
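Putting the target computation and the MSBE loss together, one critic update for agent $i$ might look like the sketch below. The batch layout (per-agent lists of [batch, dim] tensors, with rewards shaped [batch, 1]) and the names actors_target, critic_i, and so on are assumptions made for illustration; terminal-state masking is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def critic_update(i, batch, actors_target, critic_i, critic_i_target,
                  critic_optim_i, gamma=0.95):
    """One MSBE gradient step for agent i's centralized critic."""
    obs_list, act_list, rew_list, next_obs_list = batch  # per-agent lists of [B, dim] tensors

    with torch.no_grad():
        # Next joint action from the TARGET actors: a'_j = mu'_j(o'_j) for every agent j.
        next_act_list = [mu_t(o_next) for mu_t, o_next in zip(actors_target, next_obs_list)]
        x_next = torch.cat(next_obs_list, dim=-1)
        a_next = torch.cat(next_act_list, dim=-1)
        # y_i = r_i + gamma * Q'_i(x', a'_1, ..., a'_N); rewards are shaped [B, 1].
        y_i = rew_list[i] + gamma * critic_i_target(x_next, a_next)

    # Current estimate Q_i(x, a) and the mean squared Bellman error.
    x = torch.cat(obs_list, dim=-1)
    a = torch.cat(act_list, dim=-1)
    loss = F.mse_loss(critic_i(x, a), y_i)

    critic_optim_i.zero_grad()
    loss.backward()
    critic_optim_i.step()
    return loss.item()
```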
Each actor $\mu_i$ aims to produce actions that maximize its expected return, as estimated by its corresponding centralized critic $Q_i$. The policy gradient for actor $i$ is derived similarly to DDPG, but uses the multi-agent critic $Q_i$:
$$\nabla_{\theta_i} J(\theta_i) \approx \mathbb{E}_{x \sim \mathcal{D},\, a \sim \mu}\left[\nabla_{\theta_i} \mu_i(o_i)\, \nabla_{a_i} Q_i(x, a_1, \ldots, a_N; \phi_i)\big|_{a_i = \mu_i(o_i)}\right]$$

Let's unpack this gradient: the joint observations $x$ are sampled from the replay buffer, and agent $i$'s entry in the joint action is replaced by the output of its current actor, $a_i = \mu_i(o_i)$. The term $\nabla_{a_i} Q_i(\cdot)$ measures how the centralized critic's estimate changes as agent $i$'s action changes while the other agents' actions are held fixed, and $\nabla_{\theta_i} \mu_i(o_i)$ chains that signal back through the actor to its parameters $\theta_i$.
This update rule effectively pushes the actor $\mu_i$ to output actions that the centralized critic $Q_i$ deems better, considering the joint context. Note that computing this gradient only requires knowledge of $x$ and the current policies $\mu_j$ of all agents; it doesn't require knowing other agents' critic parameters.
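A matching sketch of the actor update: rebuild the joint action with agent $i$'s action taken from its current actor, keep the other agents' sampled actions fixed, and ascend the critic's estimate. The function and variable names mirror the previous sketch and are likewise illustrative.

```python
import torch

def actor_update(i, batch, actor_i, critic_i, actor_optim_i):
    """One policy-gradient step for agent i's actor against its centralized critic."""
    obs_list, act_list, _, _ = batch

    # Replace agent i's stored action with mu_i(o_i) so gradients flow into theta_i;
    # the other agents' actions stay as sampled from the replay buffer.
    act_for_grad = list(act_list)
    act_for_grad[i] = actor_i(obs_list[i])

    x = torch.cat(obs_list, dim=-1)
    a = torch.cat(act_for_grad, dim=-1)

    # Ascend Q_i(x, a_1, ..., a_N)|_{a_i = mu_i(o_i)}  <=>  minimize its negative.
    actor_loss = -critic_i(x, a).mean()

    actor_optim_i.zero_grad()
    actor_loss.backward()
    actor_optim_i.step()
    return actor_loss.item()
```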
One challenge is that policies can change rapidly during training, potentially misleading other agents into exploiting behaviors that soon disappear. The MADDPG paper proposes an enhancement: train an ensemble of $K$ sub-policies for each agent and, at each episode, let a randomly selected sub-policy control the agent. Because every agent must then perform well against many combinations of the other agents' sub-policies, the resulting policies are more robust and less prone to exploiting the transient behaviors of specific opponents. However, this adds complexity.
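A minimal sketch of the ensemble bookkeeping, assuming $K$ sub-policies per agent and a uniformly random choice at the start of each episode; the class below is an illustration, not the paper's reference code.

```python
import random

class PolicyEnsemble:
    """Holds K sub-policies for one agent; a randomly chosen one acts for each episode."""
    def __init__(self, sub_policies):
        self.sub_policies = list(sub_policies)  # K actor networks
        self.active = None

    def new_episode(self):
        # Uniformly sample which sub-policy controls the agent this episode.
        self.active = random.choice(self.sub_policies)

    def act(self, obs_i):
        return self.active(obs_i)
```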
Advantages: MADDPG handles continuous action spaces naturally, works in cooperative, competitive, and mixed settings, and its centralized critics mitigate the non-stationarity that undermines independent learning. Because the actors use only local observations, execution remains fully decentralized.
Disadvantages: The input to each centralized critic grows with the number of agents (joint observations and actions), which limits scalability. Training also requires access to all agents' observations and actions, and extensions such as policy ensembles add further complexity.
MADDPG offers a powerful and widely used framework for multi-agent reinforcement learning, particularly effective in scenarios involving continuous actions and mixed agent interactions. By employing centralized critics during training, it elegantly sidesteps the non-stationarity problem that plagues simpler independent learning methods. While it has scalability limitations related to the centralized critic's input size, its conceptual clarity and strong performance on various benchmarks make it a significant algorithm in the MARL toolkit. Understanding MADDPG provides a solid foundation for tackling complex multi-agent problems where agents must learn coordinated or competitive strategies.