As we've seen, training agents independently in a multi-agent setting (for example, with Independent Q-Learning, IQL) faces a significant hurdle: non-stationarity. From agent i's perspective, the environment appears to change as the other agents (j ≠ i) update their policies, making it difficult for agent i's learning process to converge. It's like trying to hit a moving target that's also reacting to your attempts.
At the other extreme, a fully centralized controller that observes everything and dictates actions for all agents avoids non-stationarity, but it scales poorly and often requires unrealistic communication capabilities during execution. Imagine a central command center having to coordinate every footstep of multiple robots in real time; this is often impractical.
Centralized Training with Decentralized Execution (CTDE) offers a compelling middle ground. It aims to get the best of both worlds: leverage global information during the learning phase to stabilize training and mitigate non-stationarity, while still producing policies that allow agents to act independently using only their local observations during execution.
The core idea is simple yet effective: train using extra information, execute using local information.
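To make that contrast concrete, here is a minimal, framework-free sketch. All names and the toy update rule are illustrative assumptions, not a specific algorithm: the point is simply that the update step may consume joint information, while action selection only ever touches an agent's own observation.

```python
import random

# Toy policies: each agent i maps its local observation to an action.
# (Lookup tables here; real methods would use function approximators.)
policies = [dict() for _ in range(3)]  # 3 agents, purely illustrative

def decentralized_act(agent_id, local_obs):
    """Execution-time interface: only the agent's own observation o_i."""
    return policies[agent_id].get(local_obs, random.choice([0, 1]))

def centralized_update(joint_obs, joint_actions, team_reward):
    """Training-time interface: may see every agent's observation and action.
    This toy rule just reinforces the joint action when the reward is positive;
    a real CTDE algorithm (e.g., one with a centralized critic) is far more
    sophisticated, but the information it consumes is the same."""
    if team_reward > 0:
        for i, (obs, act) in enumerate(zip(joint_obs, joint_actions)):
            policies[i][obs] = act

# Training: global information flows into the update...
joint_obs = ("o1", "o2", "o3")
joint_actions = [decentralized_act(i, o) for i, o in enumerate(joint_obs)]
centralized_update(joint_obs, joint_actions, team_reward=1.0)

# ...but execution only ever needs local observations.
print([decentralized_act(i, o) for i, o in enumerate(joint_obs)])
```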
Information flow in a CTDE setup. During training (blue background), a centralized component uses global information to guide the learning of decentralized actors (light blue background). During execution, agents only use their local observations oi to select actions ai.
Centralized Training: During the training phase, we assume access to more information than will be available at execution time. This might include the observations and actions of all agents, or even the underlying global state s of the environment. This extra information is typically fed into a centralized component, often a critic in actor-critic architectures.
Decentralized Execution: Once training is complete, the centralized component is discarded. Each agent i deploys its learned policy πi(ai∣oi), which selects actions based only on its own local observation history oi.
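The same split appears in actor-critic instantiations of CTDE (for example, MADDPG-style methods). The sketch below is an assumption-laden illustration, not a complete algorithm: layer sizes, network shapes, and variable names are invented for clarity. The centralized critic's input concatenates every agent's observation and action, while each actor sees only its own observation; after training, the critic is discarded and only the actors are deployed.

```python
import torch
import torch.nn as nn

N_AGENTS, OBS_DIM, ACT_DIM = 3, 8, 2  # illustrative sizes

class DecentralizedActor(nn.Module):
    """Policy pi_i(a_i | o_i): input is a single agent's local observation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, ACT_DIM))

    def forward(self, local_obs):
        return torch.tanh(self.net(local_obs))  # continuous action in [-1, 1]

class CentralizedCritic(nn.Module):
    """Q(o_1..o_n, a_1..a_n): input is the joint observation-action vector.
    Used only during training and discarded at execution time."""
    def __init__(self):
        super().__init__()
        joint_dim = N_AGENTS * (OBS_DIM + ACT_DIM)
        self.net = nn.Sequential(nn.Linear(joint_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, joint_obs, joint_actions):
        return self.net(torch.cat([joint_obs, joint_actions], dim=-1))

actors = [DecentralizedActor() for _ in range(N_AGENTS)]
critic = CentralizedCritic()

# Training time: the critic scores the joint behaviour using global information.
obs = torch.randn(N_AGENTS, OBS_DIM)  # one observation per agent
acts = torch.stack([actors[i](obs[i]) for i in range(N_AGENTS)])
q_value = critic(obs.flatten(), acts.flatten())

# Execution time: each agent acts from its own observation alone.
actions = [actors[i](obs[i]) for i in range(N_AGENTS)]
```

In an actual training loop, the critic's value estimate would feed gradients back into the actors; the architectural point is simply that nothing the actors need at execution time depends on the critic being present.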
While powerful, CTDE isn't a magic bullet. The primary consideration is ensuring that the decentralized policies learned with centralized information still perform well once that extra information is removed at execution time. This often requires careful design of both the centralized training mechanism and the individual policies. Additionally, even though execution is decentralized, training still requires gathering and processing potentially large amounts of global information, which can introduce its own computational bottlenecks depending on the problem's scale and the specific algorithm.
In summary, CTDE represents a highly effective and widely adopted strategy in MARL. It tackles the non-stationarity problem head-on during training while preserving the practicality of decentralized execution, forming the basis for many state-of-the-art multi-agent algorithms.