The most direct way to apply reinforcement learning in a multi-agent setting is to simply let each agent learn independently, treating all other agents as part of the environment dynamics. This approach is often referred to as Independent Learning. It leverages standard single-agent RL algorithms without modification, making it conceptually simple and easy to implement.
Imagine you have N agents in an environment. With independent learning, each agent i∈{1,...,N} maintains its own policy πi or value function (e.g., Q-function Qi) and learns purely based on its own experiences. Agent i's experience tuple typically consists of its local observation si, the action ai it took, the reward ri it received, and its next local observation si′.
During training, agent i updates its policy or value function using a standard algorithm, completely ignoring the fact that the other agents (j≠i) are also learning and adapting their policies πj.
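The overall training loop is therefore almost identical to single-agent RL, just repeated once per agent. Below is a minimal sketch; `env`, `agent_ids`, `make_learner`, and the dictionary-based step interface are illustrative placeholders (similar in spirit to common multi-agent environment APIs), not a specific library.

```python
# Independent learning loop: each agent trains only on its own slice of the
# joint experience and never sees the other agents' observations, actions,
# or parameters.
agents = {i: make_learner(i) for i in agent_ids}

for episode in range(num_episodes):
    obs = env.reset()                                  # dict: agent id -> local observation s_i
    done = {i: False for i in agents}
    while not all(done.values()):
        # Each agent picks an action from its own policy pi_i(a_i | s_i).
        actions = {i: agents[i].act(obs[i]) for i in agents if not done[i]}
        next_obs, rewards, done, infos = env.step(actions)
        # Each agent learns from only its own (s_i, a_i, r_i, s_i') tuple.
        for i in actions:
            agents[i].update(obs[i], actions[i], rewards[i], next_obs[i])
        obs = next_obs
```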
Two common instantiations of this idea are Independent Q-Learning (IQL) and Independent Deep Deterministic Policy Gradient (IDDPG).
In IQL, each agent i independently learns its own action-value function Qi(si,ai) using the standard Q-learning update rule or its deep learning variant, DQN. If using tabular Q-learning, the update for agent i would be:
$$Q_i(s_i, a_i) \leftarrow Q_i(s_i, a_i) + \alpha \left[ r_i + \gamma \max_{a'} Q_i(s_i', a') - Q_i(s_i, a_i) \right]$$

Here, si and si′ are agent i's current and next states (or observations), ai is its action, ri is its reward, α is the learning rate, and γ is the discount factor. If using neural networks (Independent DQN), each agent has its own DQN, trained using its own replay buffer filled with its (si, ai, ri, si′) transitions. The loss function minimizes the difference between the target Q-value and the predicted Q-value, just as in single-agent DQN.
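As a concrete illustration, here is a minimal tabular IQL agent. It is a sketch, not a library implementation: observations are assumed to be hashable (e.g., discretized tuples), and the class name, default hyperparameters, and ε-greedy exploration are illustrative choices.

```python
import numpy as np
from collections import defaultdict

class TabularIQLAgent:
    """Independent tabular Q-learning for a single agent.

    The other agents never appear here: from this agent's point of view
    they are simply part of the environment's (non-stationary) dynamics.
    """
    def __init__(self, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.Q = defaultdict(lambda: np.zeros(n_actions))   # Q_i(s_i, a_i)
        self.n_actions = n_actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, obs):
        # Epsilon-greedy over this agent's own Q-table.
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.n_actions)
        return int(np.argmax(self.Q[obs]))

    def update(self, obs, action, reward, next_obs):
        # Standard single-agent Q-learning update, built only from this
        # agent's local transition (s_i, a_i, r_i, s_i').
        td_target = reward + self.gamma * np.max(self.Q[next_obs])
        self.Q[obs][action] += self.alpha * (td_target - self.Q[obs][action])
```

An Independent DQN agent exposes the same interface; the Q-table is replaced by a neural network and the update step samples minibatches from the agent's own replay buffer.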
For environments with continuous action spaces, the DDPG algorithm can be applied independently by each agent. In IDDPG, each agent i maintains its own actor network μi(si∣θμi) and critic network Qi(si,ai∣θQi). Each agent updates its networks based on its own experiences, using the standard DDPG updates. The actor aims to maximize the expected return predicted by its critic, while the critic learns to accurately estimate the Q-value for the actor's policy. Again, each agent treats the others as static parts of the environment during its update step.
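The sketch below shows one such independent DDPG learner in PyTorch. It is a simplified illustration rather than a tuned implementation: network sizes, learning rates, and the soft-update rate τ are arbitrary, exploration noise and the replay buffer are omitted, and the arguments to update are assumed to be batched tensors with rewards and done flags of shape (batch, 1).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Deterministic policy mu_i(s_i) -> a_i in [-1, 1]^act_dim."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, act_dim), nn.Tanh(),
        )

    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    """Action-value estimate Q_i(s_i, a_i) for this agent only."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

class IDDPGAgent:
    """One independent DDPG learner; it never sees the other agents."""
    def __init__(self, obs_dim, act_dim, gamma=0.99, tau=0.005, lr=1e-3):
        self.actor, self.actor_target = Actor(obs_dim, act_dim), Actor(obs_dim, act_dim)
        self.critic, self.critic_target = Critic(obs_dim, act_dim), Critic(obs_dim, act_dim)
        self.actor_target.load_state_dict(self.actor.state_dict())
        self.critic_target.load_state_dict(self.critic.state_dict())
        self.actor_opt = torch.optim.Adam(self.actor.parameters(), lr=lr)
        self.critic_opt = torch.optim.Adam(self.critic.parameters(), lr=lr)
        self.gamma, self.tau = gamma, tau

    def update(self, obs, act, rew, next_obs, done):
        # Critic: regress toward a 1-step TD target built from this agent's
        # own local transition; the other agents' influence is hidden inside
        # next_obs and rew.
        with torch.no_grad():
            target_q = rew + self.gamma * (1.0 - done) * self.critic_target(
                next_obs, self.actor_target(next_obs))
        critic_loss = F.mse_loss(self.critic(obs, act), target_q)
        self.critic_opt.zero_grad()
        critic_loss.backward()
        self.critic_opt.step()

        # Actor: ascend the critic's estimate of Q_i(s_i, mu_i(s_i)).
        actor_loss = -self.critic(obs, self.actor(obs)).mean()
        self.actor_opt.zero_grad()
        actor_loss.backward()
        self.actor_opt.step()

        # Soft (Polyak) update of the target networks.
        with torch.no_grad():
            for net, target in ((self.actor, self.actor_target),
                                (self.critic, self.critic_target)):
                for p, tp in zip(net.parameters(), target.parameters()):
                    tp.mul_(1.0 - self.tau).add_(self.tau * p)
```

Note that, unlike the centralized-critic methods discussed later under CTDE, the critic here conditions only on the agent's own observation and action.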
Each agent learns independently, interacting with the environment and receiving its own observations and rewards. There is no direct sharing of learning updates or policy information between agents.
The simplicity of independent learning comes at a significant cost. As mentioned in the chapter introduction, the primary challenge in MARL is non-stationarity. From the perspective of any single agent i, the environment appears non-stationary because the other agents j≠i are simultaneously updating their policies.
Consider agent i's Q-learning update. The target value ri+γmaxa′Qi(si′,a′) depends on the next state si′. However, si′ is determined not only by agent i's action ai but also by the actions aj taken by all other agents j. Since the policies πj generating these actions aj are changing, the transition dynamics P(si′∣si,ai) effectively change from agent i's point of view.
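To see this explicitly, consider the fully observed case where every agent sees the state s, and write a−i for the joint action of the other agents (this notation is introduced here only for illustration; partial observability makes the expression messier but the conclusion is the same). The transition kernel agent i experiences at training iteration t is

$$P_t\left(s' \mid s, a_i\right) = \sum_{a_{-i}} \Big( \prod_{j \neq i} \pi_j^{(t)}\left(a_j \mid s\right) \Big)\, P\left(s' \mid s, a_i, a_{-i}\right)$$

Because the policies πj(t) change as the other agents learn, Pt itself changes over time: agent i is learning against a moving target.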
This violates the fundamental Markov assumption that underlies Q-learning and many other RL algorithms: that the environment's dynamics are stationary (fixed). When this assumption is broken, the convergence guarantees of the underlying single-agent algorithms no longer apply, experience stored in replay buffers becomes stale because it was generated while the other agents followed earlier policies, and learning can oscillate or diverge as each agent chases the others' moving behavior.
Advantages:

- Simplicity: standard single-agent algorithms (Q-learning, DQN, DDPG) are reused without modification, and each agent's learner is easy to implement and debug.
- Scalability: there is no joint action space to reason over; adding agents adds more independent learners rather than an exponentially larger problem.
- Decentralization: agents learn and act from local observations only, with no communication or shared information required during training.
Disadvantages:

- Non-stationarity: from each agent's perspective the environment keeps changing as the other agents learn, undermining convergence guarantees and destabilizing training.
- Stale experience: a replay buffer mixes transitions generated while the other agents followed policies they have since abandoned.
- No explicit coordination: agents cannot account for each other's learning, which makes credit assignment difficult in cooperative tasks with shared rewards and often leads to suboptimal joint behavior.
Despite the significant drawback of non-stationarity, independent learning isn't entirely without merit. It can be effective in certain scenarios:

- When agents interact only weakly, so each agent's transitions and rewards are largely determined by its own actions.
- When the other agents' policies are fixed or change slowly relative to the learning agent.
- As a simple baseline: independent learners are cheap to set up and sometimes perform surprisingly well in practice, making them a useful point of comparison for more specialized methods.
However, for most complex multi-agent problems, the non-stationarity introduced by independent learning necessitates more specialized approaches. Techniques that explicitly address agent interactions, such as parameter sharing, centralized training with decentralized execution (CTDE), value decomposition methods, or multi-agent policy gradients, are often required to achieve stable and effective learning. These methods will be explored in the subsequent sections.