Many real-world scenarios involve not just one decision-maker, but multiple entities acting and reacting within a shared space. Think of a team of robots coordinating to assemble a product, self-driving cars navigating a busy intersection, or players competing in a complex strategy game. These are all examples of Multi-Agent Systems (MAS).
Formally, a Multi-Agent System consists of:

- A shared environment whose state evolves over time in response to the agents' actions.
- A set of two or more agents, each with its own observations, available actions, and objectives.
- Interaction: each agent's outcomes depend not only on its own actions but also on the actions chosen by the other agents.
This interaction is the defining characteristic that distinguishes multi-agent problems from single-agent ones. While we can model a single agent's task using a Markov Decision Process (MDP), the presence of other adapting agents introduces significant complications.
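To see how the single-agent MDP tuple generalizes, the sketch below outlines a Markov game (also called a stochastic game), the standard multi-agent extension of an MDP. The type and field names here are illustrative placeholders, not a library API:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# A minimal sketch of a Markov game (stochastic game): the multi-agent
# generalization of an MDP. All names below are illustrative only.
@dataclass
class MarkovGame:
    n_agents: int
    states: Sequence                          # shared state space S
    action_spaces: Sequence[Sequence]         # one action set A_i per agent
    # The transition samples s' given s and the JOINT action (a_1, ..., a_N),
    # not a single agent's action.
    transition: Callable[[object, tuple], object]
    # Each agent receives its own reward, also conditioned on the joint action.
    rewards: Sequence[Callable[[object, tuple], float]]
```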
Consider applying a standard single-agent algorithm, like Q-learning, directly to an agent within a MAS. From this agent's viewpoint, the other agents are simply part of the environment. However, if these other agents are also learning and changing their strategies (policies), the environment's dynamics are no longer stationary. The transition probabilities $P(s' \mid s, a)$ and reward function $R(s, a)$ implicitly depend on the joint action of all agents. When other agents change their policies $\pi_j(a_j \mid s)$, the effective transition probabilities and rewards experienced by our learning agent (agent $i$, taking action $a_i$) shift over time.
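To make this concrete, here is a minimal sketch of independent tabular Q-learning: each agent keeps its own Q-table and simply ignores the existence of the others. The `env` object and its `reset`/`step` interface are hypothetical stand-ins for a multi-agent environment that returns one reward per agent:

```python
import random
from collections import defaultdict

# Independent Q-learning: each agent runs ordinary tabular Q-learning and
# treats the other agents as part of the environment. `env` is a hypothetical
# multi-agent environment whose step() returns one reward per agent.
def independent_q_learning(env, n_agents, n_actions, episodes=1000,
                           alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = [defaultdict(lambda: [0.0] * n_actions) for _ in range(n_agents)]
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Each agent picks its action from its own Q-table (epsilon-greedy),
            # with no knowledge of what the other agents are about to do.
            joint_a = [
                random.randrange(n_actions) if random.random() < epsilon
                else max(range(n_actions), key=lambda a: Q[i][s][a])
                for i in range(n_agents)
            ]
            s_next, rewards, done = env.step(joint_a)
            for i in range(n_agents):
                # Standard single-agent TD update. The influence of the other
                # agents' (changing) policies is hidden inside rewards and
                # s_next, which is exactly what makes the problem non-stationary.
                target = rewards[i] + gamma * (0.0 if done else max(Q[i][s_next]))
                Q[i][s][joint_a[i]] += alpha * (target - Q[i][s][joint_a[i]])
            s = s_next
    return Q
```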
This leads to the fundamental challenge in multi-agent reinforcement learning (MARL): non-stationarity.
Imagine agent $i$ learning the value of taking action $a$ in state $s$. It receives feedback based on the outcome, which depends on what actions agents $j, k, \dots$ took simultaneously. If agents $j, k, \dots$ are improving their policies, the outcomes associated with agent $i$'s action $a$ in state $s$ will change. What seemed like a good action yesterday might be detrimental today because other agents have adapted. This violates the stationarity assumption underpinning the convergence guarantees of many single-agent RL algorithms. Learning can become unstable, slow, or fail to converge altogether.
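A tiny toy experiment can illustrate the effect. The $2 \times 2$ payoff matrix below is made up for illustration: as the other agent shifts its policy, a Monte Carlo estimate of the value of the same action drifts, even though agent $i$ and the environment itself never changed.

```python
import random

# Toy illustration of non-stationarity: agent i repeatedly evaluates the same
# action while the other agent j gradually shifts its policy.
# The payoff matrix below is illustrative only.
payoff_i = {  # reward to agent i given (a_i, a_j)
    (0, 0): 1.0, (0, 1): -1.0,
    (1, 0): 0.0, (1, 1): 0.5,
}

def estimate_value(a_i, p_j_plays_1, samples=10_000):
    """Monte Carlo estimate of agent i's value for action a_i
    against agent j's current (stochastic) policy."""
    total = 0.0
    for _ in range(samples):
        a_j = 1 if random.random() < p_j_plays_1 else 0
        total += payoff_i[(a_i, a_j)]
    return total / samples

# As agent j shifts probability mass from action 0 to action 1, the value of
# agent i's action 0 degrades even though nothing about agent i changed.
for p in (0.0, 0.5, 1.0):
    print(f"P(a_j=1)={p:.1f}  ->  Q_i(a_i=0) ~ {estimate_value(0, p):.2f}")
# Roughly: 1.00, 0.00, -1.00
```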
Interaction loop in a two-agent system. Each agent's policy updates influence the environment dynamics perceived by the other, leading to non-stationarity.
The nature of agent interactions also shapes the problem:

- Fully cooperative: all agents share a common reward and work toward the same goal.
- Fully competitive: agents have opposing objectives, often modeled as zero-sum games in which one agent's gain is another's loss.
- Mixed (general-sum): agents have partially aligned and partially conflicting interests, as in negotiation or traffic scenarios.

The payoff matrices sketched below illustrate how these reward structures differ.
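As a rough illustration, here are two-agent, two-action payoff matrices for each setting. The numbers are made up; the mixed case follows a Prisoner's Dilemma pattern.

```python
import numpy as np

# Illustrative payoff matrices (rows: agent 1's action, columns: agent 2's
# action). All numbers are made up for illustration.

# Fully cooperative: both agents receive the SAME reward.
R_coop = np.array([[4, 0],
                   [0, 2]])
r1_coop, r2_coop = R_coop, R_coop            # shared payoff

# Fully competitive (zero-sum): one agent's gain is the other's loss.
R_zero_sum = np.array([[ 1, -1],
                       [-1,  1]])
r1_comp, r2_comp = R_zero_sum, -R_zero_sum   # payoffs sum to zero

# Mixed (general-sum): interests are partially aligned, partially opposed,
# e.g. a Prisoner's Dilemma style interaction.
r1_mixed = np.array([[3, 0],
                     [5, 1]])
r2_mixed = r1_mixed.T
```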
Because single-agent methods struggle with non-stationarity and the complexities of joint action spaces, MARL has developed specialized frameworks and algorithms. This chapter examines several approaches designed to address these issues, enabling agents to learn effective strategies in shared environments. We will look at methods ranging from simple independent learning to more sophisticated techniques involving centralized training and value decomposition.
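As a preview of the value-decomposition idea, here is a minimal, VDN-style sketch in PyTorch: each agent has its own Q-network over its local observation, and the team value used during centralized training is the sum of the individual values. Network sizes and names are illustrative, not a specific library's API.

```python
import torch
import torch.nn as nn

# Minimal sketch of value decomposition (VDN-style): each agent has its own
# Q-network over its local observation; the team value used for the
# centralized training target is the sum of the individual values.
class AgentQNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)   # Q_i(o_i, .) for every action of agent i

def joint_q_value(agent_nets, observations, joint_action):
    """Q_tot = sum_i Q_i(o_i, a_i): trained centrally on the shared team
    reward, executed decentrally since each agent only needs its own Q_i."""
    q_tot = 0.0
    for net, obs, a in zip(agent_nets, observations, joint_action):
        q_tot = q_tot + net(obs)[a]
    return q_tot
```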