In single-agent reinforcement learning, we typically operate under the assumption of a stationary environment. This means the rules governing transitions between states (P(s′∣s,a)) and the rewards received (R(s,a)) remain fixed over time. An agent interacts with this stable world, learning how its actions influence outcomes. This stationarity is a fundamental assumption underpinning the convergence guarantees of many RL algorithms like Q-learning and DQN.
However, this assumption breaks down dramatically in multi-agent settings. When multiple agents are learning simultaneously within the same environment, the world becomes non-stationary from the perspective of any individual agent.
Consider an agent i among N agents. Its goal is to learn an optimal policy πi. The environment's state s transitions to s′ based on the joint action a=(a1,a2,...,aN) taken by all agents, according to the true transition probability P(s′∣s,a). Agent i receives a reward ri, which in general also depends on the joint action, Ri(s,a).
Agent i observes the state s, selects its action ai according to its current policy πi(ai∣s), receives reward ri, and observes the next state s′. From agent i's local viewpoint, the transition seems to depend only on its own action ai. However, the actual transition probability it experiences, let's call it Pi(s′∣s,ai), is determined by averaging over the actions a−i=(a1,...,ai−1,ai+1,...,aN) taken by all other agents according to their policies π−i:
$$P_i(s' \mid s, a_i) = \sum_{a_{-i}} P(s' \mid s, a_i, a_{-i}) \prod_{j \neq i} \pi_j(a_j \mid s)$$

Similarly, the expected reward agent i perceives for taking action ai in state s, denoted Ri(s,ai), depends on what the other agents do:

$$R_i(s, a_i) = \sum_{a_{-i}} R_i(s, a_i, a_{-i}) \prod_{j \neq i} \pi_j(a_j \mid s)$$

Here lies the core problem: as the other agents j≠i learn and update their policies πj, the effective transition probabilities Pi(s′∣s,ai) and expected rewards Ri(s,ai) that agent i experiences change over time. The environment, from agent i's perspective, is no longer stationary. The Markov property, which states that the future is independent of the past given the present state (and action), is effectively violated from the local viewpoint of agent i because the underlying dynamics depend on hidden variables: the changing policies of other agents.
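To make the two expressions above concrete, here is a minimal sketch in Python, assuming a toy two-agent, two-state, two-action Markov game with randomly generated joint dynamics. The function name effective_dynamics and the example policies pi2_old and pi2_new are invented for illustration; the marginalization itself follows the equations directly.

```python
import numpy as np

# Toy 2-agent, 2-state, 2-action Markov game with randomly generated dynamics.
# Shapes: P[s, a1, a2, s'] is the joint transition probability and
#         R1[s, a1, a2]   is agent 1's reward for the joint action.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(2), size=(2, 2, 2))   # last axis sums to 1 over s'
R1 = rng.uniform(-1.0, 1.0, size=(2, 2, 2))

def effective_dynamics(P, R1, pi2):
    """Marginalize out agent 2's action under its policy pi2[s, a2], giving the
    transition kernel and expected reward that agent 1 actually experiences."""
    # P1[s, a1, s'] = sum_{a2} P[s, a1, a2, s'] * pi2[s, a2]
    P1 = np.einsum('sabn,sb->san', P, pi2)
    # R1_eff[s, a1] = sum_{a2} R1[s, a1, a2] * pi2[s, a2]
    R1_eff = np.einsum('sab,sb->sa', R1, pi2)
    return P1, R1_eff

pi2_old = np.array([[0.9, 0.1], [0.9, 0.1]])   # agent 2 mostly plays action 0
pi2_new = np.array([[0.1, 0.9], [0.1, 0.9]])   # after learning, mostly action 1

P1_old, _ = effective_dynamics(P, R1, pi2_old)
P1_new, _ = effective_dynamics(P, R1, pi2_new)
print("max shift in agent 1's perceived kernel:", np.abs(P1_new - P1_old).max())
```

The final comparison previews the central issue of this section: once agent 2's policy changes, the transition kernel and rewards that agent 1 experiences change as well, even though the underlying game P and R1 never moved.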
For instance, suppose Agent A and Agent B are learning side by side. Agent B experiences changing environment dynamics (transition probabilities) because Agent A is simultaneously updating its policy: what worked for Agent B at time t might not work at time t+1.
This non-stationarity poses significant challenges for learning:
Standard single-agent algorithms, like Q-learning, rely heavily on the stationarity of the environment to guarantee convergence to optimal value functions or policies. When the target Q-values used in the Bellman update are constantly shifting because other agents' policies are changing, the learning process can become unstable. The algorithm might oscillate, diverge, or settle on a suboptimal policy that is merely a compromise against opponent behaviors that no longer occur. It is like trying to hit a target that moves in response to your previous attempts.
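As a small illustration of this moving-target effect, the sketch below runs a bandit-style independent Q-learner against an opponent whose policy is simply scripted to flip every 500 steps, standing in for the opponent's own learning. The payoff matrix, schedule, and hyperparameters are made up for the example.

```python
import numpy as np

# Independent Q-learning in a stateless 2-action matrix game (illustrative payoffs).
# Agent 1's reward depends on the joint action, so its Q-targets move whenever
# agent 2's policy moves.
payoff_1 = np.array([[ 1.0, -1.0],    # rows: agent 1's action, cols: agent 2's action
                     [-1.0,  1.0]])

rng = np.random.default_rng(1)
Q1 = np.zeros(2)      # agent 1's action-value estimates
alpha = 0.1           # learning rate
eps = 0.1             # exploration rate

def agent2_policy(t):
    # Stand-in for agent 2's learning: its action-0 probability flips every 500 steps.
    p0 = 0.9 if (t // 500) % 2 == 0 else 0.1
    return np.array([p0, 1.0 - p0])

snapshots = {}
for t in range(2000):
    a1 = rng.integers(2) if rng.random() < eps else int(Q1.argmax())
    a2 = rng.choice(2, p=agent2_policy(t))
    r1 = payoff_1[a1, a2]
    Q1[a1] += alpha * (r1 - Q1[a1])          # bandit-style update (no next state)
    if t in (499, 999, 1499, 1999):
        snapshots[t] = Q1.copy()

# Q1 never settles: the estimates keep chasing whichever policy agent 2 currently uses.
print({t: np.round(q, 2) for t, q in snapshots.items()})
```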
Techniques like Experience Replay, which were essential for stabilizing DQN, store past transitions (s,a,r,s′) in a replay buffer and sample from it to train the Q-network. This works well in stationary environments because old experiences remain valid representations of the environment's dynamics. In MARL, however, a transition collected while the other agents were following their earlier policies π−i may be drastically different from what would happen under their current, updated policies. Replaying outdated experiences can introduce significant bias and hinder learning, as the agent trains on misleading information about how the multi-agent system currently behaves.
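One commonly studied mitigation, sketched below under assumed names, is to store a "fingerprint" of the other agents' learning progress (for example, the training iteration at which a transition was collected) alongside the transition itself, so that stale experience can be identified and filtered or down-weighted when sampling. The class name FingerprintReplayBuffer and the max_staleness threshold are illustrative, not part of any specific library.

```python
import random
from collections import deque

class FingerprintReplayBuffer:
    """Replay buffer that tags each transition with a fingerprint of the other
    agents' learning progress and drops overly stale samples at sampling time."""

    def __init__(self, capacity=100_000, max_staleness=5_000):
        self.buffer = deque(maxlen=capacity)
        self.max_staleness = max_staleness   # illustrative threshold

    def add(self, state, action, reward, next_state, others_version):
        # others_version: e.g. the training iteration of the other agents
        # at the moment this transition was collected.
        self.buffer.append((state, action, reward, next_state, others_version))

    def sample(self, batch_size, current_version):
        # Keep only transitions generated under recent-enough opponent policies.
        fresh = [t for t in self.buffer
                 if current_version - t[-1] <= self.max_staleness]
        return random.sample(fresh, min(batch_size, len(fresh)))
```

A softer variant of the same idea conditions the Q-network on the fingerprint instead of discarding data, letting the network learn how the observed dynamics depend on the other agents' stage of training.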
Assigning credit or blame for outcomes becomes much harder. Did agent i receive a low reward because its own action ai was poor, or because another agent j took an action aj that interfered or failed to coordinate effectively? Disentangling the contributions of different agents to the collective outcome is a complex problem exacerbated by the changing behavior of others.
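One classic heuristic for this credit assignment problem is the difference reward: compare the team reward actually received with a counterfactual reward in which agent i's action is replaced by a fixed default action. The sketch below is a minimal illustration; the team_reward function and the default action are invented for the example.

```python
# Minimal sketch of difference rewards for multi-agent credit assignment.
def team_reward(state, joint_action):
    # Hypothetical team objective: all agents are rewarded for picking the same action.
    return 1.0 if len(set(joint_action)) == 1 else 0.0

def difference_reward(state, joint_action, agent_idx, default_action=0):
    # D_i = G(joint action) - G(joint action with agent i's action set to a default).
    counterfactual = list(joint_action)
    counterfactual[agent_idx] = default_action
    return team_reward(state, joint_action) - team_reward(state, tuple(counterfactual))

joint = (0, 0, 1)   # agents 0 and 1 coordinate on the default action, agent 2 deviates
print([difference_reward(None, joint, i) for i in range(3)])   # [0.0, 0.0, -1.0]
```

Here the counterfactual singles out agent 2: replacing its action with the default would have produced the team reward, so agent 2 receives the blame while agents 0 and 1 do not.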
Addressing this non-stationarity is a central theme in MARL research. Many advanced MARL algorithms, which we will explore shortly, incorporate specific mechanisms to mitigate these issues, often by allowing agents to access more information during training than they will have during execution (the Centralized Training with Decentralized Execution paradigm) or by attempting to model or anticipate the behavior of other agents.
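As a structural sketch of the CTDE idea, in the style of centralized-critic methods such as MADDPG or COMA, each actor below conditions only on its own local observation (which is all that is available at execution time), while the critic used during training consumes every agent's observation and action. The dimensions and layer sizes are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

N_AGENTS, OBS_DIM, ACT_DIM = 3, 8, 4   # illustrative sizes

class Actor(nn.Module):
    """Decentralized actor: maps a local observation to action logits."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, ACT_DIM))
    def forward(self, obs):
        return self.net(obs)

class CentralCritic(nn.Module):
    """Centralized critic: sees all observations and all actions during training."""
    def __init__(self):
        super().__init__()
        in_dim = N_AGENTS * (OBS_DIM + ACT_DIM)
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))
    def forward(self, all_obs, all_actions):
        return self.net(torch.cat([all_obs, all_actions], dim=-1))

actors = [Actor() for _ in range(N_AGENTS)]
critic = CentralCritic()

obs = torch.randn(N_AGENTS, OBS_DIM)                      # one local observation per agent
acts = torch.stack([a(o).softmax(-1) for a, o in zip(actors, obs)])
q_value = critic(obs.flatten().unsqueeze(0), acts.flatten().unsqueeze(0))
```

Because the critic conditions on the joint action, the other agents' behavior is no longer a hidden variable during training, which gives the critic a stationary learning target; at execution time each agent keeps only its own actor.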