To effectively design algorithms for multiple interacting agents, we need a framework that extends the familiar Markov Decision Process (MDP) used in single-agent RL. The standard mathematical model for multi-agent sequential decision-making under uncertainty is the Stochastic Game, also commonly referred to as a Markov Game. This formalism provides the foundation for understanding and tackling the complexities introduced by agent interactions.
A Stochastic Game (SG) generalizes the MDP by incorporating multiple agents whose actions collectively influence transitions and rewards. Let's define the components formally:
- A finite set of agents, {1,2,...,N}, where N≥2 is the number of agents.
- A state space, S, representing the possible configurations of the environment. This is typically assumed to be shared or fully observable by all agents, although partially observable extensions (POSGs or Dec-POMDPs) exist for more complex scenarios.
- An action space for each agent, Ai. Each agent i selects an action ai∈Ai.
- A joint action space, A=A1×A2×...×AN. A joint action a=(a1,a2,...,aN)∈A consists of one action chosen by each agent.
- A state transition probability function, P:S×A×S→[0,1]. This function defines the probability of transitioning to the next state s′∈S, given the current state s∈S and the joint action a∈A. We write this as P(s′∣s,a). Notice the critical difference from MDPs: the next state distribution depends on the actions taken by all agents, not just one.
- A reward function for each agent, Ri:S×A×S→ℝ. Each agent i receives a reward ri based on the current state s, the joint action a, and potentially the next state s′. The reward ri=Ri(s,a,s′) is specific to agent i and generally depends on what every agent did.
- A discount factor, γ∈[0,1), shared by all agents for calculating cumulative rewards.
Just as in MDPs, agents select actions based on policies. In MARL, we deal with a joint policy π=(π1,π2,...,πN), where each πi is the policy for agent i. Agent i's policy, πi:S→Δ(Ai), maps each state to a probability distribution over its action set Ai (we write Δ(Ai) for the set of distributions over Ai to avoid overloading the symbol P). If the policies are deterministic, πi:S→Ai.
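To make these definitions concrete, here is a minimal sketch in Python/NumPy (not tied to any particular MARL library) of a tiny two-agent stochastic game: the transition table P and the reward tables are filled with arbitrary placeholder values, and each agent's stochastic policy is a table of per-state action distributions. All names here (P, R, policies, sample_joint_action) are illustrative assumptions, not notation from any library.

```python
import numpy as np

# Illustrative sizes -- a deliberately tiny game.
n_agents = 2
n_states = 2           # S = {0, 1}
n_actions = 2          # Ai = {0, 1} for each agent i
gamma = 0.95           # shared discount factor (kept for completeness)

rng = np.random.default_rng(0)

# Transition table P[s, a1, a2, s']: next-state distribution given the JOINT action.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))

# One reward table per agent, R[i][s, a1, a2, s'].
R = [rng.normal(size=(n_states, n_actions, n_actions, n_states))
     for _ in range(n_agents)]

# Stochastic policies pi_i(. | s): one action distribution per state.
policies = [rng.dirichlet(np.ones(n_actions), size=n_states)
            for _ in range(n_agents)]

def sample_joint_action(s):
    """Each agent samples its own action from pi_i(. | s); together they form a."""
    return tuple(int(rng.choice(n_actions, p=policies[i][s]))
                 for i in range(n_agents))

print(sample_joint_action(0))   # a joint action such as (1, 0)
```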
Visualizing the Interaction
Imagine a simple scenario with two robots (Agent 1 and Agent 2) that must coordinate to push a box to a target location; the state describes the current configuration, such as the positions of the box and both robots.
The diagram shows how the current state s is observed by both agents. They independently choose actions a1 and a2, forming a joint action a. This joint action, along with the state s, dictates the transition probability to the next state s′ and determines the individual rewards r1 and r2 for each agent.
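The interaction in the diagram can be sketched as a single environment step: both agents pick actions given s, and the environment uses the joint action to sample s′ and return the individual rewards. The tables below are random placeholders, and for brevity the rewards here depend only on (s, a) rather than on s′ as well.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 2, 2

# Placeholder dynamics and rewards; R[i][s, a1, a2] is agent i's reward.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))
R = [rng.normal(size=(n_states, n_actions, n_actions)) for _ in range(2)]

def env_step(s, a1, a2):
    """One step of the interaction: sample s' ~ P(.|s, (a1, a2)) and return r1, r2."""
    s_next = int(rng.choice(n_states, p=P[s, a1, a2]))
    return s_next, R[0][s, a1, a2], R[1][s, a1, a2]

s = 0
a1, a2 = 1, 0                   # stand-ins for actions sampled from pi_1 and pi_2
print(env_step(s, a1, a2))      # (next state, r1, r2)
```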
Non-Stationarity Explained Formally
The Stochastic Game formulation makes the non-stationarity problem, mentioned in the chapter introduction, explicit. Consider Agent i's perspective. It chooses action ai based on its policy πi(ai∣s). The environment's response (next state s′ and reward ri) depends on the joint action a=(ai,a−i), where a−i represents the actions of all other agents.
If the other agents' policies π−i change during learning (as they typically do), then the effective transition dynamics P(s′∣s,ai) and reward function Ri(s,ai) seen by Agent i also change, even though the underlying game quantities P(s′∣s,a) and Ri(s,a,s′) are fixed. This violates the stationarity assumption on which standard single-agent RL algorithms like Q-learning rely for their convergence guarantees, making direct application problematic. The environment appears non-stationary because the behavior of the other learning agents is effectively part of Agent i's environment.
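A short sketch makes this concrete. Marginalizing Agent 2's policy out of the fixed joint dynamics gives the effective dynamics Agent 1 actually faces, P_eff(s′|s,a1) = ∑_a2 π2(a2|s) P(s′|s,a1,a2); comparing P_eff under two different versions of π2 shows that Agent 1's effective environment shifts even though P never changes. The arrays and names below are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions = 2, 2

# Fixed joint dynamics P[s, a1, a2, s'] (random placeholder values).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))

def effective_dynamics(pi2):
    """pi2[s, a2] -> P_eff[s, a1, s'] as seen by Agent 1 (sum over a2)."""
    return np.einsum('sb,sabt->sat', pi2, P)

pi2_early = np.array([[0.9, 0.1], [0.9, 0.1]])   # Agent 2 early in learning
pi2_late  = np.array([[0.1, 0.9], [0.1, 0.9]])   # Agent 2 after it has learned

print(np.abs(effective_dynamics(pi2_early) - effective_dynamics(pi2_late)).max())
# A nonzero difference: Agent 1's "environment" has shifted although P never changed.
```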
Types of Multi-Agent Interactions
Stochastic Games can model various interaction types, characterized by the agents' reward structures (a short sketch after the list illustrates how to check each case):
- Fully Cooperative: All agents share the exact same reward function: R1=R2=...=RN. Their goal is collective success. Examples include coordinating robot teams or synchronized tasks.
- Fully Competitive (Zero-Sum): The agents have diametrically opposed goals. For two agents, R1=−R2. In general, the rewards sum to zero across agents: R1(s,a,s′)+R2(s,a,s′)+...+RN(s,a,s′)=0 for all s, a, s′. Classic board games like Chess or Go (ignoring draws) fall into this category.
- Mixed (General-Sum): This is the most general case, encompassing situations with elements of both cooperation and competition. Agents have individual reward functions that are not necessarily aligned or directly opposed. Examples include traffic navigation, resource sharing, or economic markets.
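As a rough illustration of these distinctions, the sketch below checks whether a set of hypothetical per-agent reward tables is fully cooperative or zero-sum; anything that is neither falls into the general-sum case. The tables and function names are placeholders chosen for the example.

```python
import numpy as np

def is_fully_cooperative(R):
    """All agents share the same reward function."""
    return all(np.array_equal(R[0], Ri) for Ri in R[1:])

def is_zero_sum(R):
    """Rewards sum to zero for every state and joint action."""
    return np.allclose(sum(R), 0.0)

# R[i][s, a1, a2]: hand-picked illustrative reward tables.
coop     = [np.ones((2, 2, 2)), np.ones((2, 2, 2))]                    # shared reward
zero_sum = [np.array([[[1., -1.], [-1., 1.]]]),
            np.array([[[-1., 1.], [1., -1.]]])]                        # opposed reward

print(is_fully_cooperative(coop), is_zero_sum(coop))           # True False
print(is_fully_cooperative(zero_sum), is_zero_sum(zero_sum))   # False True
```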
Objective and Solution Concepts
In single-agent RL, the objective is typically to find a policy π that maximizes the expected discounted cumulative reward, represented by the value function V^π(s) or Q-function Q^π(s,a).
In MARL, the objective depends on the game type.
- In cooperative settings, the goal is often to find a joint policy π that maximizes a common objective, like the shared expected return E[ ∑_{t=0}^{∞} γ^t R(s_t, a_t, s_{t+1}) | s_0 = s, π ].
- In competitive or mixed settings, the situation is more complex. Agents might aim to maximize their own reward, potentially leading to concepts from game theory like finding a Nash Equilibrium. A Nash Equilibrium is a joint policy π∗=(π1∗,...,πN∗) such that no single agent i can improve its expected return by unilaterally changing its policy πi∗, assuming all other agents j≠i keep their policies πj∗ fixed (the sketch after this list checks this condition in a simple one-state game).
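To make the Nash condition tangible, here is a sketch in the simplest possible setting: a single-state (matrix) game with deterministic policies, so each "policy" reduces to a single action. The payoff tables encode a small coordination game and are purely illustrative.

```python
import numpy as np

# R1[a1, a2] and R2[a1, a2]: a coordination game where both agents prefer to match.
R1 = np.array([[2., 0.],
               [0., 1.]])
R2 = np.array([[2., 0.],
               [0., 1.]])

def is_nash(a1, a2):
    """No agent can gain by unilaterally deviating from the joint action (a1, a2)."""
    best_for_1 = R1[a1, a2] >= R1[:, a2].max()   # agent 1 fixes a2, varies its own action
    best_for_2 = R2[a1, a2] >= R2[a1, :].max()   # agent 2 fixes a1, varies its own action
    return bool(best_for_1 and best_for_2)

print(is_nash(0, 0))   # True  -- (0, 0) is an equilibrium
print(is_nash(1, 1))   # True  -- so is (1, 1); equilibria need not be unique
print(is_nash(0, 1))   # False -- agent 1 would switch to action 1 given a2 = 1
```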
We can define agent-specific value functions based on the joint policy π (a simple Monte Carlo estimate of these quantities is sketched after the list):
- State-value function for agent i: Vi^π(s) = E_π[ ∑_{t=0}^{∞} γ^t Ri(s_t, a_t, s_{t+1}) | s_0 = s ]
- Action-value function for agent i: Qi^π(s,a) = E_π[ ∑_{t=0}^{∞} γ^t Ri(s_t, a_t, s_{t+1}) | s_0 = s, a_0 = a ]
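Under a fixed joint policy, these value functions can be approximated by plain Monte Carlo rollouts. The sketch below estimates Vi^π(s0) for both agents by averaging truncated discounted returns; the game tables, policies, horizon, and episode count are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions, gamma = 2, 2, 0.9

# Placeholder game and joint policy, as in the earlier sketches.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))
R = [rng.normal(size=(n_states, n_actions, n_actions)) for _ in range(2)]
policies = [rng.dirichlet(np.ones(n_actions), size=n_states) for _ in range(2)]

def estimate_values(s0, episodes=500, horizon=60):
    """Average truncated discounted returns for both agents, starting from s0."""
    returns = np.zeros(2)
    for _ in range(episodes):
        s, discount = s0, 1.0
        for _ in range(horizon):   # gamma**horizon is tiny, so truncation is benign
            a1 = rng.choice(n_actions, p=policies[0][s])
            a2 = rng.choice(n_actions, p=policies[1][s])
            returns += discount * np.array([R[0][s, a1, a2], R[1][s, a1, a2]])
            s = int(rng.choice(n_states, p=P[s, a1, a2]))
            discount *= gamma
    return returns / episodes      # [estimate of V_1^pi(s0), estimate of V_2^pi(s0)]

print(estimate_values(0))
```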
Understanding the Stochastic Game framework is essential because it precisely defines the problem MARL algorithms aim to solve. It highlights the dependencies between agents and provides the mathematical structure needed to analyze agent interactions and the resulting non-stationarity challenge that subsequent sections will address.