As established, Actor-Critic methods aim to leverage the strengths of both value-based and policy-based approaches. They achieve this by maintaining two distinct components, often represented by separate neural networks (or sometimes networks with shared layers), which learn and operate concurrently: the Actor and the Critic.
The Actor: The Policy Maker
The Actor's role is to control how the agent behaves. It directly learns and represents the policy, which is a mapping from a state to an action (or a probability distribution over actions). Think of the Actor as the component responsible for deciding what to do in a given situation.
- Function: Learns a parameterized policy, denoted $\pi_\theta(a \mid s)$, where $\theta$ represents the parameters (e.g., the weights of a neural network).
- Output: For discrete action spaces, the Actor typically outputs probabilities for each possible action. For continuous action spaces, it might output the parameters of a probability distribution (like the mean and standard deviation of a Gaussian distribution) from which the action is sampled.
- Goal: To adjust its parameters $\theta$ such that the policy maximizes the expected cumulative reward. It does this by "listening" to the feedback provided by the Critic.
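To make this concrete, here is a minimal sketch of an Actor for a discrete action space. It is written in PyTorch purely for illustration; the class name, layer sizes, and architecture are assumptions, not a reference implementation. The network maps a state to logits and returns a categorical distribution from which actions are sampled.

```python
import torch.nn as nn
from torch.distributions import Categorical

class Actor(nn.Module):
    """Parameterized policy pi_theta(a|s) for a discrete action space (illustrative sketch)."""
    def __init__(self, state_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, n_actions),  # one logit per action
        )

    def forward(self, state):
        logits = self.net(state)
        return Categorical(logits=logits)  # probability distribution over actions

# Usage sketch:
#   dist = actor(state_tensor)
#   action = dist.sample()
#   log_prob = dist.log_prob(action)  # kept for the policy-gradient update later
```

For a continuous action space, the final layer would instead output distribution parameters (e.g., a mean and a log standard deviation for a Gaussian), but the overall structure stays the same.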
The Critic: The Action Evaluator
The Critic's role is to evaluate the actions taken by the Actor. It doesn't decide on actions itself. Instead, it learns a value function that estimates how good it is to be in a certain state or how good a specific action taken in that state is. Think of the Critic as the component responsible for judging how good the Actor's chosen action was.
- Function: Learns a parameterized value function, often the state-value function $V_\phi(s)$ or the action-value function $Q_\phi(s, a)$, where $\phi$ represents the Critic's parameters. In many modern Actor-Critic variants (like A2C, which we'll see later), the Critic primarily learns the state-value function $V_\phi(s)$.
- Output: A scalar value representing the estimated return (cumulative future reward) from the current state or state-action pair.
- Goal: To accurately estimate the value function under the current policy being followed by the Actor. It learns this by observing the rewards received and state transitions resulting from the Actor's interactions with the environment, typically using methods related to Temporal Difference (TD) learning.
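A matching Critic sketch, again a PyTorch assumption rather than a prescribed architecture, maps a state to a single scalar estimate of $V_\phi(s)$:

```python
import torch.nn as nn

class Critic(nn.Module):
    """State-value function V_phi(s): maps a state to a scalar value estimate (illustrative sketch)."""
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),  # single scalar output: the value estimate
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)
```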
The Interaction Loop
The power of the Actor-Critic architecture comes from the interaction between these two components during the learning process.
- Action Selection: The Actor observes the current state $s_t$ from the environment and selects an action $a_t$ according to its current policy $\pi_\theta(a \mid s_t)$.
- Environment Interaction: The agent performs action $a_t$ in the environment, receiving a reward $r_{t+1}$ and transitioning to the next state $s_{t+1}$.
- Critic Evaluation: The Critic observes the transition $(s_t, a_t, r_{t+1}, s_{t+1})$. It uses this information, often in conjunction with its current value estimates for $s_t$ and $s_{t+1}$, to compute a signal that evaluates the action $a_t$. A common evaluation signal is the TD error or the advantage function $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$, which measures whether the action taken led to a better or worse outcome than expected from state $s_t$.
- Parameter Updates:
- Critic Update: The Critic updates its parameters $\phi$ to improve its value estimates based on the observed reward and next state, typically minimizing a loss based on the TD error (e.g., minimizing the squared TD error $(r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t))^2$).
- Actor Update: The Actor updates its policy parameters $\theta$ based on the evaluation signal provided by the Critic. If the Critic indicates that action $a_t$ led to a better-than-expected outcome (positive TD error or advantage), the Actor adjusts $\theta$ to increase the probability of selecting $a_t$ in state $s_t$ in the future. Conversely, if the outcome was worse than expected, it decreases the probability. This update typically follows the direction suggested by the policy gradient, but scaled by the Critic's evaluation.
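Putting the four steps together, one online update might look like the sketch below. It assumes the illustrative `Actor` and `Critic` classes above, standard `torch.optim` optimizers, and Gymnasium's five-value `env.step` return; the function name and structure are assumptions for exposition, and the TD error doubles as the advantage estimate.

```python
import torch

def actor_critic_step(actor, critic, actor_opt, critic_opt, env, state, gamma=0.99):
    """One online actor-critic update using the TD error as the advantage estimate (sketch)."""
    state_t = torch.as_tensor(state, dtype=torch.float32)

    # 1. Action selection: sample from the current policy pi_theta(a | s_t)
    dist = actor(state_t)
    action = dist.sample()

    # 2. Environment interaction (Gymnasium API assumed)
    next_state, reward, terminated, truncated, _ = env.step(action.item())
    done = terminated or truncated
    next_state_t = torch.as_tensor(next_state, dtype=torch.float32)

    # 3. Critic evaluation: TD error delta = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)
    value = critic(state_t)
    with torch.no_grad():
        next_value = critic(next_state_t) * (1.0 - float(done))  # bootstrap only if not terminal
        td_target = reward + gamma * next_value
    td_error = td_target - value

    # 4a. Critic update: minimize the squared TD error
    critic_loss = td_error.pow(2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # 4b. Actor update: policy gradient scaled by the Critic's (detached) evaluation
    actor_loss = -dist.log_prob(action) * td_error.detach()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    return next_state, reward, done
```

Detaching the TD error in the Actor's loss reflects the division of labor: the Critic's evaluation is treated as a fixed scaling factor for the policy gradient, not something the Actor's update should backpropagate through.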
This interplay allows the Actor to benefit from the learned value function of the Critic. Instead of relying solely on the often noisy returns from entire episodes (like in basic REINFORCE), the Actor gets more immediate and stable feedback from the Critic's TD error or advantage estimates. This generally leads to lower variance in the policy updates and more efficient learning.
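For intuition, the two gradient estimators can be written side by side; here $G_t$ denotes the Monte Carlo return used by REINFORCE, and the only change is the term that scales the score function:

```latex
% REINFORCE: scale the score function by the full Monte Carlo return G_t (high variance)
\nabla_\theta J(\theta) \approx \mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]

% Actor-Critic: scale by the Critic's estimate, e.g. the advantage A(s_t, a_t) or the TD error
\nabla_\theta J(\theta) \approx \mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\right]
```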
Diagram illustrating the flow of information in a typical Actor-Critic setup. The Agent, containing the Actor and Critic, interacts with the Environment. The Actor chooses actions based on the state, the Environment provides rewards and next states, and the Critic evaluates the Actor's actions, providing feedback used to update both components.
By separating the tasks of action selection and action evaluation, Actor-Critic methods provide a flexible and powerful framework. In the following sections, we will look at specific algorithms like Advantage Actor-Critic (A2C) and Asynchronous Advantage Actor-Critic (A3C) that implement these ideas effectively.