As discussed earlier, standard policy gradient methods like REINFORCE update the policy parameters θ based on the sampled return Gt from an entire episode. The gradient estimate often looks like ∇θJ(θ)≈E[∇θlogπθ(At∣St)Gt]. While unbiased, relying on the full Monte Carlo return Gt introduces significant variance because the return depends on all subsequent actions and rewards in the trajectory. A single high or low reward can drastically swing the gradient estimate, slowing down or destabilizing learning.
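To make this concrete, here is a minimal sketch of the REINFORCE surrogate loss described above, assuming a PyTorch setting; the names `reinforce_loss`, `log_probs`, and `rewards` are illustrative (one episode's log-probabilities and rewards), not part of any particular library.

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """REINFORCE surrogate loss: -sum_t log pi_theta(A_t|S_t) * G_t.

    log_probs: tensor of log pi_theta(A_t|S_t) for one episode, shape (T,)
    rewards:   sequence of rewards R_1..R_T for the same episode
    """
    # Compute the full Monte Carlo return G_t for every step (working backwards).
    returns, G = [], 0.0
    for r in reversed(list(rewards)):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))

    # Each log-probability is weighted by the *entire* remaining return,
    # which is exactly the source of the high variance discussed above.
    return -(log_probs * returns).sum()
```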
Actor-Critic methods offer a compelling alternative structure that mitigates this high variance. Instead of a single component learning the policy, we introduce two distinct components, typically implemented as separate function approximators, often neural networks (a minimal sketch follows the list below):
- The Actor: This component is responsible for learning and representing the policy. It takes the current state s as input and outputs a probability distribution over actions (for stochastic policies) or a specific action (for deterministic policies). We denote the actor's policy as πθ(a∣s), parameterized by θ. The actor's goal is to learn the optimal policy by adjusting θ.
- The Critic: This component learns a value function to evaluate the actions chosen by the actor or the states encountered. It takes state s (and sometimes action a) as input and outputs an estimate of value. Common choices for the critic's function are the state-value function Vϕ(s) or the action-value function Qϕ(s,a), parameterized by ϕ. The critic's role is not to select actions but to provide feedback on how good the actor's current policy is.
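As a rough illustration, assuming PyTorch and a discrete action space, the two components might be sketched as separate networks like this (the class names and layer sizes are arbitrary choices, not prescribed by the method):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network pi_theta(a|s): maps a state to a distribution over actions."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        logits = self.net(state)
        return torch.distributions.Categorical(logits=logits)

class Critic(nn.Module):
    """State-value network V_phi(s): maps a state to a scalar value estimate."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)
```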
How They Interact
The core idea is that the actor decides what to do, and the critic evaluates how well it was done. This evaluation then guides the actor's updates. Here's the typical flow:
- Action Selection: The actor observes the current state St and selects an action At according to its policy πθ(At∣St).
- Environment Interaction: The action At is performed in the environment, leading to a reward Rt+1 and the next state St+1.
- Critic Evaluation: The critic uses the transition information (St,At,Rt+1,St+1) to evaluate the actor's action or the resulting state. A common way is to compute the Temporal Difference (TD) error:
δt=Rt+1+γVϕ(St+1)−Vϕ(St)
This TD error measures the discrepancy between the critic's current estimate Vϕ(St) and a potentially better estimate based on the immediate reward and the value of the next state (the TD target Rt+1+γVϕ(St+1)). If the critic estimates Qϕ(s,a), the TD error might be δt=Rt+1+γQϕ(St+1,At+1)−Qϕ(St,At), where At+1 is selected by the actor in state St+1.
- Critic Update: The critic updates its parameters ϕ to minimize this TD error, usually via gradient descent. The goal is to make its value estimates more accurate over time. For a state-value critic, the update might target minimizing (δt)2.
- Actor Update: The actor updates its policy parameters θ in the direction suggested by the critic's evaluation. Instead of using the noisy Monte Carlo return Gt, the actor uses the critic's feedback, often the TD error δt or a related quantity like the advantage A(St,At). The policy gradient update becomes something like:
∇θJ(θ)≈E[∇θlogπθ(At∣St)δt] (or with δt replaced by an advantage estimate A(St,At))
A positive TD error suggests the action At led to a better-than-expected outcome, so the probability of selecting At in state St should be increased. A negative TD error suggests the opposite.
This interaction loop allows the actor and critic to improve concurrently. The critic learns to provide better evaluations, and the actor learns to produce better actions based on those evaluations.
Figure: Basic Actor-Critic interaction loop. The actor selects actions based on the state, the environment responds, and the critic evaluates the outcome, providing feedback used to update both the actor's policy and its own value estimates.
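Putting the steps above together, one online update might look like the following sketch. It assumes PyTorch, reuses the hypothetical Actor and Critic classes sketched earlier, and takes a single transition (St, At, Rt+1, St+1) as input; note that the TD error is detached before the actor update so the critic's parameters are not changed through the actor's loss.

```python
import torch

def actor_critic_step(actor, critic, actor_opt, critic_opt,
                      state, action, reward, next_state, done, gamma=0.99):
    """One online actor-critic update from a single transition."""
    state = torch.as_tensor(state, dtype=torch.float32)
    next_state = torch.as_tensor(next_state, dtype=torch.float32)

    # Critic evaluation: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t).
    v_s = critic(state)
    with torch.no_grad():
        v_next = torch.zeros(()) if done else critic(next_state)
        td_target = reward + gamma * v_next
    td_error = td_target - v_s

    # Critic update: minimize the squared TD error.
    critic_loss = td_error.pow(2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: scale log pi(A_t|S_t) by the (detached) TD error.
    log_prob = actor(state).log_prob(torch.as_tensor(action))
    actor_loss = -(td_error.detach() * log_prob)
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    return td_error.item()
```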
Benefits of the Actor-Critic Structure
Compared to REINFORCE, the primary advantage of the Actor-Critic framework is variance reduction. The TD error δt (or advantage) used for the actor update depends mainly on the immediate reward Rt+1 and the critic's estimate of the next state's value Vϕ(St+1). This estimate, while potentially biased (since Vϕ is learned), is generally much less noisy than the full Monte Carlo return Gt, which accumulates noise over many time steps. Lower variance often leads to faster and more stable convergence.
Furthermore, Actor-Critic methods can learn online, updating the actor and critic after each step (or small batch of steps), unlike basic REINFORCE, which must wait until the end of an episode to compute Gt.
Variants and Considerations
The specific form of the critic's value function significantly impacts the algorithm:
- State-Value Critic (Vϕ(s)): The critic estimates the value of being in a state. The evaluation signal is often the TD error δt=Rt+1+γVϕ(St+1)−Vϕ(St). In expectation, δt equals the advantage A(St,At) when Vϕ matches the true value function, so in practice it serves as a biased but low-variance advantage estimate.
- Action-Value Critic (Qϕ(s,a)): The critic estimates the value of taking action a in state s. This is common in algorithms designed for continuous action spaces (like DDPG) or when a direct estimate of Q-values is needed. The update typically involves the TD error δt=Rt+1+γQϕ(St+1,At+1)−Qϕ(St,At).
- Advantage Function Critic: Some methods explicitly try to estimate the advantage function A(s,a)=Q(s,a)−V(s). Using the advantage directly in the policy gradient update, ∇θJ(θ)≈E[∇θlogπθ(At∣St)A(St,At)], is often beneficial because it provides a relative measure of how much better an action is compared to the average action from that state, further reducing variance. We will see sophisticated methods for estimating the advantage, like Generalized Advantage Estimation (GAE), later in this chapter.
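As a minimal sketch of the state-value variant above, where the one-step TD errors double as advantage estimates, the following hypothetical helper computes them for a batch of transitions; it assumes a batched state-value critic like the one sketched earlier and PyTorch tensors as inputs.

```python
import torch

def one_step_advantages(critic, states, rewards, next_states, dones, gamma=0.99):
    """Advantage estimates A_t ~ delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t), batched.

    rewards and dones are float tensors of shape (N,); states and next_states
    are batched state tensors accepted by the critic.
    """
    with torch.no_grad():
        v = critic(states)               # V_phi(S_t), shape (N,)
        v_next = critic(next_states)     # V_phi(S_{t+1}), shape (N,)
        # Terminal transitions get no bootstrapped value.
        td_targets = rewards + gamma * v_next * (1.0 - dones)
        return td_targets - v            # TD errors double as advantage estimates

# The actor's loss would then be the advantage-weighted negative log-likelihood:
#   actor_loss = -(log_probs * advantages).mean()
```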
In deep reinforcement learning, both the actor πθ and the critic (Vϕ or Qϕ) are usually represented by neural networks. Their parameters, θ and ϕ, are updated using gradient-based optimization techniques based on the principles outlined above.
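For instance, continuing the hypothetical sketches above, each parameter set would typically get its own optimizer (the learning rates and dimensions below are arbitrary):

```python
import torch

# Hypothetical setup tying the earlier sketches together; dimensions are arbitrary.
actor = Actor(state_dim=4, n_actions=2)
critic = Critic(state_dim=4)
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)    # updates theta
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)  # updates phi

# Each environment transition then drives one call such as:
# actor_critic_step(actor, critic, actor_opt, critic_opt,
#                   state, action, reward, next_state, done)
```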
This fundamental Actor-Critic architecture serves as the foundation for many powerful algorithms we will discuss next, including Advantage Actor-Critic (A2C/A3C), Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO), and Soft Actor-Critic (SAC). These methods refine the basic structure by improving how the advantage is estimated, how the updates are performed for better stability, or how exploration is handled.