Building upon the fundamental actor-critic architecture, we now examine specific algorithms that significantly improved stability and performance: Advantage Actor-Critic (A2C) and its asynchronous counterpart, Asynchronous Advantage Actor-Critic (A3C). Both methods leverage the advantage function to reduce the variance inherent in basic policy gradient estimates.
Recall the policy gradient theorem, which suggests updating the policy parameters $\theta$ in the direction of $\nabla_\theta \log \pi_\theta(a \mid s)$ weighted by some measure of the action's goodness. While using the full return $G_t$ is unbiased, it suffers from high variance. Using a baseline, such as the state-value function $V(s)$, leads to the advantage function:
$$A(s, a) = Q(s, a) - V(s)$$

The policy gradient update then becomes:
$$\nabla_\theta J(\theta) \approx \mathbb{E}_{s \sim d^\pi,\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, A(s, a)\right]$$

Intuitively, the advantage $A(s, a)$ measures how much better action $a$ is compared to the average action taken from state $s$. Using the advantage centers the gradient signal, encouraging good actions (positive advantage) and discouraging bad ones (negative advantage), leading to more stable updates.
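As a concrete illustration of this update, the short sketch below computes the advantage-weighted policy gradient loss for a batch of transitions. It assumes PyTorch, a discrete action space, a `policy_net` that outputs action logits, and precomputed advantage estimates; these names are illustrative, not from the original text.

```python
import torch
from torch.distributions import Categorical

def policy_gradient_loss(policy_net, states, actions, advantages):
    """Advantage-weighted policy gradient loss (to be minimized).

    states:     (batch, obs_dim) float tensor of observations
    actions:    (batch,) long tensor of the actions that were taken
    advantages: (batch,) float tensor of advantage estimates A(s, a)
    """
    logits = policy_net(states)                 # unnormalized action preferences
    dist = Categorical(logits=logits)           # pi_theta(. | s)
    log_probs = dist.log_prob(actions)          # log pi_theta(a | s)
    # Detach the advantages so only the actor's parameters receive this gradient.
    return -(log_probs * advantages.detach()).mean()
```

Minimizing this loss with gradient descent pushes up the log-probability of actions with positive advantage and pushes down the log-probability of actions with negative advantage.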
In practice, we don't know the true $Q(s, a)$ and $V(s)$. Actor-critic methods use a learned critic $V_\phi(s)$ (parameterized by $\phi$) to approximate $V(s)$. A common estimate for the advantage function at time $t$ uses the Temporal Difference (TD) error:
$$A_{\text{est}}(s_t, a_t) \approx r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) = \delta_t$$

Here, $r_{t+1} + \gamma V_\phi(s_{t+1})$ serves as an estimate of $Q(s_t, a_t)$, and $V_\phi(s_t)$ is the baseline provided by the critic. The critic itself is trained to minimize the error in this value prediction, typically using a Mean Squared Error (MSE) loss on the TD error:
$$L(\phi) = \mathbb{E}\left[\left(r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\right)^2\right]$$

A2C and A3C primarily differ in how they gather experience and apply these updates.
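Before turning to those differences, here is a minimal sketch of the shared update machinery, assuming PyTorch, separate `policy_net` and `value_net` modules, and batched tensors of transitions; all names and shapes are illustrative assumptions rather than a prescribed implementation.

```python
import torch
import torch.nn.functional as F

def actor_critic_losses(policy_net, value_net, states, actions,
                        rewards, next_states, dones, gamma=0.99):
    """One-step TD-error advantage plus the corresponding actor and critic losses.

    states, next_states: (batch, obs_dim) float tensors
    actions:             (batch,) long tensor of chosen action indices
    rewards, dones:      (batch,) float tensors (dones is 1.0 where the episode ended)
    """
    values = value_net(states).squeeze(-1)                    # V_phi(s_t)
    with torch.no_grad():                                     # targets are held fixed
        next_values = value_net(next_states).squeeze(-1)      # V_phi(s_{t+1})
        td_target = rewards + gamma * next_values * (1.0 - dones)

    td_error = td_target - values                             # delta_t, the advantage estimate

    critic_loss = F.mse_loss(values, td_target)               # MSE on the value prediction
    log_probs = torch.distributions.Categorical(
        logits=policy_net(states)).log_prob(actions)          # log pi_theta(a_t | s_t)
    actor_loss = -(log_probs * td_error.detach()).mean()      # advantage-weighted PG loss
    return actor_loss, critic_loss
```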
A2C is a synchronous, parallelized version of the advantage actor-critic concept. It works as follows:

1. Several actors run in parallel, each interacting with its own copy of the environment using the current shared policy.
2. Each actor collects a short, fixed-length segment of experience (a rollout).
3. The experience from all actors is gathered into a single batch, and advantage estimates, the actor loss, and the critic loss are computed on that batch.
4. One gradient update is applied to the central network, after which all actors continue from the updated parameters.
A key characteristic of A2C is its synchronicity: all actors wait for each other before the central network update occurs. This typically leads to more effective use of GPUs, which excel at processing large batches of data simultaneously. Although A3C was presented first, A2C is often found to be simpler to implement and can achieve comparable or better performance, especially when GPU resources are available.
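The sketch below shows what this synchronous loop can look like with vectorized environments: every environment advances in lockstep, and a single gradient step is taken on the combined batch. It assumes PyTorch and Gymnasium's `SyncVectorEnv`; the network sizes, hyperparameters, and CartPole environment are illustrative choices, not part of the original text.

```python
import gymnasium as gym
import torch
import torch.nn as nn

# Illustrative setup: several environments, all stepped in lockstep.
num_envs, rollout_len, gamma = 8, 5, 0.99
envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1")] * num_envs)
obs_dim = envs.single_observation_space.shape[0]
n_actions = envs.single_action_space.n

policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(
    list(policy_net.parameters()) + list(value_net.parameters()), lr=7e-4)

obs, _ = envs.reset()
for update in range(1000):
    log_probs, values, rewards, masks = [], [], [], []
    for _ in range(rollout_len):                        # all actors advance together
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        dist = torch.distributions.Categorical(logits=policy_net(obs_t))
        actions = dist.sample()
        obs, rew, term, trunc, _ = envs.step(actions.numpy())
        log_probs.append(dist.log_prob(actions))
        values.append(value_net(obs_t).squeeze(-1))
        rewards.append(torch.as_tensor(rew, dtype=torch.float32))
        masks.append(torch.as_tensor(1.0 - (term | trunc), dtype=torch.float32))

    # Bootstrap from the latest observation and form n-step returns backwards.
    with torch.no_grad():
        R = value_net(torch.as_tensor(obs, dtype=torch.float32)).squeeze(-1)
    actor_loss, critic_loss = 0.0, 0.0
    for t in reversed(range(rollout_len)):
        R = rewards[t] + gamma * masks[t] * R
        advantage = R - values[t]
        critic_loss = critic_loss + advantage.pow(2).mean()
        actor_loss = actor_loss - (log_probs[t] * advantage.detach()).mean()

    # One synchronous gradient step on the batch gathered from every actor.
    optimizer.zero_grad()
    (actor_loss + 0.5 * critic_loss).backward()
    optimizer.step()
```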
A3C introduced the idea of asynchronous updates to stabilize learning without relying on an experience replay buffer as DQN does. It employs multiple workers, each with its own environment instance and a local copy of the actor-critic network. These workers operate independently and update a shared global network.
Asynchronous training structure in A3C. Multiple workers interact with their environments, compute gradients locally using parameters pulled from the global network, and then push these gradients to update the global network asynchronously.
The A3C workflow for each worker looks like this:

1. Synchronize: copy the latest parameters from the global network into the worker's local network.
2. Interact: run the local policy in the worker's own environment for a small number of steps (or until the episode ends).
3. Compute: estimate advantages from this local rollout and compute the gradients of the actor and critic losses with respect to the local parameters.
4. Update: push these gradients to the global network, which applies them immediately, without waiting for the other workers.
5. Repeat from step 1.
The core idea behind A3C's effectiveness is that the asynchronous updates from multiple workers exploring different parts of the environment help to decorrelate the data used for training. Each worker's experience stream is likely different, and applying updates asynchronously prevents the instability that can arise from strongly correlated updates in simpler online RL methods. This often leads to faster training times on multi-core CPUs compared to single-threaded algorithms or even A2C (when CPU-bound).
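A condensed sketch of one such worker is shown below, assuming PyTorch, a two-headed model that returns `(logits, value)`, a global model held in shared memory (for example via `torch.multiprocessing`), and an optimizer constructed over the global parameters; every name here is an illustrative assumption.

```python
import torch

def a3c_worker(global_model, optimizer, make_env, make_model, gamma=0.99, t_max=5):
    """One asynchronous worker: sync with the global net, roll out, push gradients."""
    env = make_env()
    local_model = make_model()                 # same architecture as the global model
    obs, _ = env.reset()

    while True:
        # 1. Synchronize: copy the latest global parameters into the local copy.
        local_model.load_state_dict(global_model.state_dict())

        # 2. Interact for up to t_max steps (or until the episode ends).
        rollout, done = [], False
        for _ in range(t_max):
            logits, value = local_model(torch.as_tensor(obs, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            obs, reward, terminated, truncated, _ = env.step(action.item())
            rollout.append((dist.log_prob(action), value.squeeze(-1), reward))
            done = terminated or truncated
            if done:
                obs, _ = env.reset()
                break

        # 3. Compute n-step returns backwards, then the actor and critic losses.
        if done:
            R = torch.tensor(0.0)
        else:
            with torch.no_grad():
                _, next_value = local_model(torch.as_tensor(obs, dtype=torch.float32))
            R = next_value.squeeze(-1)
        actor_loss, critic_loss = 0.0, 0.0
        for log_prob, value, reward in reversed(rollout):
            R = reward + gamma * R
            advantage = R - value
            critic_loss = critic_loss + advantage.pow(2)
            actor_loss = actor_loss - log_prob * advantage.detach()

        # 4. Push local gradients to the global model and apply them immediately.
        optimizer.zero_grad()
        (actor_loss + 0.5 * critic_loss).backward()
        for local_p, global_p in zip(local_model.parameters(),
                                     global_model.parameters()):
            global_p._grad = local_p.grad      # hand this worker's gradients to the global net
        optimizer.step()                       # optimizer was built on global_model.parameters()
```

In a full implementation each worker runs in its own process or thread, and the gradient hand-off at the end is what lets many workers update the same global network without any synchronization barrier.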
| Feature | A2C (Advantage Actor-Critic) | A3C (Asynchronous Advantage Actor-Critic) |
|---|---|---|
| Updates | Synchronous | Asynchronous |
| Parallelism | Waits for all actors before updating | Workers update the global network independently |
| Hardware Use | Efficient on GPUs (batch processing) | Efficient on multi-core CPUs (parallel execution) |
| Data Decorrelation | Achieved via batching diverse trajectories | Achieved via asynchronous updates from workers |
| Implementation | Generally simpler | More complex (threading, parameter server) |
| Stability | Often considered slightly more stable/reproducible | Can sometimes suffer from stale gradients |
While A3C was a groundbreaking algorithm that demonstrated the power of asynchronous training, A2C (often implemented with Generalized Advantage Estimation, GAE, for advantage estimation) has become a very popular and strong baseline. It often achieves similar or better results than A3C, particularly in GPU-accelerated environments, and tends to be easier to implement and debug.
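For reference, GAE blends TD errors over multiple horizons with an exponential weighting controlled by $\lambda$, trading a little bias for lower variance than the one-step estimate. A minimal sketch of that computation for a single rollout segment follows, again assuming PyTorch with illustrative names and default values.

```python
import torch

def gae_advantages(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout segment.

    rewards, values, dones: 1-D tensors of length T (values are detached critic outputs,
                            dones is 1.0 where the episode ended at that step)
    last_value:             V(s_T), used to bootstrap beyond the segment
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * not_done - values[t]  # TD error delta_t
        gae = delta + gamma * lam * not_done * gae                      # discounted sum of deltas
        advantages[t] = gae
        next_value = values[t]
    return advantages
```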
Both A2C and A3C represent significant steps forward from basic policy gradient methods by effectively using a critic to estimate the advantage function, thereby reducing gradient variance and enabling more stable and efficient learning. They form the foundation upon which further refinements like TRPO and PPO were built.