Building on the general Actor-Critic framework, where an actor learns the policy and a critic learns a value function, we now examine a specific and popular implementation: Advantage Actor-Critic (A2C). Standard policy gradient methods like REINFORCE update policy parameters based on the total return $G_t$. This estimate is unbiased, but using the full return often introduces high variance into the gradient estimates, making learning unstable and slow. The core idea behind A2C is to use a more informative signal than the raw return to guide policy updates.
Instead of simply asking "was this overall trajectory good or bad?", A2C focuses on "was this specific action $a_t$ taken in state $s_t$ better or worse than expected?". This relative measure is captured by the Advantage function, defined as:
$$A(s, a) = Q(s, a) - V(s)$$

Here, $Q(s,a)$ is the action-value function (the expected return after taking action $a$ in state $s$) and $V(s)$ is the state-value function (the expected return from state $s$ following the current policy). The advantage $A(s,a)$ quantifies how much better action $a$ is compared to the average action taken from state $s$ under the current policy $\pi$. If $A(s,a) > 0$, the action $a$ was better than average; if $A(s,a) < 0$, it was worse. Using the advantage as the scaling factor for policy updates provides a lower-variance signal compared to using just the return $G_t$, because the baseline $V(s)$ subtracts out the average value of the state, focusing the update on the relative quality of the chosen action.
In practice, we don't have access to the true $Q(s,a)$ and $V(s)$. A2C uses the critic's learned value function estimate, $V_\phi(s)$ (parameterized by $\phi$), to approximate the advantage. A common way to estimate the advantage for a transition $(s_t, a_t, r_{t+1}, s_{t+1})$ relies on the Temporal Difference (TD) error:
$$A(s_t, a_t) \approx r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$

Here, $r_{t+1} + \gamma V_\phi(s_{t+1})$ serves as an estimate of $Q(s_t, a_t)$ (the immediate reward plus the discounted estimated value of the next state). Subtracting the critic's estimate of the current state's value, $V_\phi(s_t)$, gives us the TD error, which acts as a one-step estimate of the advantage. While multi-step returns can sometimes provide better estimates, this single-step TD error is frequently used due to its simplicity and lower variance.
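As a concrete illustration, the sketch below computes this one-step advantage estimate for a batch of transitions. It assumes PyTorch and a hypothetical `critic` module that maps a batch of states to scalar value estimates; the names and tensor shapes are illustrative, not a reference implementation.

```python
import torch

@torch.no_grad()  # the advantage is used as a fixed weight, not differentiated here
def one_step_advantage(critic, states, rewards, next_states, dones, gamma=0.99):
    """A(s_t, a_t) ~= r_{t+1} + gamma * V_phi(s_{t+1}) - V_phi(s_t)."""
    values = critic(states).squeeze(-1)
    next_values = critic(next_states).squeeze(-1)
    # No bootstrapping from a next state on terminal transitions
    td_target = rewards + gamma * next_values * (1.0 - dones)
    return td_target - values
```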
A common implementation pattern for A2C involves a neural network architecture where initial layers are shared between the actor and the critic, followed by separate output heads.
Diagram illustrating a typical A2C network with shared layers feeding into separate actor (policy) and critic (value) output heads.
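A minimal PyTorch sketch of this shared-body layout might look as follows; the layer sizes, activation choices, and the discrete-action policy head are illustrative assumptions rather than part of any particular reference implementation.

```python
import torch
import torch.nn as nn

class ActorCriticNet(nn.Module):
    """Shared feature extractor with separate policy (actor) and value (critic) heads."""

    def __init__(self, obs_dim, n_actions, hidden_dim=128):
        super().__init__()
        # Layers shared by both the actor and the critic
        self.shared = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
        )
        # Actor head: logits over discrete actions (defines pi_theta)
        self.policy_head = nn.Linear(hidden_dim, n_actions)
        # Critic head: scalar state-value estimate V_phi(s)
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, obs):
        features = self.shared(obs)
        logits = self.policy_head(features)
        value = self.value_head(features).squeeze(-1)
        return logits, value
```

With shared layers, $\theta$ and $\phi$ are no longer fully disjoint parameter sets: the shared weights receive gradients from both the actor loss and the critic loss.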
The learning process involves updating both the actor and critic parameters, typically using gradient descent based on data collected from interacting with the environment.
The critic's parameters $\phi$ are updated to minimize the difference between its value estimate $V_\phi(s_t)$ and a target value, usually the TD target $r_{t+1} + \gamma V_\phi(s_{t+1})$. The loss function for the critic is typically the mean squared error:
$$L_{\text{critic}}(\phi) = \mathbb{E}_t\left[\left(r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\right)^2\right]$$

Minimizing this loss encourages the critic $V_\phi(s)$ to become a more accurate predictor of the expected discounted future rewards. Note that the target value $r_{t+1} + \gamma V_\phi(s_{t+1})$ uses the current critic parameters $\phi$; this is a form of bootstrapping.
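Assuming value estimates produced by a network like the one sketched above, the critic loss can be written directly as a mean squared TD error. In practice the bootstrap target is usually treated as a constant (detached), so gradients flow only through $V_\phi(s_t)$; the helper below is a sketch under that convention.

```python
import torch
import torch.nn.functional as F

def critic_loss(values, rewards, next_values, dones, gamma=0.99):
    """MSE between V_phi(s_t) and the TD target r_{t+1} + gamma * V_phi(s_{t+1})."""
    # Stop gradients through the bootstrap target (common practice)
    td_target = (rewards + gamma * next_values * (1.0 - dones)).detach()
    return F.mse_loss(values, td_target)
```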
The actor's parameters $\theta$ are updated using a policy gradient approach, but scaled by the advantage estimate derived from the critic. The objective is to increase the probability of actions that lead to positive advantages and decrease the probability of actions leading to negative advantages. The gradient ascent update direction for the actor is based on:
$$\nabla_\theta J(\theta) \approx \mathbb{E}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\right]$$

where $A(s_t, a_t)$ is the estimated advantage, $r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$. Notice that the advantage term $A(s_t, a_t)$ is treated as a constant when calculating the gradient with respect to $\theta$; we only differentiate the policy term $\log \pi_\theta(a_t \mid s_t)$. This update directly pushes the policy towards actions deemed better than average by the critic.
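In an automatic-differentiation framework, "treat the advantage as a constant" is implemented by detaching it from the computation graph, as in this sketch; the `log_probs` tensor is assumed to come from the policy distribution produced by the actor head.

```python
import torch

def actor_loss(log_probs, advantages):
    """Policy-gradient loss: -E[log pi_theta(a_t | s_t) * A(s_t, a_t)].

    Detaching the advantage ensures gradients with respect to theta
    flow only through the log-probability term.
    """
    return -(log_probs * advantages.detach()).mean()
```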
In practice, the updates often happen simultaneously by minimizing a combined loss function:
$$L(\theta, \phi) = L_{\text{actor}}(\theta) + \beta\, L_{\text{critic}}(\phi)$$

where $L_{\text{actor}}(\theta) = -\mathbb{E}_t\left[\log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\right]$ (note the negative sign for minimization), and $\beta$ is a hyperparameter weighting the critic loss (often set to 0.5 or 1.0). Sometimes, an entropy bonus term for the policy is added to the actor loss to encourage exploration, which can prevent the policy from converging prematurely to a suboptimal deterministic strategy.
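Putting the pieces together, a combined objective for one batch of transitions might look like the following sketch. It assumes a model shaped like the `ActorCriticNet` above (returning logits and values), and the coefficients `value_coef` ($\beta$) and `entropy_coef` are hypothetical hyperparameters.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

def a2c_loss(model, states, actions, rewards, next_states, dones,
             gamma=0.99, value_coef=0.5, entropy_coef=0.01):
    """Combined A2C loss: actor loss + beta * critic loss - entropy bonus."""
    logits, values = model(states)
    dist = Categorical(logits=logits)
    log_probs = dist.log_prob(actions)

    # Bootstrap target computed without tracking gradients
    with torch.no_grad():
        _, next_values = model(next_states)
        td_target = rewards + gamma * next_values * (1.0 - dones)

    # Advantage is a constant with respect to the parameters
    advantages = td_target - values.detach()

    policy_loss = -(log_probs * advantages).mean()
    value_loss = F.mse_loss(values, td_target)
    entropy_bonus = dist.entropy().mean()

    return policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus
```

A single optimizer step on this combined loss updates the shared layers, the actor head, and the critic head together.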
The standard A2C algorithm typically employs multiple parallel workers (actors) that collect experience from separate copies of the environment simultaneously. Each worker runs for a fixed number of steps; their experience and advantage estimates are then combined into one batch (equivalently, their gradients are averaged), and a single, synchronous update is performed on the shared model parameters $\theta$ and $\phi$. This synchronous nature distinguishes it from its asynchronous counterpart, A3C, which we will discuss next. Waiting for all workers before each update can lead to more stable training than fully asynchronous methods and makes more effective use of GPUs, since each update is computed over one large batch.
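A rough sketch of this synchronous loop is shown below. It assumes the `a2c_loss` helper above, plus a hypothetical list of environment objects whose `step(action)` returns `(next_obs, reward, done)` and whose `reset()` returns an initial observation; these are illustrative assumptions, not a specific library API.

```python
import torch

def collect_and_update(model, optimizer, envs, obs_batch, n_steps=5, gamma=0.99):
    """One synchronous A2C iteration over several parallel environments.

    `obs_batch` holds the current observation for every environment,
    as a float tensor of shape (num_envs, obs_dim).
    """
    transitions = []
    for _ in range(n_steps):
        with torch.no_grad():
            logits, _ = model(obs_batch)
            actions = torch.distributions.Categorical(logits=logits).sample()
        next_obs, rewards, dones = [], [], []
        for env, a in zip(envs, actions.tolist()):
            o, r, d = env.step(a)          # hypothetical env interface
            if d:
                o = env.reset()            # restart finished episodes
            next_obs.append(torch.as_tensor(o, dtype=torch.float32))
            rewards.append(float(r))
            dones.append(float(d))
        next_obs = torch.stack(next_obs)
        transitions.append((obs_batch, actions,
                            torch.tensor(rewards), next_obs, torch.tensor(dones)))
        obs_batch = next_obs

    # Single synchronous update over the batched experience from all workers
    states, actions, rewards, next_states, dones = map(torch.cat, zip(*transitions))
    loss = a2c_loss(model, states, actions, rewards, next_states, dones, gamma=gamma)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return obs_batch  # carry the latest observations into the next iteration
```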
By leveraging the critic's value estimates to compute the advantage, A2C provides a more stable and often more efficient learning signal for the actor compared to vanilla policy gradient methods like REINFORCE. It combines the benefits of policy-based methods (direct policy optimization, handling continuous actions) with the variance reduction techniques inspired by value-based methods.