While standard Actor-Critic methods provide a way to reduce variance compared to REINFORCE, applying them directly to environments with continuous action spaces presents challenges. Algorithms like A2C often rely on sampling actions from a stochastic policy π(a∣s) and computing expectations, which becomes complex when the action space is infinite.
Deep Deterministic Policy Gradient (DDPG) offers an elegant solution by adapting ideas from Deep Q-Networks (DQN) into an off-policy actor-critic framework specifically designed for continuous control. It was introduced by Lillicrap et al. in 2015 and has been influential in tackling problems like robotic manipulation and autonomous driving simulators.
The core modification in DDPG is the use of a deterministic policy, denoted as μ(s; θ^μ). Instead of outputting a probability distribution over actions, the actor network directly maps a state s to a specific action a:
$$a = \mu(s;\, \theta^\mu)$$

Here, θ^μ represents the parameters of the actor network. This shift simplifies the policy gradient computation significantly. Recall that the standard policy gradient theorem involves an expectation over actions sampled from the policy. For a deterministic policy, this expectation over actions disappears, and the gradient of the objective function J(θ^μ) (the expected return) can be derived using the chain rule, leveraging the critic's evaluation of the actor's chosen action.
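As a concrete sketch, here is what such a deterministic actor might look like in PyTorch. The layer sizes, `state_dim`, `action_dim`, and `max_action` are illustrative assumptions rather than values prescribed by the algorithm:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy: maps a state s directly to an action a = mu(s; theta_mu)."""
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # squash output to [-1, 1]
        )
        self.max_action = max_action  # scale to the environment's action bounds

    def forward(self, state):
        return self.max_action * self.net(state)
```

The tanh output keeps actions bounded, which is a common (though not mandatory) design choice for continuous control environments with box-shaped action spaces.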
DDPG maintains two primary neural networks, plus target versions of each:

- The actor μ(s; θ^μ), which deterministically maps a state to an action.
- The critic Q(s, a; θ^Q), which estimates the value of a state-action pair.
- Slowly updated target copies of both, the target actor μ′ and the target critic Q′, used to compute stable training targets.
The overall flow of the DDPG architecture is as follows:
The DDPG agent interacts with the environment, storing experiences in a replay buffer. Updates involve sampling from the buffer, calculating target values using target networks, and adjusting actor and critic parameters based on the TD error and the policy gradient, respectively. Target networks are updated slowly towards the main networks.
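This interaction and update cycle can be sketched as a high-level training loop. All names here (`env` with a classic Gym-style interface, `select_action`, `replay_buffer`, `update_critic`, `update_actor`, `soft_update`, and the hyperparameters) are illustrative assumptions; the update calls are schematic, and concrete sketches of the individual pieces appear in the sections below with slightly different signatures:

```python
# High-level DDPG training loop (illustrative sketch; helpers are defined later).
state = env.reset()
for step in range(max_steps):
    # Act with exploration noise and store the transition in the replay buffer.
    action = select_action(actor, state, explore=True)
    next_state, reward, done, info = env.step(action)
    replay_buffer.add(state, action, reward, next_state, done)
    state = env.reset() if done else next_state

    # Once enough experience is collected, update from a random mini-batch.
    if len(replay_buffer) >= batch_size:
        batch = replay_buffer.sample(batch_size)
        update_critic(batch)                 # minimize the MSBE (see below)
        update_actor(batch)                  # deterministic policy gradient step (see below)
        soft_update(target_critic, critic)   # targets slowly track the main networks
        soft_update(target_actor, actor)
```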
The critic network Q(s, a; θ^Q) is trained similarly to the Q-network in DQN. It aims to minimize the Mean Squared Bellman Error (MSBE). We sample a mini-batch of transitions (s_i, a_i, r_i, s_i′) from the replay buffer D. For each transition, we compute the target value y_i:
$$y_i = r_i + \gamma\, Q'\!\left(s_i',\, \mu'(s_i';\, \theta^{\mu'});\, \theta^{Q'}\right)$$

Note that the target value calculation uses the target actor network μ′ to select the next action a_i′ = μ′(s_i′; θ^μ′) and the target critic network Q′ to evaluate the value of that next state-action pair. This use of target networks (Q′ and μ′) decouples the target value from the network being actively trained (Q), significantly improving stability, just like in DQN.
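A minimal sketch of this target computation, assuming `target_actor` and `target_critic` networks and batched tensors `reward`, `next_state`, and `done` (all names are illustrative):

```python
import torch

def compute_targets(target_actor, target_critic, reward, next_state, done, gamma=0.99):
    """Compute y_i = r_i + gamma * Q'(s'_i, mu'(s'_i)) for a batch of transitions."""
    with torch.no_grad():  # targets must not propagate gradients into the target networks
        next_action = target_actor(next_state)              # a'_i = mu'(s'_i; theta_mu')
        target_q = target_critic(next_state, next_action)   # Q'(s'_i, a'_i; theta_Q')
        # (1 - done) zeroes the bootstrap term at terminal states -- a standard
        # practical detail not shown explicitly in the equation above.
        return reward + gamma * (1.0 - done) * target_q
```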
The critic's loss function is then the mean squared difference between the target values y_i and the critic's current estimate Q(s_i, a_i; θ^Q):
$$L(\theta^Q) = \frac{1}{N} \sum_i \left(y_i - Q(s_i, a_i;\, \theta^Q)\right)^2$$

This loss is minimized using gradient descent with respect to the critic parameters θ^Q.
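Continuing the sketch, one critic update step might look like the following, assuming a `critic` network, its `critic_optimizer`, the batched tensors `state` and `action`, and the `y` produced by the target computation above:

```python
import torch.nn.functional as F

def update_critic(critic, critic_optimizer, state, action, y):
    """One gradient step on the mean squared Bellman error for the critic."""
    current_q = critic(state, action)        # Q(s_i, a_i; theta_Q)
    critic_loss = F.mse_loss(current_q, y)   # (1/N) * sum_i (y_i - Q(s_i, a_i))^2

    critic_optimizer.zero_grad()
    critic_loss.backward()
    critic_optimizer.step()
    return critic_loss.item()
```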
The actor network μ(s; θ^μ) is updated using the deterministic policy gradient. The objective is to adjust the actor's parameters θ^μ to produce actions that maximize the expected Q-value according to the current critic. The gradient of the actor's objective function J(θ^μ) is approximated using the mini-batch from the replay buffer:
$$\nabla_{\theta^\mu} J(\theta^\mu) \approx \frac{1}{N} \sum_i \nabla_a Q(s_i, a;\, \theta^Q)\Big|_{a=\mu(s_i;\,\theta^\mu)} \; \nabla_{\theta^\mu}\, \mu(s_i;\, \theta^\mu)$$

This looks complicated, but intuitively it means:

- The term ∇_a Q(s_i, a; θ^Q) asks the critic: how should the action change to increase the estimated value?
- The term ∇_{θ^μ} μ(s_i; θ^μ) asks the actor: how should the parameters change to move the output action in that direction?
- Multiplying the two (the chain rule) gives the parameter update that pushes the actor toward higher-value actions, averaged over the mini-batch.
This update uses the critic Q as an evaluator, guiding the actor towards better actions without needing to sample actions and estimate expectations directly, making it suitable for continuous spaces.
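In practice, automatic differentiation handles the chain rule: feeding the actor's action into the critic and ascending the resulting Q-value implements the gradient above. A sketch of one actor update step, reusing the illustrative names from the previous snippets:

```python
def update_actor(actor, critic, actor_optimizer, state):
    """One gradient step pushing the actor toward actions the critic rates highly."""
    # Maximizing Q(s, mu(s)) is implemented as minimizing its negative mean.
    # Autograd applies the chain rule: dQ/da * da/d(theta_mu).
    actor_loss = -critic(state, actor(state)).mean()

    actor_optimizer.zero_grad()
    actor_loss.backward()   # gradients flow through the critic into the actor's parameters,
    actor_optimizer.step()  # but only the actor's optimizer updates its weights
    return actor_loss.item()
```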
DDPG is an off-policy algorithm. Like DQN, it uses two mechanisms borrowed from value-based methods to improve stability and sample efficiency:

- An experience replay buffer, which stores past transitions and allows updates from randomly sampled mini-batches, breaking temporal correlations and reusing data.
- Target networks for both the actor and the critic, updated slowly towards the main networks to keep the training targets stable.

Both mechanisms are sketched below.
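Here is a minimal sketch of both pieces: a uniform replay buffer and the soft (Polyak) target update with a small coefficient τ (the original paper used 0.001). The class and function names are illustrative:

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay: store transitions, sample random mini-batches."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def soft_update(target_net, main_net, tau=0.001):
    """Polyak update: theta' <- tau * theta + (1 - tau) * theta', applied after each update."""
    for target_param, param in zip(target_net.parameters(), main_net.parameters()):
        target_param.data.copy_(tau * param.data + (1.0 - tau) * target_param.data)
```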
Since the policy μ(s; θ^μ) is deterministic, it will always output the same action for a given state if left alone, which prevents exploration. To ensure the agent explores the environment sufficiently, DDPG adds noise to the actor's output action during training only:
$$a_t = \mu(s_t;\, \theta^\mu) + \mathcal{N}_t$$

The noise process 𝒩_t can be simple Gaussian noise, but the original DDPG paper used an Ornstein-Uhlenbeck (OU) process, which generates temporally correlated noise. This correlated noise can be helpful for exploration in physical control tasks where momentum is a factor. However, simpler uncorrelated Gaussian noise often works well too and is easier to implement. The scale of the noise is typically decayed over the course of training. During evaluation or deployment, this noise is turned off, and the agent acts purely based on a_t = μ(s_t; θ^μ).
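A sketch of action selection using the simpler Gaussian-noise alternative, assuming an action space bounded to [-max_action, max_action] and a `noise_scale` that the surrounding training loop decays over time (all names are illustrative):

```python
import numpy as np
import torch

def select_action(actor, state, noise_scale=0.1, max_action=1.0, explore=True):
    """mu(s) plus Gaussian exploration noise during training; pure mu(s) at evaluation."""
    with torch.no_grad():
        action = actor(torch.as_tensor(state, dtype=torch.float32)).numpy()
    if explore:
        action = action + noise_scale * np.random.randn(*action.shape)
    # Clip back into the environment's action bounds after adding noise.
    return np.clip(action, -max_action, max_action)
```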
DDPG combines the actor-critic structure with insights from DQN (replay buffer, target networks) to create an off-policy algorithm effective for continuous action spaces.
Strengths:

- Handles continuous (and high-dimensional) action spaces directly, where discretization or sampling-based policy gradients struggle.
- Off-policy learning with a replay buffer gives better sample efficiency than on-policy methods like A2C.
- The deterministic policy gradient avoids estimating expectations over actions, keeping the actor update simple.
Weaknesses:

- Training can be brittle and is sensitive to hyperparameters such as learning rates, noise scale, and the target update rate.
- The critic tends to overestimate Q-values, which can destabilize learning (a problem TD3 specifically addresses).
- Exploration relies entirely on externally added noise, which may be insufficient in tasks requiring more structured exploration.
Despite its weaknesses, DDPG represented a significant step forward for deep reinforcement learning in continuous domains and provides a foundation for understanding more recent algorithms like TD3 and SAC.