While adding random perturbations directly to an agent's selected actions (action space noise) is a common exploration strategy, particularly in continuous control, it can sometimes lead to erratic and inefficient exploration. This is because the noise applied at each timestep is often independent, causing the agent's behavior to rapidly fluctuate without consistent direction. Consider an alternative: what if we introduced noise not to the final action, but to the decision-making process itself, specifically to the parameters of the policy network? This is the core idea behind parameter space noise.
Instead of sampling actions from a noisy version of the policy's output distribution or adding noise to a deterministic action, parameter space noise involves perturbing the weights of the policy network directly. Let the policy be represented by a function πθ, parameterized by weights θ. With parameter space noise, we sample a noise vector ϵ (typically from a Gaussian distribution N(0,σ2I)) and modify the policy parameters for a certain duration, often an entire episode. The agent then acts according to this perturbed policy πθ+ϵ.
Action noise:      aₜ = πθ(sₜ) + noise
Parameter noise:   aₜ = πθ+ϵ(sₜ),  where ϵ ∼ N(0, σ²I) is sampled periodically

The key difference lies in the temporal consistency of the exploration. Because the parameters θ+ϵ remain fixed for the duration of an episode (or several timesteps), the resulting behavior, while exploratory, is more consistent and structured compared to adding independent noise at each step. Think of it as exploring with a slightly different, but temporarily fixed, "personality" in each episode.
Noise injection points for Action Space Noise versus Parameter Space Noise. Parameter noise modifies the policy network's weights (θ) directly, leading to temporally correlated exploration.
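To make the contrast concrete, here is a minimal sketch of the two injection points. The policy object, its get_weights/set_weights and predict methods, and the assumption that the weights come back as a single flat array are illustrative, not tied to any particular library.

import numpy as np

def act_with_action_noise(policy, state, sigma, action_dim):
    # Action space noise: independent noise is added to the action
    # at every timestep, so the exploratory behavior changes step to step.
    return policy.predict(state) + np.random.normal(0.0, sigma, size=action_dim)

def perturb_policy_for_episode(policy, sigma):
    # Parameter space noise: noise is added to the weights once, and the
    # perturbed policy then acts consistently for the whole episode.
    theta = policy.get_weights()  # assumed to be one flat array of weights
    epsilon = np.random.normal(0.0, sigma, size=theta.shape)
    policy.set_weights(theta + epsilon)
    return policy

Resampling ϵ only between episodes (rather than at every step) is what produces the temporally correlated exploration described above.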
Implementing parameter space noise requires careful choices, such as the noise scale σ and how often the perturbation is resampled. The conceptual training loop below resamples the noise once per episode and keeps a separate perturbed copy of the policy for acting:
# Conceptual Python snippet for parameter noise in an episodic training loop
import numpy as np

def train_agent_with_param_noise(agent, env, num_episodes, noise_scale):
    for episode in range(num_episodes):
        # Get the current (non-perturbed) policy weights
        # (assumes get_weights() returns a single flat array)
        policy_params = agent.policy.get_weights()

        # Sample one parameter noise vector for the entire episode
        param_noise_vector = np.random.normal(0, noise_scale, size=policy_params.shape)

        # Create a perturbed copy of the policy, used only for exploration
        perturbed_policy_params = policy_params + param_noise_vector
        agent.exploration_policy.set_weights(perturbed_policy_params)

        state = env.reset()
        done = False
        episode_reward = 0

        while not done:
            # Act using the perturbed policy
            action = agent.exploration_policy.predict(state)
            next_state, reward, done, _ = env.step(action)

            # Store the experience using the action actually taken
            agent.memory.store(state, action, reward, next_state, done)

            # Learn using the original (non-perturbed) policy parameters
            if agent.memory.is_ready():
                agent.learn()  # updates agent.policy, not the perturbed copy

            state = next_state
            episode_reward += reward

        # Optional: adapt noise_scale based on an action distance metric
        # noise_scale = adapt_noise_scale(...)

        print(f"Episode: {episode}, Reward: {episode_reward}")
While effective, parameter space noise introduces its own challenges. The most visible one is choosing the noise scale σ: a perturbation of fixed size can change the policy's behavior by very different amounts depending on the network's parameterization and on how far training has progressed, so a fixed σ rarely stays appropriate for an entire run. This is why the loop above leaves room for adapting the scale, commonly by measuring how far the perturbed policy's actions drift from the unperturbed policy's actions and nudging σ toward a target distance.
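A minimal sketch of such an adaptation rule is shown below, assuming both policies expose a predict method. The target_distance threshold and the multiplicative factor are illustrative values, and the function name simply mirrors the adapt_noise_scale placeholder in the loop above.

import numpy as np

def adapt_noise_scale(policy, exploration_policy, recent_states, noise_scale,
                      target_distance=0.1, factor=1.01):
    # Compare the two policies' actions on a batch of recently visited states
    actions = np.array([policy.predict(s) for s in recent_states])
    perturbed_actions = np.array([exploration_policy.predict(s) for s in recent_states])
    distance = np.sqrt(np.mean((actions - perturbed_actions) ** 2))

    # If the perturbed policy behaves almost identically, increase the noise;
    # if it strays too far from the current policy, decrease it.
    if distance < target_distance:
        return noise_scale * factor
    return noise_scale / factor

Adapting σ this way keeps the exploration meaningful in action space even as the sensitivity of the network's weights changes during training.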
Parameter space noise offers a sophisticated mechanism for driving exploration, particularly suited for scenarios where temporally consistent exploratory behavior is beneficial, such as in continuous control tasks solved with deterministic policy gradient methods. It represents a valuable alternative or complement to action space noise and intrinsic motivation techniques in the toolkit for designing effective exploring agents.