While simple exploration strategies like $\epsilon$-greedy encourage random actions, they often prove insufficient in environments with extensive state spaces or sparse rewards. Such methods explore inefficiently, potentially missing critical areas of the state space. Count-based exploration offers a more directed approach by explicitly incentivizing the agent to visit states or state-action pairs that it hasn't encountered frequently. The underlying principle is simple: "if you haven't been here often, it might be interesting."
Imagine an agent navigating a maze. An $\epsilon$-greedy agent might repeatedly stumble down familiar corridors. A count-based agent, however, would feel an "attraction" towards paths it hasn't explored much. This attraction comes in the form of an exploration bonus added to the environment's reward signal. The less a state (or state-action pair) has been visited, the higher the bonus for visiting it again.
In simple environments with a finite, manageable number of states and actions (the "tabular" case), we can directly count visits. Let $N(s)$ be the number of times state $s$ has been visited, or $N(s,a)$ be the number of times action $a$ has been taken in state $s$. A common form for the exploration bonus $r^{+}$ added to the extrinsic reward $r_e$ is based on the principle of "Optimism in the Face of Uncertainty":

$$r^{+}(s,a) = \frac{\beta}{\sqrt{N(s,a)}}$$
Here, $\beta$ is a hyperparameter controlling the intensity of exploration. The total reward used for learning becomes $r_{total} = r_e + r^{+}(s)$ (or $r_{total} = r_e + r^{+}(s,a)$). As a state is visited more often, $N(s)$ increases, and the exploration bonus diminishes, causing the agent to eventually focus on exploiting the learned policy based on the actual rewards $r_e$. Algorithms like MBIE-EB (Model-Based Interval Estimation with Exploration Bonus) formalize this idea.
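As a minimal sketch of the tabular case, the snippet below keeps a dictionary of visit counts and returns the bonus-augmented reward; the class name and the `beta` value are illustrative choices, not taken from any particular implementation.

```python
from collections import defaultdict
import math

class CountBasedBonus:
    """Tabular visit counts with an optimism-style exploration bonus."""

    def __init__(self, beta=0.1):
        self.beta = beta                    # exploration intensity (hyperparameter)
        self.counts = defaultdict(int)      # N(s, a); unseen pairs start at zero

    def augmented_reward(self, state, action, extrinsic_reward):
        """Update N(s, a) and return r_e + beta / sqrt(N(s, a))."""
        self.counts[(state, action)] += 1
        bonus = self.beta / math.sqrt(self.counts[(state, action)])
        return extrinsic_reward + bonus

# Example: the first visit to a state-action pair earns the largest bonus.
bonus_model = CountBasedBonus(beta=0.1)
print(bonus_model.augmented_reward("s0", "left", 0.0))  # 0.0 + 0.1 / sqrt(1) = 0.1
print(bonus_model.augmented_reward("s0", "left", 0.0))  # 0.0 + 0.1 / sqrt(2) ~ 0.071
```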
Direct counting breaks down in large or continuous state spaces. Why? Because the agent might never visit the exact same high-dimensional state (like an image from a camera) twice. We need a way to generalize the idea of visitation counts to unseen but similar states. This leads to the concept of pseudo-counts.
Instead of exact counts, we use a density model $\rho(s)$ trained on the history of visited states. This model estimates the probability density around any given state $s$.
The core idea is to derive a pseudo-count $\hat{N}(s)$ from this density estimate $\rho(s)$. A pseudo-count approximates how many times states "similar" to $s$ have been visited.
How can we connect density to a pseudo-count $\hat{N}(s)$? Consider a sequence of states $s_1, s_2, \dots, s_n$. A density model trained on this sequence might learn $\rho_n(s)$. Now, if we observe a new state $s$, we can ask: what is the probability $\rho_n(s)$ according to the model? This probability reflects how "familiar" the new state is.
A critical insight comes from relating the probability assigned by the density model to the probability assigned by simply recording states and predicting their empirical frequencies. It can be shown that under certain assumptions about the density model learning process, the prediction probability relates to a pseudo-count $\hat{N}_n(s)$ via:

$$\rho_n(s) = \frac{\hat{N}_n(s) + \epsilon}{n + \epsilon}$$
where $\epsilon$ is a small smoothing constant. More practically relevant is the relationship derived from the increase in density upon observing $s$ again. Let $\rho_n(s)$ be the density estimate after $n$ observations, and $\rho'_n(s)$ be the estimate after observing $s$ one more time (the recoding probability). The pseudo-count $\hat{N}_n(s)$ should satisfy:

$$\hat{N}_n(s) = \frac{\rho_n(s)\,\bigl(1 - \rho'_n(s)\bigr)}{\rho'_n(s) - \rho_n(s)}$$
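One way to make those "certain assumptions" concrete (a sketch of the standard argument, assuming the model's predictions behave like empirical frequencies built from a pseudo-count $\hat{N}_n(s)$ and a pseudo-count total $\hat{n}$) is to write:

$$\rho_n(s) = \frac{\hat{N}_n(s)}{\hat{n}}, \qquad \rho'_n(s) = \frac{\hat{N}_n(s) + 1}{\hat{n} + 1}$$

Solving this pair of equations for $\hat{N}_n(s)$ eliminates $\hat{n}$ and yields exactly the expression above.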
This means observing a state again increases its density more if its current pseudo-count is low. While calculating this difference directly is often complex, several density estimation methods allow us to approximate $\hat{N}_n(s)$ or directly use $\rho_n(s)$ and $\rho'_n(s)$ to define an exploration bonus.
The exploration bonus is then calculated similarly to the tabular case, but using the pseudo-count:

$$r^{+}(s) = \frac{\beta}{\sqrt{\hat{N}_n(s)}}$$
This bonus encourages the agent to move towards states that the density model considers novel (low density, low pseudo-count).
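To make the chain from density estimate to bonus concrete, here is a minimal sketch in which a smoothed count-based density model stands in for a learned generative model; the class name, `n_bins`, `eps`, and `beta` values are illustrative assumptions, not part of any published implementation.

```python
from collections import defaultdict
import math

class PseudoCountBonus:
    """Toy pseudo-count bonus built on a smoothed count-based density model.

    The `_density` method stands in for a learned generative model (e.g. a CTS
    or neural density estimator); it only illustrates the
    rho -> rho' -> pseudo-count -> bonus pipeline.
    """

    def __init__(self, beta=0.1, eps=1e-3, n_bins=1000):
        self.beta = beta
        self.eps = eps            # smoothing constant
        self.n_bins = n_bins      # assumed size of the discretized state space
        self.counts = defaultdict(float)
        self.total = 0.0

    def _density(self, key, extra=0.0):
        # Smoothed empirical density; `extra=1` simulates training on `key` once
        # more, which is how the recoding probability rho'_n(s) is obtained.
        return (self.counts[key] + extra + self.eps) / (self.total + extra + self.eps * self.n_bins)

    def bonus(self, state_key):
        rho = self._density(state_key)                  # rho_n(s)
        rho_prime = self._density(state_key, extra=1)   # rho'_n(s), recoding probability
        pseudo_count = rho * (1.0 - rho_prime) / max(rho_prime - rho, 1e-8)
        # Now actually observe the state (i.e. train the "model" on it).
        self.counts[state_key] += 1
        self.total += 1
        return self.beta / math.sqrt(pseudo_count + 0.01)  # 0.01 guards against a zero pseudo-count

# A brand-new state key earns a large bonus; a frequently seen one earns a small bonus.
pc = PseudoCountBonus()
print(pc.bonus("roomA"))      # first visit: large bonus
for _ in range(50):
    pc.bonus("roomA")
print(pc.bonus("roomA"))      # after many visits: pseudo-count near 50, small bonus
```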
Several techniques exist to implement density estimation for pseudo-counts in high-dimensional spaces:
Hashing-Based Methods: These methods use hash functions or learned feature embeddings to map high-dimensional states $s$ to a lower-dimensional (often discrete) representation $\phi(s)$. Counts are then maintained in this compressed space (a sketch of this approach appears after this list).
Generative Models: Deep generative models, trained on the stream of observed states $s_t$, can naturally estimate density or novelty.
Context-Tree Switching (CTS) Models: These are powerful sequence models that can build adaptive density estimates $\rho_n(s)$ from the stream of observed states. They have been used effectively for pseudo-count estimation in RL.
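As a concrete instance of the hashing-based approach in the first item above, the sketch below discretizes states with a SimHash-style random projection and keeps ordinary counts per hash code (in the spirit of the "#Exploration" method); the projection width `n_bits` and the `beta` value are illustrative.

```python
from collections import defaultdict
import math
import numpy as np

class SimHashCounter:
    """SimHash-style discretization: phi(s) = sign(A s); counts kept per hash code."""

    def __init__(self, state_dim, n_bits=16, beta=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((n_bits, state_dim))  # fixed random projection
        self.beta = beta
        self.counts = defaultdict(int)

    def _hash(self, state):
        # Binary code: which side of each random hyperplane the state falls on.
        return tuple((self.A @ np.asarray(state, dtype=float) > 0).astype(int))

    def bonus(self, state):
        key = self._hash(state)
        self.counts[key] += 1
        return self.beta / math.sqrt(self.counts[key])

# Nearby states tend to share a hash code, so they share a count.
counter = SimHashCounter(state_dim=4, n_bits=8)
s1 = [0.10, -0.30, 0.70, 0.00]
s2 = [0.11, -0.29, 0.69, 0.01]   # almost the same state
print(counter.bonus(s1))   # first visit to this code: bonus = beta / sqrt(1)
print(counter.bonus(s2))   # likely maps to the same code: bonus = beta / sqrt(2)
```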
Regardless of how the pseudo-count $\hat{N}(s)$ (or a proxy for novelty) is calculated, the exploration bonus $r^{+}$ is typically added to the extrinsic reward $r_e$ received from the environment. The learning algorithm (e.g., DQN, PPO) then uses the combined reward $r_{total} = r_e + r^{+}$ to update its value functions or policies.
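The snippet below sketches where this combination usually happens in an agent's interaction step; `env`, `density_model`, and `buffer` in the comments are placeholders for whatever RL implementation is in use, and the `0.01` constant is an illustrative guard against a zero pseudo-count.

```python
import math

def augment_reward(extrinsic_reward, pseudo_count, beta=0.1):
    """Combine the environment reward with a count-based exploration bonus."""
    bonus = beta / math.sqrt(pseudo_count + 0.01)   # small constant avoids division by zero
    return extrinsic_reward + bonus

# Inside the usual interaction loop, the transition stored for learning
# (and later consumed by DQN, PPO, etc.) carries the combined reward:
#
#   next_state, r_env, done, info = env.step(action)
#   r_total = augment_reward(r_env, density_model.pseudo_count(next_state))
#   buffer.add(state, action, r_total, next_state, done)   # learner only ever sees r_total
#
# env, density_model, and buffer are placeholders for the surrounding agent code.
print(augment_reward(0.0, pseudo_count=0.5))     # novel state: large bonus
print(augment_reward(0.0, pseudo_count=500.0))   # familiar state: tiny bonus
```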
Flow diagram illustrating how pseudo-counts generate exploration bonuses within an RL loop. The state is fed into a density model, which yields a pseudo-count used to compute a bonus. This bonus modifies the reward signal used by the RL agent.
Count-based exploration using pseudo-counts provides a principled and often effective way to drive exploration in complex environments. By formalizing the intuition of visiting less familiar states, these methods enable agents to systematically gather information and tackle challenging problems where random exploration fails. They represent a significant advance over simple heuristics, pushing towards more intelligent exploration behavior.