While simple exploration strategies like ϵ-greedy encourage random actions, they often prove insufficient in environments with vast state spaces or sparse rewards. Such methods explore inefficiently, potentially missing critical areas of the state space. Count-based exploration offers a more directed approach by explicitly incentivizing the agent to visit states or state-action pairs that it hasn't encountered frequently. The underlying principle is simple: "if you haven't been here often, it might be interesting."
Imagine an agent navigating a maze. An ϵ-greedy agent might repeatedly stumble down familiar corridors. A count-based agent, however, would feel an "attraction" towards paths it hasn't explored much. This attraction comes in the form of an exploration bonus added to the environment's reward signal. The less a state (or state-action pair) has been visited, the higher the bonus for visiting it again.
In simple environments with a finite, manageable number of states and actions (the "tabular" case), we can count visits directly. Let $N(s)$ be the number of times state $s$ has been visited, or $N(s,a)$ the number of times action $a$ has been taken in state $s$. A common form for the exploration bonus $r^+$ added to the extrinsic reward $r_e$ is based on the principle of "Optimism in the Face of Uncertainty":
$$r^+(s) = \frac{\beta}{\sqrt{N(s)}} \quad \text{or} \quad r^+(s,a) = \frac{\beta}{\sqrt{N(s,a)}}$$

Here, $\beta$ is a hyperparameter controlling the intensity of exploration. The total reward used for learning becomes $r = r_e + r^+(s)$ (or $r = r_e + r^+(s,a)$). As a state is visited more often, $N(s)$ increases and the exploration bonus diminishes, causing the agent to eventually focus on exploiting the learned policy based on the actual rewards $r_e$. Algorithms like MBIE-EB (Model-Based Interval Estimation with Exploration Bonus) formalize this idea.
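As a concrete reference point, here is a minimal sketch of tabular counting with an MBIE-EB style bonus. The class name, the `beta` default, and the `defaultdict` bookkeeping are illustrative choices rather than a prescribed implementation.

```python
from collections import defaultdict

class TabularCountBonus:
    """Maintains visit counts N(s, a) and returns an MBIE-EB style bonus."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)  # maps (state, action) -> N(s, a)

    def bonus(self, state, action):
        """Record a visit to (s, a) and return beta / sqrt(N(s, a))."""
        self.counts[(state, action)] += 1
        return self.beta / self.counts[(state, action)] ** 0.5
```

The learner would then use $r_e$ plus the returned bonus as its training reward, exactly as described above.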
Direct counting breaks down in large or continuous state spaces. Why? Because the agent might never visit the exact same high-dimensional state (like an image from a camera) twice. We need a way to generalize the notion of visitation counts to unseen but similar states. This leads to the concept of pseudo-counts.
Instead of exact counts, we use a density model $p(s)$ trained on the history of visited states. This model estimates the probability density around any given state $s$.
The core idea is to derive a pseudo-count $\hat{N}(s)$ from this density estimate $p(s)$. A pseudo-count $\hat{N}(s)$ approximates how many times states "similar" to $s$ have been visited.
How can we connect the density $p(s)$ to a pseudo-count $\hat{N}(s)$? Consider a sequence of states $s_1, s_2, \dots, s_n$. A density model trained on this sequence might learn $p(s)$. Now, if we observe a new state $s_{n+1}$, we can ask: what is the probability $p(s_{n+1})$ according to the model? This probability reflects how "familiar" the new state is.
A key insight comes from relating the probability assigned by the density model $p(s)$ to the probability assigned by simply recording and predicting states, $p_{\text{REC}}(s)$. It can be shown that, under certain assumptions about the density model's learning process, the prediction probability $p(s)$ relates to a pseudo-count $\hat{N}(s)$ via:

$$p(s) \approx \frac{\hat{N}(s) + \delta}{\sum_{s'} \left(\hat{N}(s') + \delta\right)}$$

where $\delta$ is a small smoothing constant. More practically relevant is the relationship derived from the increase in density upon observing $s$ again. Let $p_n(s)$ be the density estimate after $n$ observations, and $p_{n+1}(s)$ the estimate after observing $s$ one more time (conceptually). The pseudo-count $\hat{N}(s)$ should satisfy:

$$p_{n+1}(s) - p_n(s) \propto \frac{1}{\hat{N}(s)}$$

This means observing a state again increases its density more when its current pseudo-count is low. While calculating this difference directly is often complex, several density estimation methods allow us to approximate $\hat{N}(s)$, or to use $p(s)$ directly to define an exploration bonus.
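To make this concrete, one common definition (following the original pseudo-count derivation) treats $p_n(s) = \hat{N}(s)/\hat{n}$ and the "recoding" probability $p_{n+1}(s) = (\hat{N}(s)+1)/(\hat{n}+1)$ as two readouts of the same density model and solves for $\hat{N}(s)$. The helper below is a sketch of that algebra; the `eps` guard against a non-increasing density is an added safeguard for illustration, not part of the definition.

```python
def pseudo_count(p_before, p_after, eps=1e-8):
    """Pseudo-count implied by a density model's estimates for a state s.

    p_before: p_n(s), the density assigned to s before observing it again.
    p_after:  p_{n+1}(s), the density after one more (conceptual) visit to s.

    Solving p_n(s) = N_hat / n_hat and p_{n+1}(s) = (N_hat + 1) / (n_hat + 1)
    for N_hat yields the closed form below.
    """
    return p_before * (1.0 - p_after) / max(p_after - p_before, eps)
```

For example, if the model assigns $p_n(s) = 0.01$ before and $p_{n+1}(s) = 0.012$ after the conceptual extra visit, the implied pseudo-count is roughly $0.01 \times 0.988 / 0.002 \approx 4.9$.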
The exploration bonus is then calculated similarly to the tabular case, but using the pseudo-count:
$$r^+(s) = \frac{\beta}{\sqrt{\hat{N}(s)}}$$

This bonus encourages the agent to move towards states that the density model considers novel (low density, low pseudo-count).
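For instance, with $\beta = 0.1$, a rarely visited state with $\hat{N}(s) = 4$ earns a bonus of $0.1/\sqrt{4} = 0.05$, while a familiar state with $\hat{N}(s) = 400$ earns only $0.1/\sqrt{400} = 0.005$, a ten-fold difference that pulls the agent toward the less familiar state.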
Several techniques exist to implement density estimation for pseudo-counts in high-dimensional spaces:
Hashing-Based Methods: These methods use hash functions or learned feature embeddings $\phi(s)$ to map high-dimensional states $s$ to a lower-dimensional (often discrete) representation $z = \phi(s)$. Counts $N(z)$ are then maintained in this compressed space; a minimal sketch of this approach appears after this list.
Generative Models: Deep generative models, trained on the stream of observed states $s_t$, can estimate the density $p(s)$ (or a related novelty signal) directly from their likelihoods.
Context-Tree Switching (CTS) Models: These are sequence models that build adaptive density estimates over observations, modeling structured states such as image frames element by element. CTS-based density models were used effectively in early pseudo-count work on Atari games.
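As a concrete example of the first option, the sketch below uses a SimHash-style random projection to discretize states and count hash-code visits. The projection width `k`, the Gaussian projection matrix, and the `beta` value are illustrative assumptions in the spirit of hashing-based exploration, not a fixed recipe.

```python
import numpy as np
from collections import defaultdict

class SimHashCounter:
    """Counts visits to hashed state codes; states that map to the same
    binary code share a single count."""

    def __init__(self, state_dim, k=32, beta=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.projection = rng.normal(size=(k, state_dim))  # random hyperplanes
        self.beta = beta
        self.counts = defaultdict(int)

    def bonus(self, state):
        """Hash the state, update its count, and return beta / sqrt(N(z))."""
        code = tuple(self.projection @ np.asarray(state, dtype=float) > 0)
        self.counts[code] += 1
        return self.beta / np.sqrt(self.counts[code])
```

A coarser hash (smaller `k`) spreads each count over more states, while a finer hash behaves closer to exact counting, so `k` controls how aggressively novelty generalizes.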
Regardless of how the pseudo-count $\hat{N}(s)$ (or a proxy for novelty) is calculated, the exploration bonus $r^+ = \beta / \sqrt{\hat{N}(s)}$ is typically added to the extrinsic reward $r_e$ received from the environment. The learning algorithm (e.g., DQN, PPO) then uses the combined reward $r = r_e + r^+$ to update its value functions or policies.
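The integration point looks the same whichever learner is used. The fragment below sketches where the bonus enters a generic interaction loop; `env`, `agent`, and `novelty` are placeholders for your environment, RL algorithm, and pseudo-count module rather than any specific library's API.

```python
# Generic interaction loop with reward augmentation.
# `env`, `agent`, and `novelty` are placeholders (e.g. a Gym-style
# environment, a DQN/PPO learner, and one of the counters sketched above).

state = env.reset()
done = False
while not done:
    action = agent.act(state)
    next_state, extrinsic_reward, done, info = env.step(action)

    bonus = novelty.bonus(next_state)         # beta / sqrt(N_hat(s'))
    total_reward = extrinsic_reward + bonus   # r = r_e + r^+

    # The learner only ever sees the combined reward.
    agent.observe(state, action, total_reward, next_state, done)
    state = next_state
```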
Flow diagram illustrating how pseudo-counts generate exploration bonuses within an RL loop. The state is fed into a density model, which yields a pseudo-count used to compute a bonus. This bonus modifies the reward signal used by the RL agent.
Count-based exploration using pseudo-counts provides a principled and often effective way to drive exploration in complex environments. By formalizing the intuition of visiting less familiar states, these methods enable agents to systematically gather information and tackle challenging problems where random exploration fails. They represent a significant step beyond simple heuristics, pushing towards more intelligent exploration behavior.