While methods like the Upper Confidence Bound (UCB) algorithm tackle exploration by acting optimistically in the face of uncertainty, Thompson Sampling (TS), also known as posterior sampling, offers a different, probabilistically grounded approach. Instead of selecting the action with the highest optimistic estimate, Thompson Sampling selects actions according to their probability of being the optimal action, given the currently available data. It elegantly balances exploration and exploitation through Bayesian reasoning.
The Bayesian Heart of Thompson Sampling
At its core, Thompson Sampling is a Bayesian algorithm. It maintains a belief, represented as a probability distribution (the posterior distribution), over the unknown quantity of interest. In the context of Reinforcement Learning, this quantity is often the value of taking a specific action in a given state, Q(s,a), or perhaps the parameters of a model predicting this value.
The general procedure follows these steps:
- Maintain Beliefs: For each possible action a (in a given state s, if context-dependent), maintain a posterior probability distribution P(θ_a | History) over the parameters θ_a that determine the action's value or reward. Initially, this is based on a prior distribution, reflecting initial beliefs before any interaction.
- Sample from Beliefs: At each decision point, sample a parameter value θ̃_a from the current posterior distribution for each action a.
- Act: Calculate the expected value (e.g., Q-value or immediate reward) associated with each sampled parameter θ̃_a. Select the action a* that yields the highest sampled value.
a* = argmax_a f(θ̃_a)
where f(θ̃_a) is the value derived from the sampled parameters (e.g., a sampled Q-value).
- Update Beliefs: Execute action a*, observe the outcome (reward r, next state s′), and update the posterior distribution for the chosen action a* using Bayes' theorem, incorporating the new observation.
P(θ_a* | History_new) ∝ P(Observation | θ_a*) × P(θ_a* | History_old)
This cycle repeats. Actions whose posterior distributions indicate a high probability of being optimal are selected more often (exploitation), while actions whose posteriors are wide (high uncertainty) but still include potentially high values are sampled occasionally (exploration). This is probability matching: the probability of selecting an action equals the algorithm's current belief that the action is the best one.
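To make the loop concrete, here is a minimal sketch of Thompson Sampling for a Bernoulli multi-armed bandit, where each arm's unknown success probability plays the role of θ_a and a conjugate Beta distribution serves as the posterior. The arm probabilities, step count, and random seed are illustrative assumptions rather than values from any particular benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative Bernoulli bandit: true success probabilities (unknown to the agent).
true_probs = [0.3, 0.5, 0.7]
n_arms = len(true_probs)

# Beta(1, 1) priors: alpha tracks successes + 1, beta tracks failures + 1.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

for t in range(1000):
    # 1. Sample a parameter value θ̃_a from each arm's current posterior.
    theta_samples = rng.beta(alpha, beta)
    # 2. Act greedily with respect to the sampled values.
    a = int(np.argmax(theta_samples))
    # 3. Observe a Bernoulli reward from the chosen arm.
    r = rng.random() < true_probs[a]
    # 4. Conjugate Beta-Bernoulli update of the chosen arm's posterior.
    alpha[a] += r
    beta[a] += 1 - r

print("Posterior means:", alpha / (alpha + beta))
```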
Conceptual flow of Thompson Sampling in an RL context.
Thompson Sampling in Reinforcement Learning
Applying Thompson Sampling directly to estimate Q(s,a) in large state spaces presents challenges, particularly when using deep neural networks as function approximators. Maintaining and sampling from exact posterior distributions over the weights of a deep network, or directly over the Q-values themselves, is often computationally intractable.
Several approximations have been developed:
- Bootstrapped DQN: Although best known as an ensemble method, Bootstrapped DQN can also be viewed as approximating a posterior distribution. Training multiple Q-network "heads" on different bootstrapped samples of the experience replay buffer provides a diversity of estimates. At decision time, one head is randomly selected and the agent acts greedily with respect to that head's Q-values, approximating a draw from the posterior; a minimal head-selection sketch appears after this list.
- Bayesian Neural Networks (BNNs): Techniques like variational inference (VI) or Markov Chain Monte Carlo (MCMC) methods can be used to approximate the posterior distribution over network weights. Sampling weights from this approximate posterior yields samples of the Q-function. However, BNNs add significant computational overhead.
- Dropout Approximation: Using dropout not just during training but also at test time (sometimes called Monte Carlo Dropout) can be interpreted as an approximation to Bayesian inference. Performing multiple forward passes with different dropout masks generates samples from an approximate predictive distribution of Q-values; a corresponding sketch also appears after this list.
- Specific Parametric Distributions: Assume a specific form for the Q-value distribution (e.g., Gaussian) and learn its parameters (mean and variance). This simplifies maintaining the belief but relies on the validity of the distributional assumption.
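As a rough illustration of the bootstrapped-heads idea, the sketch below shows only the action-selection side in PyTorch: a shared torso feeds several Q-value heads, one head is sampled at random, and the agent acts greedily with respect to it. The layer sizes, number of heads, and the `select_action` helper are illustrative assumptions; training each head on its own bootstrapped data is omitted.

```python
import random
import torch
import torch.nn as nn

class BootstrappedQNet(nn.Module):
    """Several independent Q-value heads on top of a shared torso (illustrative sizes)."""
    def __init__(self, obs_dim: int, n_actions: int, n_heads: int = 10):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.heads = nn.ModuleList(
            nn.Linear(64, n_actions) for _ in range(n_heads)
        )

    def forward(self, obs: torch.Tensor, head: int) -> torch.Tensor:
        # Q-values according to one particular head.
        return self.heads[head](self.torso(obs))

def select_action(net: BootstrappedQNet, obs: torch.Tensor, head: int) -> int:
    """Act greedily with respect to the Q-values of the sampled head."""
    with torch.no_grad():
        q = net(obs, head)
    return int(q.argmax().item())

# Usage sketch: sample one head (e.g., per episode, as in Bootstrapped DQN)
# and follow it greedily, approximating a draw from the posterior over Q-functions.
net = BootstrappedQNet(obs_dim=4, n_actions=2)
head = random.randrange(len(net.heads))   # "sample from beliefs"
obs = torch.zeros(4)                      # placeholder observation
action = select_action(net, obs, head)
```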
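The Monte Carlo Dropout approximation can be sketched similarly: keeping dropout active at decision time makes each forward pass a stochastic sample of the Q-values, and acting greedily on a single such sample gives Thompson-style behaviour. The network architecture, dropout rate, and the `thompson_action` helper below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropoutQNet(nn.Module):
    """Q-network whose forward pass stays stochastic because dropout remains active."""
    def __init__(self, obs_dim: int, n_actions: int, p: float = 0.1):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, 64)
        self.fc2 = nn.Linear(64, n_actions)
        self.p = p

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        h = F.relu(self.fc1(obs))
        # training=True keeps dropout on even at decision time (MC Dropout).
        h = F.dropout(h, p=self.p, training=True)
        return self.fc2(h)

def thompson_action(net: DropoutQNet, obs: torch.Tensor) -> int:
    """One stochastic forward pass acts as one sample from the approximate
    posterior over Q-values; act greedily with respect to that sample."""
    with torch.no_grad():
        q_sample = net(obs)
    return int(q_sample.argmax().item())

net = DropoutQNet(obs_dim=4, n_actions=2)
action = thompson_action(net, torch.zeros(4))
```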
Comparison with UCB
- Mechanism: UCB deterministically selects actions based on an upper confidence bound, explicitly calculating a bonus for uncertainty. TS stochastically selects actions by sampling from the posterior belief distribution; exploration arises implicitly from the uncertainty captured in the distribution's width.
- Performance: Empirically, Thompson Sampling often performs very well, sometimes better than UCB variants, particularly in non-stationary environments or when prior knowledge is effectively incorporated.
- Complexity: Implementing exact TS can be more complex than UCB, especially in deep RL due to the challenges of maintaining and sampling from posteriors. Approximate methods like bootstrapping or dropout reduce this complexity but introduce their own approximations.
Advantages and Disadvantages
Advantages:
- Principled Bayesian Approach: Provides a coherent framework for incorporating prior knowledge and updating beliefs based on evidence.
- Effective Exploration: Often achieves state-of-the-art performance by efficiently directing exploration towards uncertain but potentially rewarding actions.
- Natural Balancing: The probability matching mechanism inherently balances exploration and exploitation, without the explicit uncertainty bonus terms that UCB requires.
Disadvantages:
- Computational Cost: Maintaining and sampling from posterior distributions can be computationally expensive, especially for complex models like deep neural networks.
- Approximation Challenges: Practical implementations often rely on approximations (like bootstrapping or variational inference) which might deviate from the true Bayesian posterior.
- Sensitivity to Priors: The choice of prior distribution can influence initial performance, although this influence typically diminishes as more data is collected.
Thompson Sampling represents a powerful, statistically grounded method for tackling the exploration-exploitation dilemma. While exact implementations can be challenging in complex RL settings, its conceptual elegance and strong empirical performance, often achieved via practical approximations, make it a significant tool in the advanced RL practitioner's arsenal.