Exploration is a topic relevant to multiple fields of machine learning and decision making, but especially to the two closely related areas of multi-armed bandit problems and reinforcement learning, both of which need to address the exploration vs. exploitation trade-off.

Exploration vs. Exploitation

In order to determine which arm yields the greatest rewards in a bandit problem, or which action might lead to the greatest total rewards in reinforcement learning, a learning system needs to experiment with different actions. It needs to spend time and other resources exploring various options.

However, once it has learnt something, the learning system faces two mutually opposed courses of action: should it make use of the knowledge it already has to maximize its rewards (exploitation), or should it keep trying new kinds of behaviour, even at the risk of acting suboptimally (exploration)? Clearly, a learning system needs to do at least some exploration to learn anything at all, but how much exploration is the right amount?

This choice is commonly referred to as the exploration vs. exploitation trade-off.
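
As a toy illustration of why neither extreme works well, consider a simulated three-armed Bernoulli bandit; the payout probabilities and the NumPy-based setup below are made up for the example. Committing to an arm after a single pull each risks locking in a suboptimal arm, while exploring uniformly at random forever keeps wasting pulls on arms already known to be bad.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.3, 0.5, 0.7])      # hidden success probability of each arm
n_steps = 1000

def pull(arm):
    # Reward 1 with probability p[arm], else 0.
    return float(rng.random() < p[arm])

# Extreme exploitation: try each arm once, then commit to the apparent best.
first_impressions = np.array([pull(a) for a in range(len(p))])
committed_arm = int(np.argmax(first_impressions))
exploit_only = sum(pull(committed_arm) for _ in range(n_steps))

# Extreme exploration: keep picking arms uniformly at random forever.
explore_only = sum(pull(int(rng.integers(len(p)))) for _ in range(n_steps))

print(f"commit after one pull per arm: {exploit_only:.0f} / {n_steps}")
print(f"uniform random forever:        {explore_only:.0f} / {n_steps}")
# Neither extreme is reliably good: the first may lock in a suboptimal arm,
# the second never concentrates its pulls on the best arm.
```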

Approximate Approaches

While one could, in theory, search for an optimal solution to the exploration vs. exploitation trade-off, this is generally extremely difficult to do. However, there are a number of approximate approaches [aima].

A brief overview of some basic approaches, namely:

  • ε-Greedy Algorithm;
  • UCB (Upper Confidence Bound) and Bayesian UCB;
  • Thompson Sampling.

along with sample implementations in Python, can be found at [lilianweng]; minimal illustrative sketches of these strategies are also given below.
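
The following is a rough sketch of the three strategies listed above on a simulated Bernoulli bandit (it is not the code from [lilianweng]); the arm probabilities, the value of ε, and the horizon are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.3, 0.5, 0.7])        # true success probabilities (hidden from the agent)
n_arms, n_steps = len(p), 5000

def pull(arm):
    # Reward 1 with probability p[arm], else 0.
    return float(rng.random() < p[arm])

def epsilon_greedy(eps=0.1):
    counts, values, total = np.zeros(n_arms), np.zeros(n_arms), 0.0
    for _ in range(n_steps):
        if rng.random() < eps:                   # explore: pick a random arm
            arm = int(rng.integers(n_arms))
        else:                                    # exploit: best current estimate
            arm = int(np.argmax(values))
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]   # incremental mean
        total += r
    return total

def ucb1():
    counts, values, total = np.zeros(n_arms), np.zeros(n_arms), 0.0
    for t in range(1, n_steps + 1):
        if t <= n_arms:                          # play each arm once first
            arm = t - 1
        else:                                    # optimism in the face of uncertainty
            arm = int(np.argmax(values + np.sqrt(2 * np.log(t) / counts)))
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]
        total += r
    return total

def thompson():
    alpha, beta, total = np.ones(n_arms), np.ones(n_arms), 0.0   # Beta(1, 1) priors
    for _ in range(n_steps):
        arm = int(np.argmax(rng.beta(alpha, beta)))  # sample a plausible mean per arm
        r = pull(arm)
        alpha[arm] += r                          # Bernoulli-Beta posterior update
        beta[arm] += 1 - r
        total += r
    return total

for name, strategy in [("epsilon-greedy", epsilon_greedy), ("UCB1", ucb1),
                       ("Thompson sampling", thompson)]:
    print(f"{name:>17}: total reward {strategy():.0f} over {n_steps} pulls")
```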

Other approaches, popular especially in reinforcement learning, include heuristic methods motivated by psychology, e.g. artificial curiosity: agents are given an intrinsic motive to revisit states with surprising transitions, where there is presumably still a lot to learn about the environment. Recent works of this kind include e.g. [curiosity1, curiosity2].
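
A schematic sketch of the general curiosity idea (not the specific architectures of [curiosity1] or [curiosity2]) might look as follows: the agent fits a simple forward model of the dynamics online and uses its prediction error as an intrinsic "surprise" reward, which decays as the environment becomes predictable. The class name, the toy linear dynamics, and the learning rate are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

class ForwardModel:
    """Predicts the next state from (state, action); prediction error is the curiosity bonus."""
    def __init__(self, state_dim, action_dim, lr=0.05):
        self.W = np.zeros((state_dim + action_dim, state_dim))
        self.lr = lr

    def intrinsic_reward(self, state, action, next_state):
        x = np.concatenate([state, action])
        pred = x @ self.W                          # predicted next state
        error = next_state - pred
        self.W += self.lr * np.outer(x, error)     # online least-squares update
        return float(np.sum(error ** 2))           # surprise = squared prediction error

# Toy usage on random transitions of made-up linear dynamics:
# the intrinsic reward shrinks as the forward model improves.
model = ForwardModel(state_dim=4, action_dim=2)
A = rng.normal(size=(6, 4))
for t in range(500):
    s, a = rng.normal(size=4), rng.normal(size=2)
    s_next = np.concatenate([s, a]) @ A
    bonus = model.intrinsic_reward(s, a, s_next)
    if t % 100 == 0:
        print(f"step {t:3d}: intrinsic reward {bonus:.3f}")
```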

Literature

  1. [aima] Russell, S.J. and Norvig, P., 2010. Artificial Intelligence: A Modern Approach, Third International Edition.
  2. [lilianweng] Lilian Weng. The Multi-Armed Bandit Problem and Its Solutions. https://lilianweng.github.io/lil-log/2018/01/23/the-multi-armed-bandit-problem-and-its-solutions.html#upper-confidence-bounds
  3. [curiosity1] Pathak, D., Agrawal, P., Efros, A.A. and Darrell, T., 2017. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 16-17).
  4. [curiosity2] Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T. and Efros, A.A., 2018. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355.