Bandit Problems
(Multi-armed) bandit problems represent a class of decision-making problems in which each decision has a one-time nature: choosing an action does not change the state of the problem, so there is no sequence of actions to plan. A typical example is selecting which advert banner to show on your website in order to attract more clicks.
The One-Armed and Multi-Armed Bandit
The name comes from a slot machine such as you might encounter in Las Vegas. A one-armed bandit is a machine with a single lever: you insert a coin, pull the lever, and collect your winnings (or not), depending on the outcome [aima]. A multi-armed bandit would be a slot machine with multiple arms; in that case, we would like to know which arm tends to yield the greatest winnings on average.
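To make the setting concrete, here is a minimal Python sketch of a multi-armed bandit, under the simplifying assumption that each arm pays out 1 coin with a fixed, hidden probability (the probabilities below are made up purely for illustration):

```python
import random

class BernoulliBandit:
    """A toy multi-armed bandit: each arm pays out 1 with a fixed, hidden probability."""

    def __init__(self, payout_probs):
        self.payout_probs = payout_probs  # hidden from the player in a real casino

    def pull(self, arm):
        """Pull one arm and return the winnings for this round (1 or 0)."""
        return 1 if random.random() < self.payout_probs[arm] else 0

# Three arms; arm 2 is the best on average, but the player cannot see the probabilities.
bandit = BernoulliBandit([0.2, 0.5, 0.7])
print(bandit.pull(0), bandit.pull(2))
```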
Exploration vs. Exploitation
However, in order to determine which arm yields the greatest rewards, we need to spend time and money. While playing, we therefore need to balance exploration (learning more about the possible outcomes of all the individual levers) and exploitation (making use of what we already know to maximize our winnings). This is commonly referred to as the exploration vs. exploitation trade-off.
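One simple and widely used way to strike this balance is an epsilon-greedy strategy: with a small probability epsilon we explore a random arm, and otherwise we exploit the arm with the best estimated payout so far. The sketch below assumes the Bernoulli-arm setting from the previous snippet; the arm probabilities and the value of epsilon are arbitrary choices for illustration.

```python
import random

def epsilon_greedy(pull, n_arms, n_rounds=1000, epsilon=0.1):
    """Play n_rounds, exploring a random arm with probability epsilon each round."""
    counts = [0] * n_arms    # how often each arm has been pulled
    values = [0.0] * n_arms  # running average winnings per arm
    total = 0
    for _ in range(n_rounds):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)                     # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]    # incremental mean
        total += reward
    return total, values

# Hypothetical arm probabilities, purely for illustration.
probs = [0.2, 0.5, 0.7]
total, estimates = epsilon_greedy(lambda a: 1 if random.random() < probs[a] else 0, len(probs))
print(total, estimates)
```

With a small epsilon the agent spends most rounds on the arm that currently looks best, while still occasionally gathering information about the others.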
Contextual Bandits
Bandit problems can be:
- Context-free: the agent is given no input and just needs to select an action (e.g. which banner to show);
- Contextual: the agent is also given some input as context (e.g. some information about the user who is going to see the banner); a minimal sketch contrasting the two settings follows this list.
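The following sketch contrasts the two interfaces using made-up click-rate estimates and a simple linear scoring model; the numbers, banner weights, and user features are assumptions for illustration only:

```python
def choose_banner_context_free(estimated_click_rates):
    """Context-free: pick the banner with the highest estimated click rate; no user input."""
    return max(range(len(estimated_click_rates)), key=lambda b: estimated_click_rates[b])

def choose_banner_contextual(user_features, banner_weights):
    """Contextual: score each banner against the user's features and pick the best one."""
    scores = [sum(w * x for w, x in zip(weights, user_features)) for weights in banner_weights]
    return max(range(len(scores)), key=lambda b: scores[b])

# Hypothetical numbers, purely for illustration.
print(choose_banner_context_free([0.02, 0.05, 0.03]))                                 # -> 1
print(choose_banner_contextual([1.0, 0.3], [[0.1, 0.0], [0.0, 0.2], [0.05, 0.05]]))   # -> 0
```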
For further reference, see the tutorial [thompson_tutorial] or the textbook [bandit_textbook], which also covers topics such as Lipschitz bandits, contextual bandits, adversarial bandits, and more.
Literature
- [aima] Russell, S.J. and Norvig, P., 2010. Artificial Intelligence: A Modern Approach, Third International Edition.
- [thompson_tutorial] Russo, D., Van Roy, B., Kazerouni, A., Osband, I. and Wen, Z., 2017. A tutorial on Thompson sampling. arXiv preprint arXiv:1707.02038.
- [bandit_textbook] Slivkins, A., 2019. Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272.