In a separate entry, we described Monte Carlo search as an instance of search with an n-step look-ahead that uses random rollouts to evaluate the non-terminal states at its leaves. One disadvantage of such Monte Carlo search is that it devotes the same amount of attention to every part of the search tree. In some cases, even a few rollouts clearly indicate that a certain branch does not lead to good results. Monte Carlo tree search (MCTS) was therefore developed: it performs only a single rollout at a time and records the results in a gradually growing search tree. The values already recorded in the tree then help to identify the parts of the tree that would be the most useful to search next [essentials].
MCTS is a very powerful algorithm that is commonly used in game-playing programs. It can be combined with heuristics or with reinforcement learning to obtain even stronger results; it also forms part of modern players such as AlphaGo and AlphaZero [alphago,alphazero], which combine it with deep reinforcement learning.
MCTS first inserts the root node for the initial state s0 into the search tree. It then keeps iterating through the following four steps [essentials] (a minimal code sketch follows the list):
- Selection: Starting from the root, the tree is descended by repeatedly moving to the node with the greatest potential (taking into account, e.g., the node’s visit count and its estimated value), until a node is reached whose successors are not all in the search tree yet (i.e. it has not been fully expanded).
- Expansion: The selected node is expanded, i.e. its successors are added to the search tree, and one of them is selected at random.
- Simulation: A rollout is performed from the selected node: random actions are picked for the agents until a terminal state is reached, where the utility (win/loss, score, …) of each player is determined.
- Update: The computed utility is used to update the value estimates: it is backpropagated from the leaf node back to the root, and the visit count of every node along the path is incremented.
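The sketch below illustrates this loop in Python for a two-player zero-sum game. The `State` interface (`legal_actions`, `step`, `is_terminal`, `utility`, `current_player`) is a hypothetical stand-in for whatever game representation is at hand, and selection uses the UCT criterion discussed below; treat it as an illustration under these assumptions, not a reference implementation.

```python
import math
import random

class Node:
    """One node of the gradually growing search tree."""
    def __init__(self, state, parent=None, action=None):
        self.state = state              # game state this node represents
        self.parent = parent
        self.action = action            # action that led here from the parent
        self.children = []
        self.untried = list(state.legal_actions())  # successors not yet added
        self.visits = 0
        self.total_utility = 0.0        # sum of rollout results seen so far

def uct(node, c=math.sqrt(2)):
    """UCT score: average utility plus an exploration bonus (see below)."""
    return (node.total_utility / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def select(node):
    """Selection: descend via UCT until a not-fully-expanded node is found."""
    while not node.state.is_terminal() and not node.untried:
        node = max(node.children, key=uct)
    return node

def expand(node):
    """Expansion: add one randomly chosen untried successor to the tree."""
    action = node.untried.pop(random.randrange(len(node.untried)))
    child = Node(node.state.step(action), parent=node, action=action)
    node.children.append(child)
    return child

def rollout(state):
    """Simulation: play random actions to a terminal state; return the
    utility from player 0's point of view."""
    while not state.is_terminal():
        state = state.step(random.choice(state.legal_actions()))
    return state.utility()

def backpropagate(node, utility):
    """Update: walk back to the root, updating visits and utilities."""
    while node is not None:
        node.visits += 1
        if node.parent is not None:
            # credit the result from the viewpoint of the player who chose
            # the move leading into this node (zero-sum assumption)
            mover = node.parent.state.current_player()
            node.total_utility += utility if mover == 0 else -utility
        node = node.parent

def mcts(root_state, iterations=10_000):
    root = Node(root_state)
    for _ in range(iterations):
        leaf = select(root)
        if not leaf.state.is_terminal():
            leaf = expand(leaf)
        backpropagate(leaf, rollout(leaf.state))
    # play the most-visited root action, a common and robust final choice
    return max(root.children, key=lambda n: n.visits).action
```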
A popular criterion for a node’s potential in the selection step is UCT (upper confidence bounds applied to trees), which is also used to balance exploration and exploitation in methods for bandit problems.
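In one common form (constant factors vary across the literature), the selection step at a node with $n$ visits picks the child $j$ that maximizes

$$\mathrm{UCT}(j) = \bar{X}_j + c \sqrt{\frac{\ln n}{n_j}},$$

where $\bar{X}_j$ is the average utility recorded for child $j$, $n_j$ is its visit count, and $c$ is an exploration constant (often $c = \sqrt{2}$). The first term favours children that have performed well so far (exploitation), while the second grows for rarely visited children (exploration), so every branch keeps receiving occasional attention.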
Figure: Monte Carlo tree search (according to [mcts], with modifications).
Literature
- [alphago] Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. and Dieleman, S., 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), pp. 484–489.
- [alphazero] Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T. and Lillicrap, T., 2017. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.
- [essentials] Leyton-Brown, K. and Shoham, Y., 2008. Essentials of game theory: A concise multidisciplinary introduction. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2(1), pp. 1–88.
- [mcts] Chaslot, G., Bakkes, S., Szita, I. and Spronck, P., 2008, October. Monte-Carlo Tree Search: A New Framework for Game AI. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE).