
2022 | Original Paper | Book Chapter

6. Two-Agent Self-Play

Author: Aske Plaat

Published in: Deep Reinforcement Learning

Publisher: Springer Nature Singapore


Abstract

Previous chapters were concerned with how a single agent can learn optimal behavior for its environment. This chapter is different: we turn to problems in which two agents operate and the behavior of both is modeled (and, in the next chapter, more than two).


Footnotes
1
A modern reimplementation of TD-Gammon in TensorFlow is available on GitHub at https://github.com/fomorians/td-gammon.
 
2
For example, the maximal state space of tic-tac-toe is 3^9 = 19,683 positions (9 squares of “X,” “O,” or blank), where only 765 positions remain if we remove symmetrical and illegal positions [96].
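
As a sanity check on these numbers, here is a minimal Python sketch (assuming only the standard rules: X moves first, and play stops as soon as a line is completed) that enumerates all 3^9 boards, keeps the positions reachable in actual play, and deduplicates them under the eight board symmetries:

```python
from itertools import product

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def wins(board, p):
    # True if player p has completed a line on the 9-character board.
    return any(all(board[i] == p for i in line) for line in LINES)

def reachable(board):
    # X moves first; a winner's final move must end the game.
    x, o = board.count('X'), board.count('O')
    if x - o not in (0, 1):
        return False
    wx, wo = wins(board, 'X'), wins(board, 'O')
    if wx and wo:
        return False
    if wx and x != o + 1:
        return False
    if wo and x != o:
        return False
    return True

def rotate(board):
    # Rotate the 3x3 board 90 degrees clockwise.
    return ''.join(board[3 * (2 - c) + r] for r in range(3) for c in range(3))

def mirror(board):
    # Reflect the board left-right.
    return ''.join(board[3 * r + (2 - c)] for r in range(3) for c in range(3))

def canonical(board):
    # Smallest string among the 8 symmetric variants of the board.
    variants = []
    for b in (board, mirror(board)):
        for _ in range(4):
            variants.append(b)
            b = rotate(b)
    return min(variants)

boards = [''.join(p) for p in product('XO.', repeat=9)]
legal = [b for b in boards if reachable(b)]
print(len(legal))                          # 5478 reachable positions
print(len({canonical(b) for b in legal}))  # 765 after removing symmetries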
 
3
Drosophila melanogaster is also known as the fruit fly, a favorite species of genetics researchers for testing their theories, because experiments produce quick and clear answers.
 
4
Absolute beginners in Go start at 30 kyu, progressing to 10 kyu and advancing to 1 kyu (30k–1k). Stronger amateur players then achieve 1 dan, progressing to 7 dan, the highest amateur rating for Go (1d–7d). Professional Go players have a rating from 1 dan to 9 dan, written as 1p–9p.
 
5
There is also research into opponent modeling, where we try to exploit our opponent’s weaknesses [14, 47, 54]. Here, we assume an identical opponent, which often works best in chess and Go.
 
6
Because the agent knows the transition function T, it can calculate the new state s′ for each action a. The reward r is calculated at terminal states, where it is equal to the value v. Hence, in this diagram, the search function provides the state to the eval function. See [87, 125] for an explanation of the search-eval architecture.
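
As a minimal sketch of this search-eval interplay (not the book's code; `actions`, `apply`, `terminal`, and `eval_fn` are hypothetical helpers standing in for the known transition function T and the evaluation function), a depth-limited negamax search hands states to the evaluation function at its horizon:

```python
def negamax(s, depth):
    # The search walks the tree using the known transition function;
    # the evaluation function scores the states it is handed. At a
    # terminal state the heuristic value equals the true reward.
    if depth == 0 or terminal(s):
        return eval_fn(s)
    return max(-negamax(apply(s, a), depth - 1) for a in actions(s))
```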
 
7
The heuristic evaluation function was originally a linear combination of hand-crafted heuristic rules, such as material balance (which side has more pieces) or center control. At first, the coefficients of the linear combination were not only hand-coded but also hand-tuned; later they were trained by supervised learning [10, 46, 91, 120]. More recently, NNUE was introduced as a nonlinear neural network for use as an evaluation function in an alpha-beta framework [81].
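
For illustration only (the features, weights, and helper functions below are hypothetical, not taken from any engine), such a linear evaluation could look like:

```python
def heuristic_eval(s, weights=(1.0, 0.25)):
    # Weighted sum of hand-crafted features; the coefficients are the
    # part that was first hand-tuned and later learned from data.
    # material_balance and center_control are hypothetical feature
    # extractors for the position s.
    w_material, w_center = weights
    return w_material * material_balance(s) + w_center * center_control(s)
```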
 
8
Compare chess and Go: in chess, the typical number of moves in a position is 25, and for Go this number is 250. A chess tree of depth 5 has 25^5 = 9,765,625 leaves. A Go tree of depth 5 has 250^5 = 976,562,500,000 leaves. A depth-5 minimax search in Go would take prohibitively long; an MCTS search of 1000 expansions expands the same number of paths from root to leaf in both games.
 
9
Originally, playouts were random (the Monte Carlo part in the name of MCTS) following Brügmann’s [18] and Bouzy and Helmstetter’s [15] original approach. In practice, most Go playing programs improve on the random playouts by using databases of small 3 × 3 patterns with best replies and other fast heuristics [24, 31, 33, 50, 106]. Small amounts of domain knowledge are used after all, albeit not in the form of a heuristic evaluation function.
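
A minimal playout sketch (with hypothetical game helpers `game_over`, `legal_moves`, `play`, and `score`) shows the Monte Carlo part:

```python
import random

def playout(s):
    # Play uniformly random moves to the end of the game and return
    # the terminal score; engines replace random.choice with fast
    # pattern-based heuristics, as described above.
    while not game_over(s):
        s = play(s, random.choice(legal_moves(s)))
    return score(s)
```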
 
11
The square root term is a measure of the variance (uncertainty) of the action value. The use of the natural logarithm, whose increments shrink over time, ensures that actions that have already been tried often are re-selected less and less frequently. However, since the logarithm is unbounded, eventually every action will be selected [114].
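
For reference, the UCT selection rule based on UCB1 [7, 61] is commonly written as

    \mathrm{UCT}(a) = \bar{X}_a + C_p \sqrt{\frac{\ln N}{n_a}}

where \bar{X}_a is the average return of action a, n_a its visit count, N the visit count of the parent node, and C_p a tunable exploration constant.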
 
12
Note further that the small differences under the square root (no logarithm, and the 1 in the denominator) also change the UCT profile somewhat, ensuring correct behavior at unvisited actions [77].
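
For comparison, the PUCT variant used in AlphaGo Zero [93, 109] adds to the action value an exploration term of the form

    U(s, a) = c_{\mathrm{puct}} \, P(s, a) \, \frac{\sqrt{N(s)}}{1 + n(s, a)}

where P(s, a) is the policy network's prior probability; because the denominator is 1 + n(s, a), an unvisited action still receives a finite, prior-weighted bonus.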
 
13
Such a sequence of related learning tasks corresponds to a meta-learning problem. In meta-learning, the aim is to learn a new task quickly by using the knowledge learned from previous, related tasks; see Chap. 9.
 
14
See also generative adversarial networks and deep dreaming, for a connectionist approach to content generation, Sect. B.2.6.2.
 
15
TPU stands for tensor processing unit, a low-precision design specifically developed for fast neural network processing.
 
16
The basis of the Elo rating is pairwise comparison [42]. Elo is often used to compare playing strength in board games.
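
In the Elo model, the expected score of player A against player B follows from the rating difference alone:

    E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}

so a 400-point rating gap, for example, corresponds to an expected score of about 0.91 for the stronger player.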
 
17
Treat as if human.
 
18
An AlphaZero version that has learned to play Go cannot also play chess; it has to learn chess from scratch, with different input and output layers.
 
References
1. Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920, 2018.
2. Bruce Abramson. Expected-outcome: A general model of static evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(2):182–193, 1990.
3. Anonymous. Go AI strength vs. time. Reddit post, 2017.
4. Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems, pages 5360–5370, 2017.
5. Oleg Arenz. Monte Carlo Chess. Master's thesis, Universität Darmstadt, 2012.
6. Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
7. Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3):235–256, 2002.
8. Peter Auer and Ronald Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1–2):55–65, 2010.
9. Jonathan Baxter, Andrew Tridgell, and Lex Weaver. KnightCap: A chess program that learns by combining TD(λ) with game-tree search. arXiv preprint cs/9901002, 1999.
10. Jonathan Baxter, Andrew Tridgell, and Lex Weaver. Learning to play chess using temporal differences. Machine Learning, 40(3):243–263, 2000.
11. Don Beal and Martin C. Smith. Temporal difference learning for heuristic search and game playing. Information Sciences, 122(1):3–21, 2000.
12. Laurens Beljaards. AI agents for the abstract strategy game Tak. Master's thesis, Leiden University, 2017.
13. Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48, 2009.
14. Darse Billings, Denis Papp, Jonathan Schaeffer, and Duane Szafron. Opponent modeling in poker. AAAI/IAAI, 493:499, 1998.
15. Bruno Bouzy and Bernard Helmstetter. Monte Carlo Go developments. In Advances in Computer Games, pages 159–174. Springer, 2004.
16. Cameron Browne. Hex Strategy. AK Peters/CRC Press, 2000.
17. Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo Tree Search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
18. Bernd Brügmann. Monte Carlo Go. Technical report, Syracuse University, 1993.
19. Andres Campero, Roberta Raileanu, Heinrich Küttler, Joshua B Tenenbaum, Tim Rocktäschel, and Edward Grefenstette. Learning with AMIGo: Adversarially motivated intrinsic goals. In International Conference on Learning Representations, 2020.
20. Tristan Cazenave. Residual networks for computer Go. IEEE Transactions on Games, 10(1):107–110, 2018.
21. Tristan Cazenave, Yen-Chi Chen, Guan-Wei Chen, Shi-Yu Chen, Xian-Dong Chiu, Julien Dehos, Maria Elsa, Qucheng Gong, Hengyuan Hu, Vasil Khalidov, Cheng-Ling Li, Hsin-I Lin, Yu-Jin Lin, Xavier Martinet, Vegard Mella, Jérémy Rapin, Baptiste Rozière, Gabriel Synnaeve, Fabien Teytaud, Olivier Teytaud, Shi-Cheng Ye, Yi-Jun Ye, Shi-Jim Yen, and Sergey Zagoruyko. Polygames: Improved zero learning. arXiv preprint arXiv:2001.09832, 2020.
22. Tristan Cazenave and Bernard Helmstetter. Combining tactical search and Monte-Carlo in the game of Go. In Proceedings of the 2005 IEEE Symposium on Computational Intelligence and Games (CIG05), Essex University, volume 5, pages 171–175, 2005.
23. Hyeong Soo Chang, Michael C Fu, Jiaqiao Hu, and Steven I Marcus. An adaptive sampling algorithm for solving Markov decision processes. Operations Research, 53(1):126–139, 2005.
24. Guillaume Chaslot. Monte-Carlo tree search. PhD thesis, Maastricht University, 2010.
25. Guillaume Chaslot, Sander Bakkes, Istvan Szita, and Pieter Spronck. Monte-Carlo tree search: A new framework for game AI. In AIIDE, 2008.
27. Christopher Clark and Amos Storkey. Teaching deep convolutional neural networks to play Go. arXiv preprint arXiv:1412.3409, 2014.
28. Christopher Clark and Amos Storkey. Training deep convolutional neural networks to play Go. In International Conference on Machine Learning, pages 1766–1774, 2015.
29. Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. In International Conference on Machine Learning, pages 2048–2056. PMLR, 2020.
30. Rémi Coulom. Efficient selectivity and backup operators in Monte-Carlo Tree Search. In International Conference on Computers and Games, pages 72–83. Springer, 2006.
31. Rémi Coulom. Monte-Carlo tree search in Crazy Stone. In Proceedings Game Programming Workshop, Tokyo, Japan, pages 74–75, 2007.
32. Rémi Coulom. The Monte-Carlo revolution in Go. In The Japanese-French Frontiers of Science Symposium (JFFoS 2008), Roscoff, France, 2009.
33. Joseph C Culberson and Jonathan Schaeffer. Pattern databases. Computational Intelligence, 14(3):318–334, 1998.
34. Wojciech Marian Czarnecki, Gauthier Gidel, Brendan Tracey, Karl Tuyls, Shayegan Omidshafiei, David Balduzzi, and Max Jaderberg. Real world games look like spinning tops. In Advances in Neural Information Processing Systems, 2020.
36. Shubhomoy Das, Weng-Keen Wong, Thomas Dietterich, Alan Fern, and Andrew Emmott. Incorporating expert feedback into active anomaly discovery. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 853–858. IEEE, 2016.
37. Dave De Jonge, Tim Baarslag, Reyhan Aydoğan, Catholijn Jonker, Katsuhide Fujita, and Takayuki Ito. The challenge of negotiation in the game of Diplomacy. In International Conference on Agreement Technologies, pages 100–114. Springer, 2018.
38. Thang Doan, Joao Monteiro, Isabela Albuquerque, Bogdan Mazoure, Audrey Durand, Joelle Pineau, and R Devon Hjelm. On-line adaptative curriculum learning for GANs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3470–3477, 2019.
39. Christian Donninger. Null move and deep search. ICGA Journal, 16(3):137–143, 1993.
40. Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL^2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
41. Jeffrey L Elman. Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71–99, 1993.
42. Arpad E Elo. The Rating of Chessplayers, Past and Present. Arco Pub., 1978.
43. Markus Enzenberger, Martin Muller, Broderick Arneson, and Richard Segal. Fuego—an open-source framework for board games and Go engine based on Monte Carlo tree search. IEEE Transactions on Computational Intelligence and AI in Games, 2(4):259–270, 2010.
44. Dieqiao Feng, Carla P Gomes, and Bart Selman. Solving hard AI planning instances using curriculum-driven deep reinforcement learning. arXiv preprint arXiv:2006.02689, 2020.
45. Carlos Florensa, David Held, Xinyang Geng, and Pieter Abbeel. Automatic goal generation for reinforcement learning agents. In International Conference on Machine Learning, pages 1515–1528. PMLR, 2018.
46. David B Fogel, Timothy J Hays, Sarah L Hahn, and James Quon. Further evolution of a self-learning chess program. In Computational Intelligence in Games, 2005.
47. Sam Ganzfried and Tuomas Sandholm. Game theory-based opponent modeling in large imperfect-information games. In The 10th International Conference on Autonomous Agents and Multiagent Systems, volume 2, pages 533–540, 2011.
48. Sylvain Gelly, Levente Kocsis, Marc Schoenauer, Michele Sebag, David Silver, Csaba Szepesvári, and Olivier Teytaud. The grand challenge of computer Go: Monte Carlo tree search and extensions. Communications of the ACM, 55(3):106–113, 2012.
49. Sylvain Gelly and David Silver. Achieving master level play in 9 × 9 computer Go. In AAAI, volume 8, pages 1537–1540, 2008.
50. Sylvain Gelly, Yizao Wang, and Olivier Teytaud. Modification of UCT with patterns in Monte-Carlo Go. Technical Report RR-6062, INRIA, 2006.
51. Tobias Graf and Marco Platzner. Adaptive playouts in Monte-Carlo tree search with policy-gradient reinforcement learning. In Advances in Computer Games, pages 1–11. Springer, 2015.
52. Jean-Bastien Grill, Florent Altché, Yunhao Tang, Thomas Hubert, Michal Valko, Ioannis Antonoglou, and Rémi Munos. Monte-Carlo tree search as regularized policy optimization. In International Conference on Machine Learning, pages 3769–3778. PMLR, 2020.
53. Ryan B Hayward and Bjarne Toft. Hex: The Full Story. CRC Press, 2019.
54. He He, Jordan Boyd-Graber, Kevin Kwok, and Hal Daumé III. Opponent modeling in deep reinforcement learning. In International Conference on Machine Learning, pages 1804–1813. PMLR, 2016.
55. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
56. Ernst A Heinz. New self-play results in computer chess. In International Conference on Computers and Games, pages 262–276. Springer, 2000.
57. Athul Paul Jacob, David J Wu, Gabriele Farina, Adam Lerer, Anton Bakhtin, Jacob Andreas, and Noam Brown. Modeling strong and human-like gameplay with KL-regularized search. arXiv preprint arXiv:2112.07544, 2021.
58. John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A. A. Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W. Senior, Koray Kavukcuoglu, Pushmeet Kohli, and Demis Hassabis. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.
59. Niels Justesen, Ruben Rodriguez Torrado, Philip Bontrager, Ahmed Khalifa, Julian Togelius, and Sebastian Risi. Illuminating generalization in deep reinforcement learning through procedural level generation. arXiv preprint arXiv:1806.10729, 2018.
60. Donald E Knuth and Ronald W Moore. An analysis of alpha-beta pruning. Artificial Intelligence, 6(4):293–326, 1975.
61. Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293. Springer, 2006.
62. Richard E Korf. Depth-first iterative-deepening: An optimal admissible tree search. Artificial Intelligence, 27(1):97–109, 1985.
63. Sarit Kraus, Eithan Ephrati, and Daniel Lehmann. Negotiation in a non-cooperative environment. Journal of Experimental & Theoretical Artificial Intelligence, 3(4):255–281, 1994.
64. Kai A Krueger and Peter Dayan. Flexible shaping: How learning in small steps helps. Cognition, 110(3):380–394, 2009.
65. Jan Kuipers, Aske Plaat, Jos AM Vermaseren, and H Jaap van den Herik. Improving multivariate Horner schemes with Monte Carlo tree search. Computer Physics Communications, 184(11):2391–2395, 2013.
66. Alexandre Laterre, Yunguan Fu, Mohamed Khalil Jabri, Alain-Sam Cohen, David Kas, Karl Hajjar, Torbjorn S Dahl, Amine Kerkeni, and Karim Beguir. Ranked reward: Enabling self-play reinforcement learning for combinatorial optimization. arXiv preprint arXiv:1807.01672, 2018.
67. Joel Z Leibo, Edward Hughes, Marc Lanctot, and Thore Graepel. Autocurricula and the emergence of innovation from social interaction: A manifesto for multi-agent intelligence research. arXiv preprint arXiv:1903.00742, 2019.
68. Diego Pérez Liébana, Simon M Lucas, Raluca D Gaina, Julian Togelius, Ahmed Khalifa, and Jialin Liu. General video game artificial intelligence. Synthesis Lectures on Games and Computational Intelligence, 3(2):1–191, 2019.
69. Tambet Matiisen, Avital Oliver, Taco Cohen, and John Schulman. Teacher-student curriculum learning. IEEE Transactions on Neural Networks and Learning Systems, 31(9):3732–3740, 2020.
70. Kiminori Matsuzaki. Empirical analysis of PUCT algorithm with evaluation functions of different quality. In 2018 Conference on Technologies and Applications of Artificial Intelligence (TAAI), pages 142–147. IEEE, 2018.
71. Jonathan K Millen. Programming the game of Go. Byte Magazine, 1981.
72. S Ali Mirsoleimani, Aske Plaat, Jaap Van Den Herik, and Jos Vermaseren. Scaling Monte Carlo tree search on Intel Xeon Phi. In Parallel and Distributed Systems (ICPADS), 2015 IEEE 21st International Conference on, pages 666–673. IEEE, 2015.
73. Tom M Mitchell. The need for biases in learning generalizations. Technical Report CBM-TR-117, Department of Computer Science, Rutgers University, 1980.
74. Tom M Mitchell. The discipline of machine learning. Technical Report CMU-ML-06-108, Carnegie Mellon University, School of Computer Science, Machine Learning, 2006.
75. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
76. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
77. Thomas M Moerland, Joost Broekens, Aske Plaat, and Catholijn M Jonker. A0C: Alpha Zero in continuous action space. arXiv preprint arXiv:1805.09613, 2018.
78. Thomas M Moerland, Joost Broekens, Aske Plaat, and Catholijn M Jonker. Monte Carlo tree search for asymmetric trees. arXiv preprint arXiv:1805.09218, 2018.
79. Matthias Müller-Brockhausen, Mike Preuss, and Aske Plaat. Procedural content generation: Better benchmarks for transfer reinforcement learning. In Conference on Games, 2021.
80. Sanmit Narvekar, Bei Peng, Matteo Leonetti, Jivko Sinapov, Matthew E Taylor, and Peter Stone. Curriculum learning for reinforcement learning domains: A framework and survey. Journal of Machine Learning Research, 2020.
81. Yu Nasu. Efficiently updatable neural-network-based evaluation functions for computer shogi. The 28th World Computer Shogi Championship Appeal Document, 2018.
82. Frans A Oliehoek and Christopher Amato. A Concise Introduction to Decentralized POMDPs. Springer, 2016.
83. Pierre-Yves Oudeyer, Frederic Kaplan, and Verena V Hafner. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, 2007.
84. Giuseppe Davide Paparo, Vedran Dunjko, Adi Makmal, Miguel Angel Martin-Delgado, and Hans J Briegel. Quantum speedup for active learning agents. Physical Review X, 4(3):031002, 2014.
86. Judea Pearl. Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley, Reading, MA, 1984.
88. Aske Plaat, Jonathan Schaeffer, Wim Pijls, and Arie De Bruin. Best-first fixed-depth minimax algorithms. Artificial Intelligence, 87(1-2):255–293, 1996.
90. Max Pumperla and Kevin Ferguson. Deep Learning and the Game of Go. Manning, 2019.
91. J Ross Quinlan. Learning efficient classification procedures and their application to chess end games. In Machine Learning, pages 463–482. Springer, 1983.
92. Roberta Raileanu and Tim Rocktäschel. RIDE: Rewarding impact-driven exploration for procedurally-generated environments. In International Conference on Learning Representations, 2020.
93. Christopher D Rosin. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011.
94. Neil Rubens, Mehdi Elahi, Masashi Sugiyama, and Dain Kaplan. Active learning in recommender systems. In Recommender Systems Handbook, pages 809–846. Springer, 2015.
95. Ben Ruijl, Jos Vermaseren, Aske Plaat, and Jaap van den Herik. HEPGAME and the simplification of expressions. arXiv preprint arXiv:1405.6369, 2014.
97. Jonathan Schaeffer, Aske Plaat, and Andreas Junghanns. Unifying single-agent and two-player search. Information Sciences, 135(3-4):151–175, 2001.
98. Jürgen Schmidhuber. Curious model-building control systems. In Proceedings International Joint Conference on Neural Networks, pages 1458–1463, 1991.
99. Marwin HS Segler, Mike Preuss, and Mark P Waller. Planning chemical syntheses with deep neural networks and symbolic AI. Nature, 555(7698):604, 2018.
100. Oliver G Selfridge, Richard S Sutton, and Andrew G Barto. Training and tracking in robotics. In International Joint Conference on Artificial Intelligence, pages 670–672, 1985.
101. Andrew W. Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Zídek, Alexander W. R. Nelson, Alex Bridgland, Hugo Penedones, Stig Petersen, Karen Simonyan, Steve Crossan, Pushmeet Kohli, David T. Jones, David Silver, Koray Kavukcuoglu, and Demis Hassabis. Improved protein structure prediction using potentials from deep learning. Nature, 577(7792):706–710, 2020.
102. Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison, Department of Computer Sciences, 2009.
103. Noor Shaker, Julian Togelius, and Mark J Nelson. Procedural Content Generation in Games. Springer, 2016.
104. Guy Shani, Joelle Pineau, and Robert Kaplow. A survey of point-based POMDP solvers. Autonomous Agents and Multi-Agent Systems, 27(1):1–51, 2013.
105. Claude E Shannon. Programming a computer for playing chess. In Computer Chess Compendium, pages 2–13. Springer, 1988.
106. David Silver. Reinforcement learning and simulation-based search in the game of Go. PhD thesis, University of Alberta, 2009.
107. David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
108. David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
109. David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
110. David Silver, Richard S Sutton, and Martin Müller. Reinforcement learning of local shape in the game of Go. In International Joint Conference on Artificial Intelligence, volume 7, pages 1053–1058, 2007.
111. David J Slate and Lawrence R Atkin. Chess 4.5—The Northwestern University chess program. In Chess Skill in Man and Machine, pages 82–118. Springer, 1983.
112. Gillian Smith. An analog history of procedural content generation. In Foundations of Digital Games, 2015.
114. Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction, Second Edition. MIT Press, 2018.
115. Gerald Tesauro. Neurogammon wins Computer Olympiad. Neural Computation, 1(3):321–323, 1989.
116. Gerald Tesauro. TD-Gammon: A self-teaching backgammon program. In Applications of Neural Networks, pages 267–285. Springer, 1995.
117. Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.
118. Gerald Tesauro. Programming backgammon using self-teaching neural nets. Artificial Intelligence, 134(1-2):181–199, 2002.
119. Shantanu Thakoor, Surag Nair, and Megha Jhunjhunwala. Learning to play Othello without human knowledge. Stanford University CS238 Final Project Report, 2017.
120. Sebastian Thrun. Learning to play the game of chess. In Advances in Neural Information Processing Systems, pages 1069–1076, 1995.
121. Yuandong Tian, Qucheng Gong, Wenling Shang, Yuxin Wu, and C Lawrence Zitnick. ELF: An extensive, lightweight and flexible research platform for real-time strategy games. In Advances in Neural Information Processing Systems, pages 2659–2669, 2017.
123. Yuandong Tian and Yan Zhu. Better computer Go player with neural network and long-term prediction. In International Conference on Learning Representations, 2016.
124. Julian Togelius, Alex J Champandard, Pier Luca Lanzi, Michael Mateas, Ana Paiva, Mike Preuss, and Kenneth O Stanley. Procedural content generation: Goals, challenges and actionable steps. In Artificial and Computational Intelligence in Games. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2013.
125. Alan M Turing. Digital Computers Applied to Games. Pitman & Sons, 1953.
126. Michiel Van Der Ree and Marco Wiering. Reinforcement learning in the game of Othello: Learning against a fixed opponent and learning from self-play. In IEEE Adaptive Dynamic Programming and Reinforcement Learning, pages 108–115. IEEE, 2013.
127. Gerard JP Van Westen, Jörg K Wegner, Peggy Geluykens, Leen Kwanten, Inge Vereycken, Anik Peeters, Adriaan P IJzerman, Herman WT van Vlijmen, and Andreas Bender. Which compound to select in lead optimization? Prospectively validated proteochemometric models guide preclinical development. PLoS One, 6(11):e27518, 2011.
128. Jos AM Vermaseren. New features of FORM. arXiv preprint math-ph/0010025, 2000.
129. Hui Wang, Michael Emmerich, Mike Preuss, and Aske Plaat. Alternative loss functions in AlphaZero-like self-play. In 2019 IEEE Symposium Series on Computational Intelligence (SSCI), pages 155–162, 2019.
130. Hui Wang, Mike Preuss, Michael Emmerich, and Aske Plaat. Tackling Morpion Solitaire with AlphaZero-like Ranked Reward reinforcement learning. In 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2020, Timisoara, Romania, 2020.
131. Panqu Wang and Garrison W Cottrell. Basic level categorization facilitates visual object recognition. arXiv preprint arXiv:1511.04103, 2015.
132. Daphna Weinshall, Gad Cohen, and Dan Amir. Curriculum learning by transfer learning: Theory and experiments with deep networks. In International Conference on Machine Learning, pages 5235–5243, 2018.
134. Marco A Wiering. Self-play and using an expert to learn to play backgammon with temporal difference learning. JILSA, 2(2):57–68, 2010.
Metadata
Title
Two-Agent Self-Play
Author
Aske Plaat
Copyright Year
2022
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-19-0638-1_6
