
2022 | Original Paper | Book Chapter

6. Two-Agent Self-Play

Author: Aske Plaat

Published in: Deep Reinforcement Learning

Publisher: Springer Nature Singapore


Abstract

Previous chapters were concerned with how a single agent can learn optimal behavior for its environment. This chapter is different: we turn to problems in which two agents operate and the behavior of both is modeled (and, in the next chapter, more than two).


Footnotes
1
A modern reimplementation of TD-Gammon in TensorFlow is available on GitHub at https://github.com/fomorians/td-gammon.
 
2
For example, the maximal state space of tic-tac-toe is 3^9 = 19,683 positions (9 squares of “X,” “O,” or blank), where only 765 positions remain if we remove symmetrical and illegal positions [96].
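
As a sanity check on these numbers, here is a minimal Python sketch (assuming only the standard rules: X moves first, and play stops as soon as a line is completed) that enumerates all 3^9 boards, keeps the positions reachable in actual play, and deduplicates them under the eight board symmetries:

```python
from itertools import product

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def wins(board, p):
    # True if player p has completed a line on the 9-character board.
    return any(all(board[i] == p for i in line) for line in LINES)

def reachable(board):
    # X moves first; a winner's final move must end the game.
    x, o = board.count('X'), board.count('O')
    if x - o not in (0, 1):
        return False
    wx, wo = wins(board, 'X'), wins(board, 'O')
    if wx and wo:
        return False
    if wx and x != o + 1:
        return False
    if wo and x != o:
        return False
    return True

def rotate(board):
    # Rotate the 3x3 board 90 degrees clockwise.
    return ''.join(board[3 * (2 - c) + r] for r in range(3) for c in range(3))

def mirror(board):
    # Reflect the board left-right.
    return ''.join(board[3 * r + (2 - c)] for r in range(3) for c in range(3))

def canonical(board):
    # Smallest string among the 8 symmetric variants of the board.
    variants = []
    for b in (board, mirror(board)):
        for _ in range(4):
            variants.append(b)
            b = rotate(b)
    return min(variants)

boards = [''.join(p) for p in product('XO.', repeat=9)]
legal = [b for b in boards if reachable(b)]
print(len(legal))                          # 5478 reachable positions
print(len({canonical(b) for b in legal}))  # 765 after removing symmetries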
 
3
Drosophila melanogaster is also known as the fruit fly, a favorite species of genetics researchers for testing their theories, because experiments produce quick and clear answers.
 
4
Absolute beginners in Go start at 30 kyu, progressing to 10 kyu and advancing to 1 kyu (30k–1k). Stronger amateur players then achieve 1 dan, progressing to 7 dan, the highest amateur rating for Go (1d–7d). Professional Go players have a rating from 1 dan to 9 dan, written as 1p–9p.
 
5
There is also research into opponent modeling, where we try to exploit our opponent’s weaknesses [14, 47, 54]. Here, we assume an identical opponent, which often works best in chess and Go.
 
6
Because the agent knows the transition function T, it can calculate the new state s′ for each action a. The reward r is calculated at terminal states, where it is equal to the value v. Hence, in this diagram, the search function provides the state to the eval function. See [87, 125] for an explanation of the search-eval architecture.
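
As a minimal sketch of this search-eval interplay (not the book's code; `actions`, `apply`, `terminal`, and `eval_fn` are hypothetical helpers standing in for the known transition function T and the evaluation function), a depth-limited negamax search hands states to the evaluation function at its horizon:

```python
def negamax(s, depth):
    # The search walks the tree using the known transition function;
    # the evaluation function scores the states it is handed. At a
    # terminal state the heuristic value equals the true reward.
    if depth == 0 or terminal(s):
        return eval_fn(s)
    return max(-negamax(apply(s, a), depth - 1) for a in actions(s))
```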
 
7
The heuristic evaluation function was originally a linear combination of hand-crafted heuristic rules, such as material balance (which side has more pieces) or center control. At first, the coefficients of the linear combination were not only hand-coded but also hand-tuned; later they were trained by supervised learning [10, 46, 91, 120]. More recently, NNUE was introduced as a nonlinear neural network for use as an evaluation function in an alpha-beta framework [81].
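
For illustration only (the features, weights, and helper functions below are hypothetical, not taken from any engine), such a linear evaluation could look like:

```python
def heuristic_eval(s, weights=(1.0, 0.25)):
    # Weighted sum of hand-crafted features; the coefficients are the
    # part that was first hand-tuned and later learned from data.
    # material_balance and center_control are hypothetical feature
    # extractors for the position s.
    w_material, w_center = weights
    return w_material * material_balance(s) + w_center * center_control(s)
```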
 
8
Compare chess and Go: in chess, the typical number of moves in a position is 25, and for Go this number is 250. A chess tree of depth 5 has 25^5 = 9,765,625 leaves. A Go tree of depth 5 has 250^5 = 976,562,500,000 leaves. A depth-5 minimax search in Go would take prohibitively long; an MCTS search of 1000 expansions expands the same number of paths from root to leaf in both games.
 
9
Originally, playouts were random (the Monte Carlo part in the name of MCTS) following Brügmann’s [18] and Bouzy and Helmstetter’s [15] original approach. In practice, most Go playing programs improve on the random playouts by using databases of small 3 × 3 patterns with best replies and other fast heuristics [24, 31, 33, 50, 106]. Small amounts of domain knowledge are used after all, albeit not in the form of a heuristic evaluation function.
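
A minimal playout sketch (with hypothetical game helpers `game_over`, `legal_moves`, `play`, and `score`) shows the Monte Carlo part:

```python
import random

def playout(s):
    # Play uniformly random moves to the end of the game and return
    # the terminal score; engines replace random.choice with fast
    # pattern-based heuristics, as described above.
    while not game_over(s):
        s = play(s, random.choice(legal_moves(s)))
    return score(s)
```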
 
11
The square root term is a measure of the variance (uncertainty) of the action value. The use of the natural logarithm, whose increments shrink over time, ensures that actions that have already been tried often are re-selected less and less frequently. However, since the logarithm is unbounded, eventually every action will be selected [114].
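
For reference, the UCT selection rule based on UCB1 [7, 61] is commonly written as

    \mathrm{UCT}(a) = \bar{X}_a + C_p \sqrt{\frac{\ln N}{n_a}}

where \bar{X}_a is the average return of action a, n_a its visit count, N the visit count of the parent node, and C_p a tunable exploration constant.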
 
12
Note further that the small differences under the square root (no logarithm, and the 1 in the denominator) also change the UCT profile somewhat, ensuring correct behavior at unvisited actions [77].
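
For comparison, the PUCT variant used in AlphaGo Zero [93, 109] adds to the action value an exploration term of the form

    U(s, a) = c_{\mathrm{puct}} \, P(s, a) \, \frac{\sqrt{N(s)}}{1 + n(s, a)}

where P(s, a) is the policy network's prior probability; because the denominator is 1 + n(s, a), an unvisited action still receives a finite, prior-weighted bonus.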
 
13
Such a sequence of related learning tasks corresponds to a meta-learning problem. In meta-learning, the aim is to learn a new task quickly by using the knowledge learned from previous, related tasks; see Chap. 9.
 
14
See also generative adversarial networks and deep dreaming, for a connectionist approach to content generation, Sect. B.2.6.2.
 
15
TPU stands for tensor processing unit, a low-precision design specifically developed for fast neural network processing.
 
16
The basis of the Elo rating is pairwise comparison [42]. Elo is often used to compare playing strength in board games.
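
In the Elo model, the expected score of player A against player B follows from the rating difference alone:

    E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}

so a 400-point rating gap, for example, corresponds to an expected score of about 0.91 for the stronger player.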
 
17
Treat as if human.
 
18
An AlphaZero version that has learned to play Go cannot also play chess; it has to learn chess from scratch, with different input and output layers.
 
References
1. Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920, 2018.
2. Bruce Abramson. Expected-outcome: A general model of static evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(2):182–193, 1990.
3. Anonymous. Go AI strength vs. time. Reddit post, 2017.
4. Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems, pages 5360–5370, 2017.
5. Oleg Arenz. Monte Carlo Chess. Master's thesis, Universität Darmstadt, 2012.
6. Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
7. Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3):235–256, 2002.
8. Peter Auer and Ronald Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1–2):55–65, 2010.
9. Jonathan Baxter, Andrew Tridgell, and Lex Weaver. KnightCap: A chess program that learns by combining TD(λ) with game-tree search. arXiv preprint cs/9901002, 1999.
10. Jonathan Baxter, Andrew Tridgell, and Lex Weaver. Learning to play chess using temporal differences. Machine Learning, 40(3):243–263, 2000.
11. Don Beal and Martin C. Smith. Temporal difference learning for heuristic search and game playing. Information Sciences, 122(1):3–21, 2000.
12. Laurens Beljaards. AI agents for the abstract strategy game Tak. Master's thesis, Leiden University, 2017.
13. Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48, 2009.
14. Darse Billings, Denis Papp, Jonathan Schaeffer, and Duane Szafron. Opponent modeling in poker. AAAI/IAAI, 493:499, 1998.
15. Bruno Bouzy and Bernard Helmstetter. Monte Carlo Go developments. In Advances in Computer Games, pages 159–174. Springer, 2004.
16. Cameron Browne. Hex Strategy. AK Peters/CRC Press, 2000.
17. Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo Tree Search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
18. Bernd Brügmann. Monte Carlo Go. Technical report, Syracuse University, 1993.
19. Andres Campero, Roberta Raileanu, Heinrich Küttler, Joshua B Tenenbaum, Tim Rocktäschel, and Edward Grefenstette. Learning with AMIGo: Adversarially motivated intrinsic goals. In International Conference on Learning Representations, 2020.
20. Tristan Cazenave. Residual networks for computer Go. IEEE Transactions on Games, 10(1):107–110, 2018.
21. Tristan Cazenave, Yen-Chi Chen, Guan-Wei Chen, Shi-Yu Chen, Xian-Dong Chiu, Julien Dehos, Maria Elsa, Qucheng Gong, Hengyuan Hu, Vasil Khalidov, Cheng-Ling Li, Hsin-I Lin, Yu-Jin Lin, Xavier Martinet, Vegard Mella, Jérémy Rapin, Baptiste Rozière, Gabriel Synnaeve, Fabien Teytaud, Olivier Teytaud, Shi-Cheng Ye, Yi-Jun Ye, Shi-Jim Yen, and Sergey Zagoruyko. Polygames: Improved zero learning. arXiv preprint arXiv:2001.09832, 2020.
22. Tristan Cazenave and Bernard Helmstetter. Combining tactical search and Monte-Carlo in the game of Go. In Proceedings of the 2005 IEEE Symposium on Computational Intelligence and Games (CIG05), Essex University, volume 5, pages 171–175, 2005.
23. Hyeong Soo Chang, Michael C Fu, Jiaqiao Hu, and Steven I Marcus. An adaptive sampling algorithm for solving Markov decision processes. Operations Research, 53(1):126–139, 2005.
24. Guillaume Chaslot. Monte-Carlo tree search. PhD thesis, Maastricht University, 2010.
25. Guillaume Chaslot, Sander Bakkes, Istvan Szita, and Pieter Spronck. Monte-Carlo tree search: A new framework for game AI. In AIIDE, 2008.
27. Christopher Clark and Amos Storkey. Teaching deep convolutional neural networks to play Go. arXiv preprint arXiv:1412.3409, 2014.
28. Christopher Clark and Amos Storkey. Training deep convolutional neural networks to play Go. In International Conference on Machine Learning, pages 1766–1774, 2015.
29. Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. In International Conference on Machine Learning, pages 2048–2056. PMLR, 2020.
30. Rémi Coulom. Efficient selectivity and backup operators in Monte-Carlo Tree Search. In International Conference on Computers and Games, pages 72–83. Springer, 2006.
31. Rémi Coulom. Monte-Carlo tree search in Crazy Stone. In Proceedings Game Programming Workshop, Tokyo, Japan, pages 74–75, 2007.
32. Rémi Coulom. The Monte-Carlo revolution in Go. In The Japanese-French Frontiers of Science Symposium (JFFoS 2008), Roscoff, France, 2009.
33. Joseph C Culberson and Jonathan Schaeffer. Pattern databases. Computational Intelligence, 14(3):318–334, 1998.
34. Wojciech Marian Czarnecki, Gauthier Gidel, Brendan Tracey, Karl Tuyls, Shayegan Omidshafiei, David Balduzzi, and Max Jaderberg. Real world games look like spinning tops. In Advances in Neural Information Processing Systems, 2020.
36. Shubhomoy Das, Weng-Keen Wong, Thomas Dietterich, Alan Fern, and Andrew Emmott. Incorporating expert feedback into active anomaly discovery. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 853–858. IEEE, 2016.
37. Dave De Jonge, Tim Baarslag, Reyhan Aydoğan, Catholijn Jonker, Katsuhide Fujita, and Takayuki Ito. The challenge of negotiation in the game of Diplomacy. In International Conference on Agreement Technologies, pages 100–114. Springer, 2018.
38. Thang Doan, Joao Monteiro, Isabela Albuquerque, Bogdan Mazoure, Audrey Durand, Joelle Pineau, and R Devon Hjelm. On-line adaptative curriculum learning for GANs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3470–3477, 2019.
39. Christian Donninger. Null move and deep search. ICGA Journal, 16(3):137–143, 1993.
40. Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL^2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
41. Jeffrey L Elman. Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71–99, 1993.
42. Arpad E Elo. The Rating of Chessplayers, Past and Present. Arco Pub., 1978.
43. Markus Enzenberger, Martin Muller, Broderick Arneson, and Richard Segal. Fuego—an open-source framework for board games and Go engine based on Monte Carlo tree search. IEEE Transactions on Computational Intelligence and AI in Games, 2(4):259–270, 2010.
44. Dieqiao Feng, Carla P Gomes, and Bart Selman. Solving hard AI planning instances using curriculum-driven deep reinforcement learning. arXiv preprint arXiv:2006.02689, 2020.
45. Carlos Florensa, David Held, Xinyang Geng, and Pieter Abbeel. Automatic goal generation for reinforcement learning agents. In International Conference on Machine Learning, pages 1515–1528. PMLR, 2018.
46. David B Fogel, Timothy J Hays, Sarah L Hahn, and James Quon. Further evolution of a self-learning chess program. In Computational Intelligence in Games, 2005.
47. Sam Ganzfried and Tuomas Sandholm. Game theory-based opponent modeling in large imperfect-information games. In The 10th International Conference on Autonomous Agents and Multiagent Systems, volume 2, pages 533–540, 2011.
48. Sylvain Gelly, Levente Kocsis, Marc Schoenauer, Michele Sebag, David Silver, Csaba Szepesvári, and Olivier Teytaud. The grand challenge of computer Go: Monte Carlo tree search and extensions. Communications of the ACM, 55(3):106–113, 2012.
49. Sylvain Gelly and David Silver. Achieving master level play in 9 × 9 computer Go. In AAAI, volume 8, pages 1537–1540, 2008.
50. Sylvain Gelly, Yizao Wang, and Olivier Teytaud. Modification of UCT with patterns in Monte-Carlo Go. Technical Report RR-6062, INRIA, 2006.
51. Tobias Graf and Marco Platzner. Adaptive playouts in Monte-Carlo tree search with policy-gradient reinforcement learning. In Advances in Computer Games, pages 1–11. Springer, 2015.
52. Jean-Bastien Grill, Florent Altché, Yunhao Tang, Thomas Hubert, Michal Valko, Ioannis Antonoglou, and Rémi Munos. Monte-Carlo tree search as regularized policy optimization. In International Conference on Machine Learning, pages 3769–3778. PMLR, 2020.
53. Ryan B Hayward and Bjarne Toft. Hex: The Full Story. CRC Press, 2019.
54. He He, Jordan Boyd-Graber, Kevin Kwok, and Hal Daumé III. Opponent modeling in deep reinforcement learning. In International Conference on Machine Learning, pages 1804–1813. PMLR, 2016.
55. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
56. Ernst A Heinz. New self-play results in computer chess. In International Conference on Computers and Games, pages 262–276. Springer, 2000.
57. Athul Paul Jacob, David J Wu, Gabriele Farina, Adam Lerer, Anton Bakhtin, Jacob Andreas, and Noam Brown. Modeling strong and human-like gameplay with KL-regularized search. arXiv preprint arXiv:2112.07544, 2021.
58. John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A. A. Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W. Senior, Koray Kavukcuoglu, Pushmeet Kohli, and Demis Hassabis. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.
59. Niels Justesen, Ruben Rodriguez Torrado, Philip Bontrager, Ahmed Khalifa, Julian Togelius, and Sebastian Risi. Illuminating generalization in deep reinforcement learning through procedural level generation. arXiv preprint arXiv:1806.10729, 2018.
60. Donald E Knuth and Ronald W Moore. An analysis of alpha-beta pruning. Artificial Intelligence, 6(4):293–326, 1975.
61. Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293. Springer, 2006.
62. Richard E Korf. Depth-first iterative-deepening: An optimal admissible tree search. Artificial Intelligence, 27(1):97–109, 1985.
63. Sarit Kraus, Eithan Ephrati, and Daniel Lehmann. Negotiation in a non-cooperative environment. Journal of Experimental & Theoretical Artificial Intelligence, 3(4):255–281, 1994.
64. Kai A Krueger and Peter Dayan. Flexible shaping: How learning in small steps helps. Cognition, 110(3):380–394, 2009.
65. Jan Kuipers, Aske Plaat, Jos AM Vermaseren, and H Jaap van den Herik. Improving multivariate Horner schemes with Monte Carlo tree search. Computer Physics Communications, 184(11):2391–2395, 2013.
66. Alexandre Laterre, Yunguan Fu, Mohamed Khalil Jabri, Alain-Sam Cohen, David Kas, Karl Hajjar, Torbjorn S Dahl, Amine Kerkeni, and Karim Beguir. Ranked reward: Enabling self-play reinforcement learning for combinatorial optimization. arXiv preprint arXiv:1807.01672, 2018.
67. Joel Z Leibo, Edward Hughes, Marc Lanctot, and Thore Graepel. Autocurricula and the emergence of innovation from social interaction: A manifesto for multi-agent intelligence research. arXiv preprint arXiv:1903.00742, 2019.
68. Diego Pérez Liébana, Simon M Lucas, Raluca D Gaina, Julian Togelius, Ahmed Khalifa, and Jialin Liu. General video game artificial intelligence. Synthesis Lectures on Games and Computational Intelligence, 3(2):1–191, 2019.
69. Tambet Matiisen, Avital Oliver, Taco Cohen, and John Schulman. Teacher-student curriculum learning. IEEE Transactions on Neural Networks and Learning Systems, 31(9):3732–3740, 2020.
70. Kiminori Matsuzaki. Empirical analysis of PUCT algorithm with evaluation functions of different quality. In 2018 Conference on Technologies and Applications of Artificial Intelligence (TAAI), pages 142–147. IEEE, 2018.
71. Jonathan K Millen. Programming the game of Go. Byte Magazine, 1981.
72. S Ali Mirsoleimani, Aske Plaat, Jaap Van Den Herik, and Jos Vermaseren. Scaling Monte Carlo tree search on Intel Xeon Phi. In Parallel and Distributed Systems (ICPADS), 2015 IEEE 21st International Conference on, pages 666–673. IEEE, 2015.
73. Tom M Mitchell. The need for biases in learning generalizations. Technical Report CBM-TR-117, Department of Computer Science, Rutgers University, 1980.
74. Tom M Mitchell. The discipline of machine learning. Technical Report CMU-ML-06-108, Carnegie Mellon University, School of Computer Science, Machine Learning, 2006.
75. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
76. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
77. Thomas M Moerland, Joost Broekens, Aske Plaat, and Catholijn M Jonker. A0C: Alpha Zero in continuous action space. arXiv preprint arXiv:1805.09613, 2018.
78. Thomas M Moerland, Joost Broekens, Aske Plaat, and Catholijn M Jonker. Monte Carlo tree search for asymmetric trees. arXiv preprint arXiv:1805.09218, 2018.
79. Matthias Müller-Brockhausen, Mike Preuss, and Aske Plaat. Procedural content generation: Better benchmarks for transfer reinforcement learning. In Conference on Games, 2021.
80. Sanmit Narvekar, Bei Peng, Matteo Leonetti, Jivko Sinapov, Matthew E Taylor, and Peter Stone. Curriculum learning for reinforcement learning domains: A framework and survey. Journal of Machine Learning Research, 2020.
81. Yu Nasu. Efficiently updatable neural-network-based evaluation functions for computer shogi. The 28th World Computer Shogi Championship Appeal Document, 2018.
82. Frans A Oliehoek and Christopher Amato. A Concise Introduction to Decentralized POMDPs. Springer, 2016.
83. Pierre-Yves Oudeyer, Frederic Kaplan, and Verena V Hafner. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, 2007.
84. Giuseppe Davide Paparo, Vedran Dunjko, Adi Makmal, Miguel Angel Martin-Delgado, and Hans J Briegel. Quantum speedup for active learning agents. Physical Review X, 4(3):031002, 2014.
86. Judea Pearl. Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley, Reading, MA, 1984.
88. Aske Plaat, Jonathan Schaeffer, Wim Pijls, and Arie De Bruin. Best-first fixed-depth minimax algorithms. Artificial Intelligence, 87(1-2):255–293, 1996.
90. Max Pumperla and Kevin Ferguson. Deep Learning and the Game of Go. Manning, 2019.
91. J Ross Quinlan. Learning efficient classification procedures and their application to chess end games. In Machine Learning, pages 463–482. Springer, 1983.
92. Roberta Raileanu and Tim Rocktäschel. RIDE: Rewarding impact-driven exploration for procedurally-generated environments. In International Conference on Learning Representations, 2020.
93. Christopher D Rosin. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011.
94. Neil Rubens, Mehdi Elahi, Masashi Sugiyama, and Dain Kaplan. Active learning in recommender systems. In Recommender Systems Handbook, pages 809–846. Springer, 2015.
95. Ben Ruijl, Jos Vermaseren, Aske Plaat, and Jaap van den Herik. HEPGAME and the simplification of expressions. arXiv preprint arXiv:1405.6369, 2014.
97. Jonathan Schaeffer, Aske Plaat, and Andreas Junghanns. Unifying single-agent and two-player search. Information Sciences, 135(3-4):151–175, 2001.
98. Jürgen Schmidhuber. Curious model-building control systems. In Proceedings International Joint Conference on Neural Networks, pages 1458–1463, 1991.
99. Marwin HS Segler, Mike Preuss, and Mark P Waller. Planning chemical syntheses with deep neural networks and symbolic AI. Nature, 555(7698):604, 2018.
100. Oliver G Selfridge, Richard S Sutton, and Andrew G Barto. Training and tracking in robotics. In International Joint Conference on Artificial Intelligence, pages 670–672, 1985.
101. Andrew W. Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Zídek, Alexander W. R. Nelson, Alex Bridgland, Hugo Penedones, Stig Petersen, Karen Simonyan, Steve Crossan, Pushmeet Kohli, David T. Jones, David Silver, Koray Kavukcuoglu, and Demis Hassabis. Improved protein structure prediction using potentials from deep learning. Nature, 577(7792):706–710, 2020.
102. Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison, Department of Computer Sciences, 2009.
103. Noor Shaker, Julian Togelius, and Mark J Nelson. Procedural Content Generation in Games. Springer, 2016.
104. Guy Shani, Joelle Pineau, and Robert Kaplow. A survey of point-based POMDP solvers. Autonomous Agents and Multi-Agent Systems, 27(1):1–51, 2013.
105. Claude E Shannon. Programming a computer for playing chess. In Computer Chess Compendium, pages 2–13. Springer, 1988.
106. David Silver. Reinforcement learning and simulation-based search in the game of Go. PhD thesis, University of Alberta, 2009.
107. David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
108. David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
109. David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
110. David Silver, Richard S Sutton, and Martin Müller. Reinforcement learning of local shape in the game of Go. In International Joint Conference on Artificial Intelligence, volume 7, pages 1053–1058, 2007.
111. David J Slate and Lawrence R Atkin. Chess 4.5—The Northwestern University chess program. In Chess Skill in Man and Machine, pages 82–118. Springer, 1983.
112. Gillian Smith. An analog history of procedural content generation. In Foundations of Digital Games, 2015.
114. Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction, Second Edition. MIT Press, 2018.
115. Gerald Tesauro. Neurogammon wins Computer Olympiad. Neural Computation, 1(3):321–323, 1989.
116. Gerald Tesauro. TD-Gammon: A self-teaching backgammon program. In Applications of Neural Networks, pages 267–285. Springer, 1995.
117. Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.
118. Gerald Tesauro. Programming backgammon using self-teaching neural nets. Artificial Intelligence, 134(1-2):181–199, 2002.
119. Shantanu Thakoor, Surag Nair, and Megha Jhunjhunwala. Learning to play Othello without human knowledge. Stanford University CS238 Final Project Report, 2017.
120. Sebastian Thrun. Learning to play the game of chess. In Advances in Neural Information Processing Systems, pages 1069–1076, 1995.
121. Yuandong Tian, Qucheng Gong, Wenling Shang, Yuxin Wu, and C Lawrence Zitnick. ELF: An extensive, lightweight and flexible research platform for real-time strategy games. In Advances in Neural Information Processing Systems, pages 2659–2669, 2017.
123. Yuandong Tian and Yan Zhu. Better computer Go player with neural network and long-term prediction. In International Conference on Learning Representations, 2016.
124. Julian Togelius, Alex J Champandard, Pier Luca Lanzi, Michael Mateas, Ana Paiva, Mike Preuss, and Kenneth O Stanley. Procedural content generation: Goals, challenges and actionable steps. In Artificial and Computational Intelligence in Games. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2013.
125. Alan M Turing. Digital Computers Applied to Games. Pitman & Sons, 1953.
126. Michiel Van Der Ree and Marco Wiering. Reinforcement learning in the game of Othello: Learning against a fixed opponent and learning from self-play. In IEEE Adaptive Dynamic Programming and Reinforcement Learning, pages 108–115. IEEE, 2013.
127. Gerard JP Van Westen, Jörg K Wegner, Peggy Geluykens, Leen Kwanten, Inge Vereycken, Anik Peeters, Adriaan P IJzerman, Herman WT van Vlijmen, and Andreas Bender. Which compound to select in lead optimization? Prospectively validated proteochemometric models guide preclinical development. PLoS One, 6(11):e27518, 2011.
128. Jos AM Vermaseren. New features of FORM. arXiv preprint math-ph/0010025, 2000.
129. Hui Wang, Michael Emmerich, Mike Preuss, and Aske Plaat. Alternative loss functions in AlphaZero-like self-play. In 2019 IEEE Symposium Series on Computational Intelligence (SSCI), pages 155–162, 2019.
130. Hui Wang, Mike Preuss, Michael Emmerich, and Aske Plaat. Tackling Morpion Solitaire with AlphaZero-like Ranked Reward reinforcement learning. In 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2020, Timisoara, Romania, 2020.
131. Panqu Wang and Garrison W Cottrell. Basic level categorization facilitates visual object recognition. arXiv preprint arXiv:1511.04103, 2015.
132. Daphna Weinshall, Gad Cohen, and Dan Amir. Curriculum learning by transfer learning: Theory and experiments with deep networks. In International Conference on Machine Learning, pages 5235–5243, 2018.
134. Marco A Wiering. Self-play and using an expert to learn to play backgammon with temporal difference learning. JILSA, 2(2):57–68, 2010.
Metadata
Title
Two-Agent Self-Play
Author
Aske Plaat
Copyright Year
2022
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-19-0638-1_6
