05.01.2022

Provably Efficient Reinforcement Learning in Decentralized General-Sum Markov Games

Authors: Weichao Mao, Tamer Başar

Published in: Dynamic Games and Applications

Abstract

This paper addresses the problem of learning an equilibrium efficiently in general-sum Markov games through decentralized multi-agent reinforcement learning. Given the fundamental difficulty of calculating a Nash equilibrium (NE), we instead aim at finding a coarse correlated equilibrium (CCE), a solution concept that generalizes NE by allowing possible correlations among the agents’ strategies. We propose an algorithm in which each agent independently runs optimistic V-learning (a variant of Q-learning) to efficiently explore the unknown environment, while using a stabilized online mirror descent (OMD) subroutine for policy updates. We show that the agents can find an \(\epsilon\)-approximate CCE in at most \(\widetilde{O}(H^6 S A / \epsilon^2)\) episodes, where S is the number of states, A is the size of the largest individual action space, and H is the length of an episode. This appears to be the first sample complexity result for learning in generic general-sum Markov games. Our results rely on a novel investigation of an anytime high-probability regret bound for OMD with a dynamic learning rate and weighted regret, which would be of independent interest. One key feature of our algorithm is that it is decentralized, in the sense that each agent has access to only its local information, and is completely oblivious to the presence of others. This way, our algorithm can readily scale up to an arbitrary number of agents, without suffering from the exponential dependence on the number of agents.
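
Since the algorithm is only described here at a high level, a minimal sketch may help fix ideas. The snippet below is a hypothetical Python illustration, not the authors' exact method: it combines a stage-wise optimistic value update with an entropy-regularized OMD (exponential-weights) policy step on an importance-weighted loss, which is the structure the abstract describes. The sizes S, A, H, the name bonus_scale, and the specific bonus and learning-rate formulas are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical sketch of one agent's per-step update in decentralized
# optimistic V-learning with an OMD policy subroutine (illustrative constants).

S, A, H = 10, 4, 5                      # states, local actions, episode length (assumed)
V = np.zeros((H + 1, S))
V[:H] = H                               # optimistic initialization; V[H] stays 0
policy = np.full((H, S, A), 1.0 / A)    # per-step local policies on the simplex
visits = np.zeros((H, S), dtype=int)    # visit counts per (step, state)
bonus_scale = 1.0                       # exploration-bonus constant (hypothetical)

def omd_step(pi, loss, lr):
    """Entropy-regularized OMD (exponential-weights) step on the simplex."""
    pi_new = pi * np.exp(-lr * loss)
    return pi_new / pi_new.sum()

def update(h, s, a, r, s_next):
    """Local update after the agent observes (s, a, r, s') at step h."""
    visits[h, s] += 1
    t = visits[h, s]
    alpha = (H + 1) / (H + t)                  # stage-dependent step size
    bonus = bonus_scale * np.sqrt(H ** 3 / t)  # optimism bonus (schematic form)
    target = r + V[h + 1, s_next] + bonus
    V[h, s] = (1 - alpha) * V[h, s] + alpha * min(target, H)
    # Importance-weighted loss estimate for the action actually taken.
    loss = np.zeros(A)
    loss[a] = (H - r - V[h + 1, s_next]) / max(policy[h, s, a], 1e-8)
    lr = np.sqrt(np.log(A) / (A * t))          # dynamic OMD learning rate
    policy[h, s] = omd_step(policy[h, s], loss, lr)
```

Each agent would run such an update on its own local observations only, with no knowledge of the other agents, mirroring the decentralized and oblivious design highlighted above.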

Footnotes
1
This setting has been studied under various names in the literature, including individual learning [23], decentralized learning [1], agnostic learning [41, 45], and independent learning [11, 13]. It also belongs to a more general category of teams/games with decentralized information structure [18, 29, 30].
 
2
A general policy [5] of agent i is a set of maps \(\pi^i = \{\pi^i_h : \mathbb{R}\times (\mathcal{S}\times \mathcal{A}^i\times \mathbb{R})^{h-1}\times \mathcal{S}\rightarrow \Delta(\mathcal{A}^i)\}_{h\in [H]}\) from a random variable \(z\in \mathbb{R}\) and a history of length \(h-1\) to a distribution over actions in \(\mathcal{A}^i\).
 
References
1. Arslan G, Yüksel S (2016) Decentralized Q-learning for stochastic teams and games. IEEE Trans Autom Control 62(4):1545–1558
2. Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (1995) Gambling in a rigged casino: the adversarial multi-armed bandit problem. In: Foundations of computer science. IEEE, pp 322–331
3. Aumann RJ (1987) Correlated equilibrium as an expression of Bayesian rationality. Econom J Econom Soc 55:1–18
4. Bai Y, Jin C (2020) Provable self-play algorithms for competitive reinforcement learning. In: International conference on machine learning, pp 551–560
5. Bai Y, Jin C, Yu T (2020) Near-optimal reinforcement learning with self-play. In: Advances in neural information processing systems, vol 33
6. Bernstein DS, Givan R, Immerman N, Zilberstein S (2002) The complexity of decentralized control of Markov decision processes. Math Oper Res 27(4):819–840
7. Brown N, Sandholm T (2018) Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 359(6374):418–424
8. Bubeck S et al (2015) Convex optimization: algorithms and complexity. Found Trends Mach Learn 8(3–4):231–357
9. Cesa-Bianchi N, Lugosi G (2006) Prediction, learning, and games. Cambridge University Press, Cambridge
10. Chen X, Deng X, Teng SH (2009) Settling the complexity of computing two-player Nash equilibria. J ACM 56(3):1–57
11. Claus C, Boutilier C (1998) The dynamics of reinforcement learning in cooperative multiagent systems. In: AAAI conference on artificial intelligence, pp 746–752
12. Daskalakis C, Goldberg PW, Papadimitriou CH (2009) The complexity of computing a Nash equilibrium. SIAM J Comput 39(1):195–259
13. Daskalakis C, Foster DJ, Golowich N (2020) Independent policy gradient methods for competitive reinforcement learning. In: Advances in neural information processing systems, vol 33
14. Fang H, Harvey N, Portella V, Friedlander M (2020) Online mirror descent and dual averaging: keeping pace in the dynamic case. In: International conference on machine learning, pp 3008–3017
15. Filar J, Vrieze K (2012) Competitive Markov decision processes. Springer, Berlin
16. Greenwald A, Hall K (2003) Correlated-Q learning. In: International conference on machine learning, pp 242–249
17. Hart S, Mas-Colell A (2000) A simple adaptive procedure leading to correlated equilibrium. Econometrica 68(5):1127–1150
18. Ho YC (1980) Team decision theory and information structures. Proc IEEE 68(6):644–654
19. Hu J, Wellman MP (2003) Nash Q-learning for general-sum stochastic games. J Mach Learn Res 4:1039–1069
20. Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J Mach Learn Res 11(4):1563–1600
21. Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? In: International conference on neural information processing systems, pp 4868–4878
22. Kober J, Bagnell JA, Peters J (2013) Reinforcement learning in robotics: a survey. Int J Robot Res 32(11):1238–1274
23. Leslie DS, Collins EJ (2005) Individual Q-learning in normal form games. SIAM J Control Optim 44(2):495–514
24. Littman ML (1994) Markov games as a framework for multi-agent reinforcement learning. In: Machine learning, pp 157–163
25. Littman ML (2001) Friend-or-Foe Q-learning in general-sum games. In: International conference on machine learning, pp 322–328
26. Littman ML, Szepesvári C (1996) A generalized reinforcement-learning model: convergence and applications. In: International conference on machine learning, pp 310–318
27. Liu Q, Yu T, Bai Y, Jin C (2021) A sharp analysis of model-based reinforcement learning with self-play. In: International conference on machine learning. PMLR, pp 7001–7010
28. Moulin H, Vial JP (1978) Strategically zero-sum games: the class of games whose completely mixed equilibria cannot be improved upon. Int J Game Theory 7(3–4):201–221
29. Nayyar A, Gupta A, Langbort C, Başar T (2013) Common information based Markov perfect equilibria for stochastic games with asymmetric information: finite games. IEEE Trans Autom Control 59(3):555–570
30. Nayyar A, Mahajan A, Teneketzis D (2013) Decentralized stochastic control with partial history sharing: a common information approach. IEEE Trans Autom Control 58(7):1644–1658
31. Nemirovskij AS, Yudin DB (1983) Problem complexity and method efficiency in optimization. Wiley-Interscience, New York
32. Neu G (2015) Explore no more: improved high-probability regret bounds for non-stochastic bandits. Adv Neural Inf Process Syst 28:3168–3176
34. Papadimitriou CH, Roughgarden T (2008) Computing correlated equilibria in multi-player games. J ACM 55(3):1–29
35. Pérolat J, Strub F, Piot B, Pietquin O (2017) Learning Nash equilibrium for general-sum Markov games from batch data. In: Artificial intelligence and statistics. PMLR, pp 232–241
36. Prasad H, LA P, Bhatnagar S (2015) Two-timescale algorithms for learning Nash equilibria in general-sum stochastic games. In: International conference on autonomous agents and multiagent systems, pp 1371–1379
37. Shalev-Shwartz S, Shammah S, Shashua A (2016) Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295
39. Sidford A, Wang M, Yang L, Ye Y (2020) Solving discounted stochastic two-player games with near-optimal time and sample complexity. In: International conference on artificial intelligence and statistics. PMLR, pp 2992–3002
40. Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M et al (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489
41. Tian Y, Wang Y, Yu T, Sra S (2021) Online learning in unknown Markov games. In: International conference on machine learning
42. Vinyals O, Babuschkin I, Czarnecki WM, Mathieu M, Dudzik A, Chung J, Choi DH, Powell R, Ewalds T, Georgiev P et al (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782):350–354
43. Wang X, Sandholm T (2002) Reinforcement learning to play an optimal Nash equilibrium in team Markov games. Adv Neural Inf Process Syst 15:1603–1610
44. Wei CY, Hong YT, Lu CJ (2017) Online reinforcement learning in stochastic games. In: International conference on neural information processing systems, pp 4994–5004
45. Wei CY, Lee CW, Zhang M, Luo H (2021) Last-iterate convergence of decentralized optimistic gradient descent/ascent in infinite-horizon competitive Markov games. In: Annual conference on learning theory
46. Xie Q, Chen Y, Wang Z, Yang Z (2020) Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium. In: Conference on learning theory, pp 3674–3682
47. Yongacoglu B, Arslan G, Yüksel S (2019) Learning team-optimality for decentralized stochastic control and dynamic games. arXiv preprint arXiv:1903.05812
48. Yüksel S, Başar T (2013) Stochastic networked control systems: stabilization and optimization under information constraints. Springer, Berlin
50. Zhao Y, Tian Y, Lee JD, Du SS (2021) Provably efficient policy gradient methods for two-player zero-sum Markov games. arXiv preprint arXiv:2102.08903
51. Zinkevich M (2003) Online convex programming and generalized infinitesimal gradient ascent. In: International conference on machine learning, pp 928–936
Metadata
Title
Provably Efficient Reinforcement Learning in Decentralized General-Sum Markov Games
Authors
Weichao Mao
Tamer Başar
Publication date
05.01.2022
Publisher
Springer US
Published in
Dynamic Games and Applications
Print ISSN: 2153-0785
Electronic ISSN: 2153-0793
DOI
https://doi.org/10.1007/s13235-021-00420-0
