05-01-2022

Provably Efficient Reinforcement Learning in Decentralized General-Sum Markov Games

Authors: Weichao Mao, Tamer Başar

Published in: Dynamic Games and Applications


Abstract

This paper addresses the problem of learning an equilibrium efficiently in general-sum Markov games through decentralized multi-agent reinforcement learning. Given the fundamental difficulty of calculating a Nash equilibrium (NE), we instead aim at finding a coarse correlated equilibrium (CCE), a solution concept that generalizes NE by allowing possible correlations among the agents’ strategies. We propose an algorithm in which each agent independently runs optimistic V-learning (a variant of Q-learning) to efficiently explore the unknown environment, while using a stabilized online mirror descent (OMD) subroutine for policy updates. We show that the agents can find an \(\epsilon\)-approximate CCE in at most \(\widetilde{O}(H^6 S A/\epsilon^2)\) episodes, where S is the number of states, A is the size of the largest individual action space, and H is the length of an episode. This appears to be the first sample complexity result for learning in generic general-sum Markov games. Our results rely on a novel anytime high-probability regret bound for OMD with a dynamic learning rate and weighted regret, which may be of independent interest. One key feature of our algorithm is that it is decentralized, in the sense that each agent has access only to its local information and is completely oblivious to the presence of others. This way, our algorithm can readily scale up to an arbitrary number of agents without suffering from an exponential dependence on the number of agents.
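
The abstract describes the method only in words. As a rough illustration, the Python sketch below shows what a single agent's per-step update could look like when optimistic V-learning is combined with an exponential-weights (entropy-regularized OMD) policy update. The class name, step size, bonus term, loss estimator, and learning-rate schedule are illustrative assumptions for this sketch, not the paper's exact algorithm.

import numpy as np

class DecentralizedVLearningAgent:
    """One agent's local learner: optimistic V-learning plus an
    exponential-weights (OMD) policy update. All quantities are local."""

    def __init__(self, num_states, num_actions, horizon, c_bonus=1.0):
        self.S, self.A, self.H = num_states, num_actions, horizon
        self.c_bonus = c_bonus
        # Optimistic value estimates; V_{H+1} is fixed to zero.
        self.V = np.zeros((horizon + 1, num_states))
        self.V[:horizon] = horizon
        # Visit counts and per-(step, state) policies, uniform at the start.
        self.N = np.zeros((horizon, num_states), dtype=int)
        self.policy = np.full((horizon, num_states, num_actions), 1.0 / num_actions)

    def act(self, h, s, rng):
        # Sample from the current local policy; nothing about other agents is used.
        return rng.choice(self.A, p=self.policy[h, s])

    def update(self, h, s, a, reward, next_s):
        # --- Optimistic V-learning step (Q-learning-style value update) ---
        self.N[h, s] += 1
        t = self.N[h, s]
        alpha = (self.H + 1) / (self.H + t)              # stage-based step size
        bonus = self.c_bonus * np.sqrt(self.H ** 3 / t)  # optimism bonus (placeholder)
        target = reward + self.V[h + 1, next_s] + bonus
        self.V[h, s] = min(self.H, (1 - alpha) * self.V[h, s] + alpha * target)

        # --- OMD / exponential-weights policy step on an estimated loss ---
        # Importance-weighted loss estimate for the chosen action only.
        loss = np.zeros(self.A)
        loss[a] = (self.H - reward - self.V[h + 1, next_s]) / (
            self.H * max(self.policy[h, s, a], 1e-8)
        )
        eta = np.sqrt(np.log(self.A) / (self.A * t))     # dynamic learning rate (placeholder)
        weights = self.policy[h, s] * np.exp(-eta * loss)
        self.policy[h, s] = weights / weights.sum()

In a full episodic loop, each agent would hold one such object and call act and update using only its own observed state, action, and reward; no agent needs to know the others' actions or even their existence, which is what allows the method to scale to many agents without a dependence on the joint action space.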


Footnotes
1
This setting has been studied under various names in the literature, including individual learning [23], decentralized learning [1], agnostic learning [41, 45], and independent learning [11, 13]. It also belongs to a more general category of teams/games with decentralized information structure [18, 29, 30].
 
2
A general policy [5] of agent i is a set of maps \(\pi^i = \{\pi^i_h : \mathbb{R}\times (\mathcal{S}\times \mathcal{A}^i\times \mathbb{R})^{h-1}\times \mathcal{S}\rightarrow \Delta(\mathcal{A}^i)\}_{h\in [H]}\) from a random variable \(z\in \mathbb{R}\) and a history of length \(h-1\) to a distribution of actions in \(\mathcal{A}^i\).
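For comparison (a standard fact about this setting, not stated in the footnote itself), a Markov policy is the special case that ignores both the random variable \(z\) and the history: a collection of maps \(\pi^i_h : \mathcal{S}\rightarrow \Delta(\mathcal{A}^i)\), \(h\in [H]\). The general-policy class above additionally allows history dependence and randomization through the shared variable \(z\).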
 
Literature
1. Arslan G, Yüksel S (2016) Decentralized Q-learning for stochastic teams and games. IEEE Trans Autom Control 62(4):1545–1558
2. Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (1995) Gambling in a rigged casino: the adversarial multi-armed bandit problem. In: Foundations of computer science. IEEE, pp 322–331
3. Aumann RJ (1987) Correlated equilibrium as an expression of Bayesian rationality. Econom J Econom Soc 55:1–18
4. Bai Y, Jin C (2020) Provable self-play algorithms for competitive reinforcement learning. In: International conference on machine learning, pp 551–560
5. Bai Y, Jin C, Yu T (2020) Near-optimal reinforcement learning with self-play. In: Advances in neural information processing systems, vol 33
6. Bernstein DS, Givan R, Immerman N, Zilberstein S (2002) The complexity of decentralized control of Markov decision processes. Math Oper Res 27(4):819–840
7. Brown N, Sandholm T (2018) Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 359(6374):418–424
8. Bubeck S et al (2015) Convex optimization: algorithms and complexity. Found Trends Mach Learn 8(3–4):231–357
9. Cesa-Bianchi N, Lugosi G (2006) Prediction, learning, and games. Cambridge University Press, Cambridge
10. Chen X, Deng X, Teng SH (2009) Settling the complexity of computing two-player Nash equilibria. J ACM 56(3):1–57
11. Claus C, Boutilier C (1998) The dynamics of reinforcement learning in cooperative multiagent systems. In: AAAI conference on artificial intelligence, pp 746–752
12. Daskalakis C, Goldberg PW, Papadimitriou CH (2009) The complexity of computing a Nash equilibrium. SIAM J Comput 39(1):195–259
13. Daskalakis C, Foster DJ, Golowich N (2020) Independent policy gradient methods for competitive reinforcement learning. In: Advances in neural information processing systems, vol 33
14. Fang H, Harvey N, Portella V, Friedlander M (2020) Online mirror descent and dual averaging: keeping pace in the dynamic case. In: International conference on machine learning, pp 3008–3017
15. Filar J, Vrieze K (2012) Competitive Markov decision processes. Springer, Berlin
16. Greenwald A, Hall K (2003) Correlated-Q learning. In: International conference on machine learning, pp 242–249
17. Hart S, Mas-Colell A (2000) A simple adaptive procedure leading to correlated equilibrium. Econometrica 68(5):1127–1150
18. Ho YC (1980) Team decision theory and information structures. Proc IEEE 68(6):644–654
19. Hu J, Wellman MP (2003) Nash Q-learning for general-sum stochastic games. J Mach Learn Res 4:1039–1069
20. Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J Mach Learn Res 11(4):1563–1600
21. Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? In: International conference on neural information processing systems, pp 4868–4878
22. Kober J, Bagnell JA, Peters J (2013) Reinforcement learning in robotics: a survey. Int J Robot Res 32(11):1238–1274
23.
24. Littman ML (1994) Markov games as a framework for multi-agent reinforcement learning. In: Machine learning, pp 157–163
25. Littman ML (2001) Friend-or-Foe Q-learning in general-sum games. In: International conference on machine learning, pp 322–328
26. Littman ML, Szepesvári C (1996) A generalized reinforcement-learning model: convergence and applications. In: International conference on machine learning, pp 310–318
27. Liu Q, Yu T, Bai Y, Jin C (2021) A sharp analysis of model-based reinforcement learning with self-play. In: International conference on machine learning. PMLR, pp 7001–7010
28. Moulin H, Vial JP (1978) Strategically zero-sum games: the class of games whose completely mixed equilibria cannot be improved upon. Int J Game Theory 7(3–4):201–221
29. Nayyar A, Gupta A, Langbort C, Başar T (2013) Common information based Markov perfect equilibria for stochastic games with asymmetric information: finite games. IEEE Trans Autom Control 59(3):555–570
30. Nayyar A, Mahajan A, Teneketzis D (2013) Decentralized stochastic control with partial history sharing: a common information approach. IEEE Trans Autom Control 58(7):1644–1658
31. Nemirovskij AS, Yudin DB (1983) Problem complexity and method efficiency in optimization. Wiley-Interscience, New York
32. Neu G (2015) Explore no more: improved high-probability regret bounds for non-stochastic bandits. Adv Neural Inf Process Syst 28:3168–3176
34.
35. Pérolat J, Strub F, Piot B, Pietquin O (2017) Learning Nash equilibrium for general-sum Markov games from batch data. In: Artificial intelligence and statistics. PMLR, pp 232–241
36. Prasad H, LA P, Bhatnagar S (2015) Two-timescale algorithms for learning Nash equilibria in general-sum stochastic games. In: International conference on autonomous agents and multiagent systems, pp 1371–1379
37.
39. Sidford A, Wang M, Yang L, Ye Y (2020) Solving discounted stochastic two-player games with near-optimal time and sample complexity. In: International conference on artificial intelligence and statistics. PMLR, pp 2992–3002
40. Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M et al (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489
41. Tian Y, Wang Y, Yu T, Sra S (2021) Online learning in unknown Markov games. In: International conference on machine learning
42. Vinyals O, Babuschkin I, Czarnecki WM, Mathieu M, Dudzik A, Chung J, Choi DH, Powell R, Ewalds T, Georgiev P et al (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782):350–354
43. Wang X, Sandholm T (2002) Reinforcement learning to play an optimal Nash equilibrium in team Markov games. Adv Neural Inf Process Syst 15:1603–1610
44. Wei CY, Hong YT, Lu CJ (2017) Online reinforcement learning in stochastic games. In: International conference on neural information processing systems, pp 4994–5004
45. Wei CY, Lee CW, Zhang M, Luo H (2021) Last-iterate convergence of decentralized optimistic gradient descent/ascent in infinite-horizon competitive Markov games. In: Annual conference on learning theory
46. Xie Q, Chen Y, Wang Z, Yang Z (2020) Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium. In: Conference on learning theory, pp 3674–3682
47. Yongacoglu B, Arslan G, Yüksel S (2019) Learning team-optimality for decentralized stochastic control and dynamic games. arXiv preprint arXiv:1903.05812
48. Yüksel S, Başar T (2013) Stochastic networked control systems: stabilization and optimization under information constraints. Springer, Berlin
50. Zhao Y, Tian Y, Lee JD, Du SS (2021) Provably efficient policy gradient methods for two-player zero-sum Markov games. arXiv preprint arXiv:2102.08903
51. Zinkevich M (2003) Online convex programming and generalized infinitesimal gradient ascent. In: International conference on machine learning, pp 928–936
Metadata
Title
Provably Efficient Reinforcement Learning in Decentralized General-Sum Markov Games
Authors
Weichao Mao
Tamer Başar
Publication date
05-01-2022
Publisher
Springer US
Published in
Dynamic Games and Applications
Print ISSN: 2153-0785
Electronic ISSN: 2153-0793
DOI
https://doi.org/10.1007/s13235-021-00420-0
