
Publication date: 05-01-2022

# Provably Efficient Reinforcement Learning in Decentralized General-Sum Markov Games

Authors: Weichao Mao, Tamer Başar

Published in: Dynamic Games and Applications

## Abstract

This paper addresses the problem of learning an equilibrium efficiently in general-sum Markov games through decentralized multi-agent reinforcement learning. Given the fundamental difficulty of calculating a Nash equilibrium (NE), we instead aim at finding a coarse correlated equilibrium (CCE), a solution concept that generalizes NE by allowing possible correlations among the agents’ strategies. We propose an algorithm in which each agent independently runs optimistic V-learning (a variant of Q-learning) to efficiently explore the unknown environment, while using a stabilized online mirror descent (OMD) subroutine for policy updates. We show that the agents can find an $$\epsilon$$-approximate CCE in at most $$\widetilde{O}( H^6S A /\epsilon ^2)$$ episodes, where S is the number of states, A is the size of the largest individual action space, and H is the length of an episode. This appears to be the first sample complexity result for learning in generic general-sum Markov games. Our results rely on a novel investigation of an anytime high-probability regret bound for OMD with a dynamic learning rate and weighted regret, which would be of independent interest. One key feature of our algorithm is that it is decentralized, in the sense that each agent has access to only its local information, and is completely oblivious to the presence of others. This way, our algorithm can readily scale up to an arbitrary number of agents, without suffering from the exponential dependence on the number of agents.
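The policy-update subroutine the abstract describes, online mirror descent with an entropy regularizer (i.e., exponential weights) and a dynamic learning rate, can be illustrated in a single-state toy sketch. This is only a hedged illustration, not the paper's full optimistic V-learning algorithm: the function name, the fixed loss vector, and the $$1/\sqrt{t}$$ learning-rate schedule are assumptions made for the example.

```python
import numpy as np

def omd_exp_weights_update(policy, loss, lr):
    """One entropy-regularized OMD (exponential-weights) step.

    policy: current distribution over actions (1-D array summing to 1)
    loss:   estimated per-action loss vector for this round
    lr:     learning rate for this round (dynamic in the paper)
    """
    logits = np.log(policy) - lr * loss
    w = np.exp(logits - logits.max())  # subtract max for numerical stability
    return w / w.sum()

# Toy run: action 0 is repeatedly penalized, so its probability decays
# while the two symmetric actions keep equal mass.
pi = np.ones(3) / 3
for t in range(1, 201):
    lr = 1.0 / np.sqrt(t)              # decaying (dynamic) learning rate
    loss = np.array([1.0, 0.2, 0.2])
    pi = omd_exp_weights_update(pi, loss, lr)
```

Because the update only rescales the previous iterate multiplicatively and then renormalizes, the policy stays a valid distribution at every round, which is what makes this family of updates convenient for the anytime regret analysis the abstract mentions.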


## Footnotes

1. This setting has been studied under various names in the literature, including individual learning [23], decentralized learning [1], agnostic learning [41, 45], and independent learning [11, 13]. It also belongs to a more general category of teams/games with decentralized information structure [18, 29, 30].

2. A general policy [5] of agent i is a set of maps $$\pi ^i = \{\pi ^i_h : \mathbb {R}\times (\mathcal {S}\times \mathcal {A}^i\times \mathbb {R})^{h-1}\times \mathcal {S}\rightarrow \Delta (\mathcal {A}^i)\}_{h\in [H]}$$ from a random variable $$z\in \mathbb {R}$$ and a history of length $$h-1$$ to a distribution over actions in $$\mathcal {A}^i$$.

## Literature

1. Arslan G, Yüksel S (2016) Decentralized Q-learning for stochastic teams and games. IEEE Trans Autom Control 62(4):1545–1558
2. Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (1995) Gambling in a rigged casino: the adversarial multi-armed bandit problem. In: Foundations of computer science. IEEE, pp 322–331
3. Aumann RJ (1987) Correlated equilibrium as an expression of Bayesian rationality. Econom J Econom Soc 55:1–18
4. Bai Y, Jin C (2020) Provable self-play algorithms for competitive reinforcement learning. In: International conference on machine learning, pp 551–560
5. Bai Y, Jin C, Yu T (2020) Near-optimal reinforcement learning with self-play. In: Advances in neural information processing systems, vol 33
6. Bernstein DS, Givan R, Immerman N, Zilberstein S (2002) The complexity of decentralized control of Markov decision processes. Math Oper Res 27(4):819–840
7. Brown N, Sandholm T (2018) Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 359(6374):418–424
8. Bubeck S et al (2015) Convex optimization: algorithms and complexity. Found Trends Mach Learn 8(3–4):231–357
9. Cesa-Bianchi N, Lugosi G (2006) Prediction, learning, and games. Cambridge University Press, Cambridge
10. Chen X, Deng X, Teng SH (2009) Settling the complexity of computing two-player Nash equilibria. J ACM 56(3):1–57
11. Claus C, Boutilier C (1998) The dynamics of reinforcement learning in cooperative multiagent systems. In: AAAI conference on artificial intelligence, pp 746–752
12. Daskalakis C, Goldberg PW, Papadimitriou CH (2009) The complexity of computing a Nash equilibrium. SIAM J Comput 39(1):195–259
13. Daskalakis C, Foster DJ, Golowich N (2020) Independent policy gradient methods for competitive reinforcement learning. In: Advances in neural information processing systems, vol 33
14. Fang H, Harvey N, Portella V, Friedlander M (2020) Online mirror descent and dual averaging: keeping pace in the dynamic case. In: International conference on machine learning, pp 3008–3017
15. Filar J, Vrieze K (2012) Competitive Markov decision processes. Springer, Berlin
16. Greenwald A, Hall K (2003) Correlated-Q learning. In: International conference on machine learning, pp 242–249
17. Hart S, Mas-Colell A (2000) A simple adaptive procedure leading to correlated equilibrium. Econometrica 68(5):1127–1150
18. Ho YC (1980) Team decision theory and information structures. Proc IEEE 68(6):644–654
19. Hu J, Wellman MP (2003) Nash Q-learning for general-sum stochastic games. J Mach Learn Res 4:1039–1069
20. Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J Mach Learn Res 11(4):1563–1600
21. Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? In: International conference on neural information processing systems, pp 4868–4878
22. Kober J, Bagnell JA, Peters J (2013) Reinforcement learning in robotics: a survey. Int J Robot Res 32(11):1238–1274
23. Leslie DS, Collins EJ (2005) Individual Q-learning in normal form games. SIAM J Control Optim 44(2):495–514
24. Littman ML (1994) Markov games as a framework for multi-agent reinforcement learning. In: Machine learning, pp 157–163
25. Littman ML (2001) Friend-or-Foe Q-learning in general-sum games. In: International conference on machine learning, pp 322–328
26. Littman ML, Szepesvári C (1996) A generalized reinforcement-learning model: convergence and applications. In: International conference on machine learning, pp 310–318
27. Liu Q, Yu T, Bai Y, Jin C (2021) A sharp analysis of model-based reinforcement learning with self-play. In: International conference on machine learning. PMLR, pp 7001–7010
28. Moulin H, Vial JP (1978) Strategically zero-sum games: the class of games whose completely mixed equilibria cannot be improved upon. Int J Game Theory 7(3–4):201–221
29. Nayyar A, Gupta A, Langbort C, Başar T (2013) Common information based Markov perfect equilibria for stochastic games with asymmetric information: finite games. IEEE Trans Autom Control 59(3):555–570
30. Nayyar A, Mahajan A, Teneketzis D (2013) Decentralized stochastic control with partial history sharing: a common information approach. IEEE Trans Autom Control 58(7):1644–1658
31. Nemirovskij AS, Yudin DB (1983) Problem complexity and method efficiency in optimization. Wiley-Interscience, New York
32. Neu G (2015) Explore no more: improved high-probability regret bounds for non-stochastic bandits. Adv Neural Inf Process Syst 28:3168–3176
33. Orabona F, Pál D (2018) Scale-free online learning. Theor Comput Sci 716:50–69
34. Papadimitriou CH, Roughgarden T (2008) Computing correlated equilibria in multi-player games. J ACM 55(3):1–29
35. Pérolat J, Strub F, Piot B, Pietquin O (2017) Learning Nash equilibrium for general-sum Markov games from batch data. In: Artificial intelligence and statistics. PMLR, pp 232–241
36. Prasad HL, Prashanth LA, Bhatnagar S (2015) Two-timescale algorithms for learning Nash equilibria in general-sum stochastic games. In: International conference on autonomous agents and multiagent systems, pp 1371–1379
37. Shalev-Shwartz S, Shammah S, Shashua A (2016) Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295
38. Shapley LS (1953) Stochastic games. Proc Natl Acad Sci 39(10):1095–1100
39. Sidford A, Wang M, Yang L, Ye Y (2020) Solving discounted stochastic two-player games with near-optimal time and sample complexity. In: International conference on artificial intelligence and statistics. PMLR, pp 2992–3002
40. Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M et al (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489
41. Tian Y, Wang Y, Yu T, Sra S (2021) Online learning in unknown Markov games. In: International conference on machine learning
42. Vinyals O, Babuschkin I, Czarnecki WM, Mathieu M, Dudzik A, Chung J, Choi DH, Powell R, Ewalds T, Georgiev P et al (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782):350–354
43. Wang X, Sandholm T (2002) Reinforcement learning to play an optimal Nash equilibrium in team Markov games. Adv Neural Inf Process Syst 15:1603–1610
44. Wei CY, Hong YT, Lu CJ (2017) Online reinforcement learning in stochastic games. In: International conference on neural information processing systems, pp 4994–5004
45. Wei CY, Lee CW, Zhang M, Luo H (2021) Last-iterate convergence of decentralized optimistic gradient descent/ascent in infinite-horizon competitive Markov games. In: Annual conference on learning theory
46. Xie Q, Chen Y, Wang Z, Yang Z (2020) Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium. In: Conference on learning theory, pp 3674–3682
47. Yongacoglu B, Arslan G, Yüksel S (2019) Learning team-optimality for decentralized stochastic control and dynamic games. arXiv preprint arXiv:1903.05812
48. Yüksel S, Başar T (2013) Stochastic networked control systems: stabilization and optimization under information constraints. Springer, Berlin
49. Zehfroosh A, Tanner HG (2020) PAC reinforcement learning algorithm for general-sum Markov games. arXiv preprint arXiv:2009.02605
50. Zhao Y, Tian Y, Lee JD, Du SS (2021) Provably efficient policy gradient methods for two-player zero-sum Markov games. arXiv preprint arXiv:2102.08903
51. Zinkevich M (2003) Online convex programming and generalized infinitesimal gradient ascent. In: International conference on machine learning, pp 928–936
Publisher: Springer US

Print ISSN: 2153-0785

Electronic ISSN: 2153-0793

DOI: https://doi.org/10.1007/s13235-021-00420-0