Published in: Dynamic Games and Applications 1/2023

21.01.2023

Robustness and Sample Complexity of Model-Based MARL for General-Sum Markov Games

Authors: Jayakumar Subramanian, Amit Sinha, Aditya Mahajan


Abstract

Multi-agent reinforcement learning (MARL) is often modeled using the framework of Markov games (also called stochastic games or dynamic games). Most of the existing literature on MARL concentrates on zero-sum Markov games but is not applicable to general-sum Markov games. It is known that the best response dynamics in general-sum Markov games are not a contraction. Therefore, different equilibria in general-sum Markov games can have different values. Moreover, the Q-function is not sufficient to completely characterize the equilibrium. Given these challenges, model-based learning is an attractive approach for MARL in general-sum Markov games. In this paper, we investigate the fundamental question of sample complexity for model-based MARL algorithms in general-sum Markov games. We show two results. We first use Hoeffding inequality-based bounds to show that \(\tilde{{\mathcal {O}}}( (1-\gamma )^{-4} \alpha ^{-2})\) samples per state–action pair are sufficient to obtain an \(\alpha \)-approximate Markov perfect equilibrium with high probability, where \(\gamma \) is the discount factor, and the \(\tilde{{\mathcal {O}}}(\cdot )\) notation hides logarithmic terms. We then use Bernstein inequality-based bounds to show that \(\tilde{{\mathcal {O}}}( (1-\gamma )^{-1} \alpha ^{-2} )\) samples are sufficient. To obtain these results, we study the robustness of Markov perfect equilibrium to model approximations. We show that the Markov perfect equilibrium of an approximate (or perturbed) game is always an approximate Markov perfect equilibrium of the original game and provide explicit bounds on the approximation error. We illustrate the results via a numerical example.
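The model-based (plug-in) approach studied in the paper estimates the transition kernel from samples drawn from a generative model and then computes a Markov perfect equilibrium of the estimated game. The following is a minimal sketch of the estimation step in Python; the generative-model oracle sampler(s, a), the function names, and the constants in the sample-size expression are illustrative assumptions, not the paper's exact algorithm or bounds.

import numpy as np

def hoeffding_sample_size(alpha, gamma, n_states, n_joint_actions, delta):
    # Illustrative Hoeffding-style count per state-(joint-)action pair, following the
    # O((1 - gamma)^{-4} alpha^{-2} log(.)) scaling from the paper; the constant and
    # the argument of the logarithm are placeholders, not the paper's expression.
    log_term = np.log(2.0 * n_states * n_joint_actions / delta)
    return int(np.ceil(log_term / ((1.0 - gamma) ** 4 * alpha ** 2)))

def estimate_transition_kernel(sampler, n_states, n_joint_actions, n_samples):
    # Plug-in (certainty-equivalent) estimate: for each state and joint action,
    # draw n_samples next states from the generative model and use the empirical
    # frequencies as the estimated kernel P_hat(. | s, a).
    p_hat = np.zeros((n_states, n_joint_actions, n_states))
    for s in range(n_states):
        for a in range(n_joint_actions):
            for _ in range(n_samples):
                s_next = sampler(s, a)  # generative-model oracle returning a state index
                p_hat[s, a, s_next] += 1.0
            p_hat[s, a] /= n_samples
    return p_hat

A Markov perfect equilibrium of the game defined by the true reward and the estimated kernel p_hat (computed by any MPE solver for general-sum Markov games) is then, by the paper's robustness result, an \(\alpha \)-approximate Markov perfect equilibrium of the true game with high probability.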


Footnotes
1
Reference [65] constructs a two-player general-sum game with the following properties. The game has two states: in state 1, player 1 has two actions and player 2 has one action; in state 2, player 1 has one action and player 2 has two actions. The transition probabilities are chosen such that there is a unique Markov perfect equilibrium in mixed strategies. This means that in state 1, both actions of player 1 maximize the Q-function; in state 2, both actions of player 2 minimize the Q-function. However, the Q-function by itself is insufficient to determine the randomizing probabilities of the mixed-strategy MPE.
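To see why the Q-function alone cannot pin down the mixing probabilities, recall the indifference condition for mixed equilibria in a two-player \(2 \times 2\) stage game (a generic illustration, not the specific construction of [65]): if player 2's payoff matrix is \(B\) (rows indexed by player 1's actions, columns by player 2's actions), then player 1's mixing probability \(p\) must make player 2 indifferent,
$$\begin{aligned} p B_{11} + (1-p) B_{21} = p B_{12} + (1-p) B_{22}, \end{aligned}$$
so \(p\) is determined by the opponent's payoffs \(B\). Player 1's own Q-function only certifies that both actions in its support are equally good and therefore cannot determine \(p\).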
 
2
The plug-in estimator is also known as a certainty equivalent controller in the stochastic control literature.
 
3
If \(\mu \) and \(\nu \) are absolutely continuous with respect to some measure \(\lambda \), and \(p = d\mu /d\lambda \) and \(q = d\nu /d\lambda \), then the total variation distance is typically defined as \(\tfrac{1}{2} \int _{\mathcal X}| p(x) - q(x)| \lambda (dx)\). This is consistent with our definition. Let \({{\bar{f}}} = ( \sup f + \inf f)/2\). Then
$$\begin{aligned}&\left| \int _{\mathcal X} f d\mu - \int _{\mathcal X} f d\nu \right| = \left| \int _{\mathcal X} f(x) p(x) \lambda (dx) - \int _{\mathcal X} f(x) q(x) \lambda (dx) \right| \\&\quad = \left| \int _{\mathcal X} \bigl [ f(x) - {{\bar{f}}} \bigr ] \bigl [ p(x) - q(x) \bigr ] \lambda (dx) \right| \le \Vert f - {{\bar{f}}} \Vert _{\infty } \int _{\mathcal X} \bigl | p(x) - q(x) \bigr | \lambda (dx)\\&\quad \le \tfrac{1}{2} {{\,\textrm{span}\,}}(f) \int _{\mathcal X} \bigl | p(x) - q(x) \bigr | \lambda (dx). \end{aligned}$$
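A quick numerical sanity check of this inequality on a finite space (the distributions and the test function below are arbitrary illustrative data):

import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(6))        # density of mu on a 6-point space
q = rng.dirichlet(np.ones(6))        # density of nu
f = rng.uniform(-3.0, 5.0, size=6)   # bounded test function

lhs = abs(f @ p - f @ q)                   # |int f dmu - int f dnu|
span_f = f.max() - f.min()                 # span(f) = sup f - inf f
rhs = 0.5 * span_f * np.abs(p - q).sum()   # (1/2) span(f) * int |p - q| dlambda

assert lhs <= rhs + 1e-12
print(lhs, rhs)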
 
4
For consistency with the normalized rewards considered in the game formulation (see Remark 1), we use normalized rewards for MDPs as well. Although most of the literature on MDPs uses unnormalized rewards, normalized rewards are commonly used in the literature on constrained MDPs [6].
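Concretely, assuming the normalization referred to in Remark 1 is the usual \((1-\gamma )\) scaling (an assumption made here for illustration), the normalized and unnormalized criteria are related by
$$\begin{aligned} V^{\pi }_{\mathrm {norm}}(s) = (1-\gamma ) \, {\mathbb {E}}^{\pi } \biggl [ \sum _{t=0}^{\infty } \gamma ^{t} r_{t} \biggm | s_{0} = s \biggr ] = (1-\gamma ) \, V^{\pi }_{\mathrm {unnorm}}(s), \end{aligned}$$
so rewards in \([0,1]\) give normalized values in \([0,1]\) rather than \([0, 1/(1-\gamma )]\).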
 
5
Recall that we are working with the normalized total expected reward (see Remark 1), while the results of [7] are derived for the unnormalized total reward. In the discussion above, we have normalized the results of [7].
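Under the \((1-\gamma )\) normalization, an \(\alpha \)-approximation of the normalized value corresponds to an \(\varepsilon = \alpha /(1-\gamma )\) approximation of the unnormalized value. For example, a per state–action sample complexity of order \((1-\gamma )^{-3}\varepsilon ^{-2}\) in the unnormalized setting (the Bernstein-type rate of [7]) translates to
$$\begin{aligned} \frac{1}{(1-\gamma )^{3}\varepsilon ^{2}} = \frac{(1-\gamma )^{2}}{(1-\gamma )^{3}\alpha ^{2}} = \frac{1}{(1-\gamma )\,\alpha ^{2}}, \end{aligned}$$
which is the \(\tilde{{\mathcal {O}}}((1-\gamma )^{-1}\alpha ^{-2})\) scaling quoted in the abstract.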
 
References
1.
Acemoglu D, Robinson JA (2001) A theory of political transitions. Am Econ Rev 91(4):938–963
2.
Agarwal A, Kakade S, Yang LF (2020) Model-based reinforcement learning with a generative model is minimax optimal. In: Conference on Learning Theory, pp 67–83. PMLR
4.
Akchurina N (2010) Multi-agent reinforcement learning algorithms. PhD thesis, University of Paderborn
6.
Altman E (1999) Constrained Markov decision processes: stochastic modeling. CRC Press, Boca Raton
7.
Azar MG, Munos R, Kappen HJ (2013) Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Mach Learn 91(3):325–349
8.
9.
Başar T, Bernhard P (2008) H-infinity optimal control and related minimax design problems: a dynamic game approach. Springer Science & Business Media, NY
10.
Başar T, Zaccour G (2018) Handbook of dynamic game theory. Springer International Publishing, NY
11.
Bertsekas DP (2017) Dynamic programming and optimal control. Athena Scientific, Belmont, MA
13.
Busoniu L, Babuska R, De Schutter B (2008) A comprehensive survey of multiagent reinforcement learning. IEEE Trans Syst Man Cybern Part C (Appl Rev) 38(2):156–172
14.
Cesa-Bianchi N, Lugosi G (2006) Prediction, learning, and games. Cambridge University Press, Cambridge
16.
Doraszelski U, Escobar JF (2010) A theory of regular Markov perfect equilibria in dynamic stochastic games: genericity, stability, and purification. Theor Econ 5(3):369–402
17.
Ericson R, Pakes A (1995) Markov-perfect industry dynamics: a framework for empirical work. Rev Econ Stud 62(1):53–82
18.
Fershtman C, Pakes A (2000) A dynamic oligopoly with collusion and price wars. RAND J Econ 31(2):207–236
19.
Filar J, Vrieze K (1996) Competitive Markov decision processes. Springer, New York, NY. ISBN 978-1-4612-8481-9, 978-1-4612-4054-9
20.
Filar JA, Schultz TA, Thuijsman F, Vrieze O (1991) Nonlinear programming and stationary equilibria in stochastic games. Math Program 50(1):227–237
22.
23.
Herings PJ-J, Peeters RJ et al (2004) Stationary equilibria in stochastic games: structure, selection, and computation. J Econ Theory 118(1):32–60
24.
Hinderer K (2005) Lipschitz continuity of value functions in Markovian decision processes. Math Methods Oper Res 62(1):3–22. ISSN 1432-5217
27.
Kakade SM (2003) On the sample complexity of reinforcement learning. PhD thesis, University College London
28.
Kearns M, Singh S (1999) Finite-sample convergence rates for Q-learning and indirect algorithms. Adv Neural Inf Process Syst 871:996–1002
31.
Li G, Wei Y, Chi Y, Gu Y, Chen Y (2020) Breaking the sample size barrier in model-based reinforcement learning with a generative model. Adv Neural Inf Process Syst 33:12861
32.
Littman ML (1994) Markov games as a framework for multi-agent reinforcement learning. In: International Conference on Machine Learning, pp 157–163. Elsevier
33.
Littman ML (2001) Value-function reinforcement learning in Markov games. Cognit Syst Res 2(1):55–66
34.
Mailath GJ, Samuelson L (2006) Repeated games and reputations: long-run relationships. Oxford University Press, Oxford
35.
Maskin E, Tirole J (1988) A theory of dynamic oligopoly, I: Overview and quantity competition with large fixed costs. Econometrica: J Econ Soc 549–569
36.
Maskin E, Tirole J (1988) A theory of dynamic oligopoly, II: Price competition, kinked demand curves, and Edgeworth cycles. Econometrica: J Econ Soc 571–599
38.
Müller A (1997) How does the value function of a Markov decision process depend on the transition probabilities? Math Oper Res 22(4):872–885
39.
40.
Pakes A, Ostrovsky M, Berry S (2007) Simple estimators for the parameters of discrete dynamic games (with entry/exit examples). RAND J Econ 38(2):373–399
41.
Pérolat J, Strub F, Piot B, Pietquin O (2017) Learning Nash equilibrium for general-sum Markov games from batch data. In: Artificial Intelligence and Statistics, pp 232–241. PMLR
42.
43.
Prasad H, LA P, Bhatnagar S (2015) Two-timescale algorithms for learning Nash equilibria in general-sum stochastic games. In: Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pp 1371–1379
44.
Rogers PD (1969) Nonzero-sum stochastic games. PhD thesis, University of California, Berkeley
45.
Sengupta S, Chowdhary A, Huang D, Kambhampati S (2019) General-sum Markov games for strategic detection of advanced persistent threats using moving target defense in cloud networks. In: International Conference on Decision and Game Theory for Security, pp 492–512. Springer
47.
Shoham Y, Powers R, Grenager T (2003) Multi-agent reinforcement learning: a critical survey. Technical report, Stanford University
48.
Sidford A, Wang M, Wu X, Yang LF, Ye Y (2018) Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 5192–5202
49.
Sidford A, Wang M, Yang L, Ye Y (2020) Solving discounted stochastic two-player games with near-optimal time and sample complexity. In: International Conference on Artificial Intelligence and Statistics, pp 2992–3002. PMLR
50.
Solan E (2021) A course in stochastic game theory. Cambridge University Press, Cambridge
52.
Sriperumbudur BK, Gretton A, Fukumizu K, Lanckriet GRG, Schölkopf B (2008) Injective Hilbert space embeddings of probability measures. In: Conference on Learning Theory
53.
Subramanian J, Sinha A, Seraj R, Mahajan A (2022) Approximate information state for approximate planning and reinforcement learning in partially observed systems. J Mach Learn Res 23:1–12
54.
Sutton RS (1990) Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In: International Conference on Machine Learning, pp 216–224. San Francisco, CA
55.
57.
Tidball MM, Pourtallier O, Altman E (1997) Approximations in dynamic zero-sum games II. SIAM J Control Optim 35(6):2101–2117
58.
Vrieze OJ (1987) Stochastic games with finite state and action spaces. CWI. ISBN 978-90-6196-313-4
60.
61.
Zhang K, Kakade S, Basar T, Yang L (2020) Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity. Adv Neural Inf Process Syst 33:1166
62.
Zhang K, Yang Z, Başar T (2021) Multi-agent reinforcement learning: a selective overview of theories and algorithms. In: Handbook of Reinforcement Learning and Control, pp 321–384
63.
Zhang R, Ren Z, Li N (2021) Gradient play in multi-agent Markov stochastic games: stationary points and convergence. arXiv e-prints, arXiv–2106
64.
Zhang W, Wang X, Shen J, Zhou M (2021) Model-based multi-agent policy optimization with adaptive opponent-wise rollouts. In: International Joint Conference on Artificial Intelligence, Montreal, Canada
65.
Zinkevich M, Greenwald A, Littman M (2006) Cyclic equilibria in Markov games. In: Neural Information Processing Systems, pp 1641–1648
Metadata
Title
Robustness and Sample Complexity of Model-Based MARL for General-Sum Markov Games
Authors
Jayakumar Subramanian
Amit Sinha
Aditya Mahajan
Publication date
21.01.2023
Publisher
Springer US
Published in
Dynamic Games and Applications / Issue 1/2023
Print ISSN: 2153-0785
Electronic ISSN: 2153-0793
DOI
https://doi.org/10.1007/s13235-023-00490-2
