Published in: Dynamic Games and Applications 1/2023

21.01.2023

Robustness and Sample Complexity of Model-Based MARL for General-Sum Markov Games

Authors: Jayakumar Subramanian, Amit Sinha, Aditya Mahajan


Abstract

Multi-agent reinforcement learning (MARL) is often modeled using the framework of Markov games (also called stochastic games or dynamic games). Most of the existing literature on MARL concentrates on zero-sum Markov games but is not applicable to general-sum Markov games. It is known that the best response dynamics in general-sum Markov games are not a contraction. Therefore, different equilibria in general-sum Markov games can have different values. Moreover, the Q-function is not sufficient to completely characterize the equilibrium. Given these challenges, model-based learning is an attractive approach for MARL in general-sum Markov games. In this paper, we investigate the fundamental question of sample complexity for model-based MARL algorithms in general-sum Markov games. We show two results. We first use Hoeffding inequality-based bounds to show that \(\tilde{{\mathcal {O}}}( (1-\gamma )^{-4} \alpha ^{-2})\) samples per state–action pair are sufficient to obtain an \(\alpha \)-approximate Markov perfect equilibrium with high probability, where \(\gamma \) is the discount factor, and the \(\tilde{{\mathcal {O}}}(\cdot )\) notation hides logarithmic terms. We then use Bernstein inequality-based bounds to show that \(\tilde{{\mathcal {O}}}( (1-\gamma )^{-1} \alpha ^{-2} )\) samples are sufficient. To obtain these results, we study the robustness of Markov perfect equilibrium to model approximations. We show that the Markov perfect equilibrium of an approximate (or perturbed) game is always an approximate Markov perfect equilibrium of the original game and provide explicit bounds on the approximation error. We illustrate the results via a numerical example.
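The model-based (plug-in) approach studied in the paper estimates the transition kernel from samples drawn from a generative model and then computes a Markov perfect equilibrium of the estimated game. The following is a minimal sketch of the estimation step in Python; the generative-model oracle sampler(s, a), the function names, and the constants in the sample-size expression are illustrative assumptions, not the paper's exact algorithm or bounds.

import numpy as np

def hoeffding_sample_size(alpha, gamma, n_states, n_joint_actions, delta):
    # Illustrative Hoeffding-style count per state-(joint-)action pair, following the
    # O((1 - gamma)^{-4} alpha^{-2} log(.)) scaling from the paper; the constant and
    # the argument of the logarithm are placeholders, not the paper's expression.
    log_term = np.log(2.0 * n_states * n_joint_actions / delta)
    return int(np.ceil(log_term / ((1.0 - gamma) ** 4 * alpha ** 2)))

def estimate_transition_kernel(sampler, n_states, n_joint_actions, n_samples):
    # Plug-in (certainty-equivalent) estimate: for each state and joint action,
    # draw n_samples next states from the generative model and use the empirical
    # frequencies as the estimated kernel P_hat(. | s, a).
    p_hat = np.zeros((n_states, n_joint_actions, n_states))
    for s in range(n_states):
        for a in range(n_joint_actions):
            for _ in range(n_samples):
                s_next = sampler(s, a)  # generative-model oracle returning a state index
                p_hat[s, a, s_next] += 1.0
            p_hat[s, a] /= n_samples
    return p_hat

A Markov perfect equilibrium of the game defined by the true reward and the estimated kernel p_hat (computed by any MPE solver for general-sum Markov games) is then, by the paper's robustness result, an \(\alpha \)-approximate Markov perfect equilibrium of the true game with high probability.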


Footnotes
1
Reference [65] constructs a two-player general-sum game with the following properties. The game has two states: in state 1, player 1 has two actions and player 2 has one action; in state 2, player 1 has one action and player 2 has two actions. The transition probabilities are chosen such that there is a unique Markov perfect equilibrium in mixed strategies. This means that in state 1, both actions of player 1 maximize the Q-function; in state 2, both actions of player 2 minimize the Q-function. However, the Q-function by itself is insufficient to determine the randomizing probabilities of the mixed-strategy MPE.
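To see why the Q-function alone cannot pin down the mixing probabilities, recall the indifference condition for mixed equilibria in a two-player \(2 \times 2\) stage game (a generic illustration, not the specific construction of [65]): if player 2's payoff matrix is \(B\) (rows indexed by player 1's actions, columns by player 2's actions), then player 1's mixing probability \(p\) must make player 2 indifferent,
$$\begin{aligned} p B_{11} + (1-p) B_{21} = p B_{12} + (1-p) B_{22}, \end{aligned}$$
so \(p\) is determined by the opponent's payoffs \(B\). Player 1's own Q-function only certifies that both actions in its support are equally good and therefore cannot determine \(p\).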
 
2
The plug-in estimator is also known as a certainty equivalent controller in the stochastic control literature.
 
3
If \(\mu \) and \(\nu \) are absolutely continuous with respect to some measure \(\lambda \), and \(p = d\mu /d\lambda \) and \(q = d\nu /d\lambda \), then the total variation distance is typically defined as \(\tfrac{1}{2} \int _{\mathcal X}| p(x) - q(x)| \lambda (dx)\). This is consistent with our definition. Let \({{\bar{f}}} = ( \sup f + \inf f)/2\). Then
$$\begin{aligned}&\left| \int _{\mathcal X} f d\mu - \int _{\mathcal X} f d\nu \right| = \left| \int _{\mathcal X} f(x) p(x) \lambda (dx) - \int _{\mathcal X} f(x) q(x) \lambda (dx) \right| \\&\quad = \left| \int _{\mathcal X} \bigl [ f(x) - {{\bar{f}}} \bigr ] \bigl [ p(x) - q(x) \bigr ] \lambda (dx) \right| \le \Vert f - {{\bar{f}}} \Vert _{\infty } \int _{\mathcal X} \bigl | p(x) - q(x) \bigr | \lambda (dx)\\&\quad \le \tfrac{1}{2} {{\,\textrm{span}\,}}(f) \int _{\mathcal X} \bigl | p(x) - q(x) \bigr | \lambda (dx). \end{aligned}$$
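A quick numerical sanity check of this inequality on a finite space (the distributions and the test function below are arbitrary illustrative data):

import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(6))        # density of mu on a 6-point space
q = rng.dirichlet(np.ones(6))        # density of nu
f = rng.uniform(-3.0, 5.0, size=6)   # bounded test function

lhs = abs(f @ p - f @ q)                   # |int f dmu - int f dnu|
span_f = f.max() - f.min()                 # span(f) = sup f - inf f
rhs = 0.5 * span_f * np.abs(p - q).sum()   # (1/2) span(f) * int |p - q| dlambda

assert lhs <= rhs + 1e-12
print(lhs, rhs)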
 
4
For consistency with the normalized rewards considered in the game formulation (see Remark 1), we use normalized rewards for MDPs as well. Although most of the literature on MDPs uses unnormalized rewards, normalized rewards are commonly used in the literature on constrained MDPs [6].
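Concretely, assuming the normalization referred to in Remark 1 is the usual \((1-\gamma )\) scaling (an assumption made here for illustration), the normalized and unnormalized criteria are related by
$$\begin{aligned} V^{\pi }_{\mathrm {norm}}(s) = (1-\gamma ) \, {\mathbb {E}}^{\pi } \biggl [ \sum _{t=0}^{\infty } \gamma ^{t} r_{t} \biggm | s_{0} = s \biggr ] = (1-\gamma ) \, V^{\pi }_{\mathrm {unnorm}}(s), \end{aligned}$$
so rewards in \([0,1]\) give normalized values in \([0,1]\) rather than \([0, 1/(1-\gamma )]\).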
 
5
Recall that we are working with the normalized total expected reward (see Remark 1), while the results of [7] are derived for the unnormalized total reward. In the discussion above, we have normalized the results of [7].
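Under the \((1-\gamma )\) normalization, an \(\alpha \)-approximation of the normalized value corresponds to an \(\varepsilon = \alpha /(1-\gamma )\) approximation of the unnormalized value. For example, a per state–action sample complexity of order \((1-\gamma )^{-3}\varepsilon ^{-2}\) in the unnormalized setting (the Bernstein-type rate of [7]) translates to
$$\begin{aligned} \frac{1}{(1-\gamma )^{3}\varepsilon ^{2}} = \frac{(1-\gamma )^{2}}{(1-\gamma )^{3}\alpha ^{2}} = \frac{1}{(1-\gamma )\,\alpha ^{2}}, \end{aligned}$$
which is the \(\tilde{{\mathcal {O}}}((1-\gamma )^{-1}\alpha ^{-2})\) scaling quoted in the abstract.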
 
References
1.
Acemoglu D, Robinson JA (2001) A theory of political transitions. Am Econ Rev 91(4):938–963
2.
Agarwal A, Kakade S, Yang LF (2020) Model-based reinforcement learning with a generative model is minimax optimal. In: Conference on Learning Theory, pp 67–83. PMLR
4.
Akchurina N (2010) Multi-agent reinforcement learning algorithms. PhD thesis, University of Paderborn
6.
Altman E (1999) Constrained Markov decision processes: stochastic modeling. CRC Press, Boca Raton
7.
Azar MG, Munos R, Kappen HJ (2013) Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Mach Learn 91(3):325–349
8.
9.
Başar T, Bernhard P (2008) H-infinity optimal control and related minimax design problems: a dynamic game approach. Springer Science & Business Media, NY
10.
Başar T, Zaccour G (2018) Handbook of dynamic game theory. Springer International Publishing, NY
11.
Bertsekas DP (2017) Dynamic programming and optimal control. Athena Scientific, Belmont, MA
13.
Busoniu L, Babuska R, De Schutter B (2008) A comprehensive survey of multiagent reinforcement learning. IEEE Trans Syst Man Cybern Part C (Appl Rev) 38(2):156–172
14.
Cesa-Bianchi N, Lugosi G (2006) Prediction, learning, and games. Cambridge University Press, Cambridge
16.
Doraszelski U, Escobar JF (2010) A theory of regular Markov perfect equilibria in dynamic stochastic games: genericity, stability, and purification. Theor Econ 5(3):369–402
17.
Ericson R, Pakes A (1995) Markov-perfect industry dynamics: a framework for empirical work. Rev Econ Stud 62(1):53–82
18.
Fershtman C, Pakes A (2000) A dynamic oligopoly with collusion and price wars. RAND J Econ 31(2):207–236
19.
Filar J, Vrieze K (1996) Competitive Markov decision processes. Springer, New York, NY. ISBN 978-1-4612-8481-9, 978-1-4612-4054-9
20.
Filar JA, Schultz TA, Thuijsman F, Vrieze O (1991) Nonlinear programming and stationary equilibria in stochastic games. Math Program 50(1):227–237
22.
23.
Herings PJ-J, Peeters RJ et al (2004) Stationary equilibria in stochastic games: structure, selection, and computation. J Econ Theory 118(1):32–60
24.
Hinderer K (2005) Lipschitz continuity of value functions in Markovian decision processes. Math Methods Oper Res 62(1):3–22. ISSN 1432-5217
27.
Kakade SM (2003) On the sample complexity of reinforcement learning. PhD thesis, University College London
28.
Kearns M, Singh S (1999) Finite-sample convergence rates for Q-learning and indirect algorithms. Adv Neural Inf Process Syst 871:996–1002
31.
Li G, Wei Y, Chi Y, Gu Y, Chen Y (2020) Breaking the sample size barrier in model-based reinforcement learning with a generative model. Adv Neural Inf Process Syst 33:12861
32.
Littman ML (1994) Markov games as a framework for multi-agent reinforcement learning. In: International Conference on Machine Learning, pp 157–163. Elsevier
33.
Littman ML (2001) Value-function reinforcement learning in Markov games. Cognit Syst Res 2(1):55–66
34.
Mailath GJ, Samuelson L (2006) Repeated games and reputations: long-run relationships. Oxford University Press, Oxford
35.
Maskin E, Tirole J (1988) A theory of dynamic oligopoly, I: Overview and quantity competition with large fixed costs. Econometrica: J Econ Soc 549–569
36.
Maskin E, Tirole J (1988) A theory of dynamic oligopoly, II: Price competition, kinked demand curves, and Edgeworth cycles. Econometrica: J Econ Soc 571–599
38.
Müller A (1997) How does the value function of a Markov decision process depend on the transition probabilities? Math Oper Res 22(4):872–885
39.
40.
Pakes A, Ostrovsky M, Berry S (2007) Simple estimators for the parameters of discrete dynamic games (with entry/exit examples). RAND J Econ 38(2):373–399
41.
Pérolat J, Strub F, Piot B, Pietquin O (2017) Learning Nash equilibrium for general-sum Markov games from batch data. In: Artificial Intelligence and Statistics, pp 232–241. PMLR
42.
43.
Prasad H, LA P, Bhatnagar S (2015) Two-timescale algorithms for learning Nash equilibria in general-sum stochastic games. In: Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pp 1371–1379
44.
Rogers PD (1969) Nonzero-sum stochastic games. PhD thesis, University of California, Berkeley
45.
Sengupta S, Chowdhary A, Huang D, Kambhampati S (2019) General-sum Markov games for strategic detection of advanced persistent threats using moving target defense in cloud networks. In: International Conference on Decision and Game Theory for Security, pp 492–512. Springer
47.
Shoham Y, Powers R, Grenager T (2003) Multi-agent reinforcement learning: a critical survey. Technical report, Stanford University
48.
Sidford A, Wang M, Wu X, Yang LF, Ye Y (2018) Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 5192–5202
49.
Sidford A, Wang M, Yang L, Ye Y (2020) Solving discounted stochastic two-player games with near-optimal time and sample complexity. In: International Conference on Artificial Intelligence and Statistics, pp 2992–3002. PMLR
50.
Solan E (2021) A course in stochastic game theory. Cambridge University Press, Cambridge
52.
Sriperumbudur BK, Gretton A, Fukumizu K, Lanckriet GRG, Schölkopf B (2008) Injective Hilbert space embeddings of probability measures. In: Conference on Learning Theory
53.
Subramanian J, Sinha A, Seraj R, Mahajan A (2022) Approximate information state for approximate planning and reinforcement learning in partially observed systems. J Mach Learn Res 23:1–12
54.
Sutton RS (1990) Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In: International Conference on Machine Learning, pp 216–224. San Francisco, CA
55.
57.
Tidball MM, Pourtallier O, Altman E (1997) Approximations in dynamic zero-sum games II. SIAM J Control Optim 35(6):2101–2117
58.
Vrieze OJ (1987) Stochastic games with finite state and action spaces. CWI. ISBN 978-90-6196-313-4
60.
61.
Zhang K, Kakade S, Basar T, Yang L (2020) Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity. Adv Neural Inf Process Syst 33:1166
62.
Zhang K, Yang Z, Başar T (2021) Multi-agent reinforcement learning: a selective overview of theories and algorithms. In: Handbook of Reinforcement Learning and Control, pp 321–384
63.
Zhang R, Ren Z, Li N (2021) Gradient play in multi-agent Markov stochastic games: stationary points and convergence. arXiv e-prints, arXiv–2106
64.
Zhang W, Wang X, Shen J, Zhou M (2021) Model-based multi-agent policy optimization with adaptive opponent-wise rollouts. In: International Joint Conference on Artificial Intelligence, Montreal, Canada
65.
Zinkevich M, Greenwald A, Littman M (2006) Cyclic equilibria in Markov games. In: Neural Information Processing Systems, pp 1641–1648
Metadata
Title
Robustness and Sample Complexity of Model-Based MARL for General-Sum Markov Games
Authors
Jayakumar Subramanian
Amit Sinha
Aditya Mahajan
Publication date
21.01.2023
Publisher
Springer US
Published in
Dynamic Games and Applications / Issue 1/2023
Print ISSN: 2153-0785
Electronic ISSN: 2153-0793
DOI
https://doi.org/10.1007/s13235-023-00490-2
