Published in: International Journal of Machine Learning and Cybernetics 4/2024

30.09.2023 | Original Article

Multi-agent cooperation policy gradient method based on enhanced exploration for cooperative tasks

Authors: Li-yang Zhao, Tian-qing Chang, Lei Zhang, Xin-lu Zhang, Jiang-feng Wang

Abstract

Multi-agent cooperation and coordination are often essential for task fulfillment. Multi-agent deep reinforcement learning (MADRL) can learn effective solutions to such problems, but its application is still largely limited by the exploration–exploitation trade-off. Much MADRL research therefore focuses on how to explore the environment effectively and collect informative, high-quality experience that strengthens cooperative behaviors and improves policy learning. To address this problem, we propose a novel multi-agent cooperation policy gradient method, multi-agent proximal policy optimization based on self-imitation learning and random network distillation (MAPPOSR). MAPPOSR adds two policy-gradient-based components: (1) a random network distillation (RND) exploration bonus component that produces intrinsic rewards and encourages agents to visit new states and actions, helping them discover better trajectories and preventing the algorithm from converging prematurely or getting stuck in local optima; and (2) a self-imitation learning (SIL) policy update component that stores and reuses high-return trajectory samples generated by the agents themselves, strengthening their cooperation and boosting learning efficiency. Experimental results show that, in addition to effectively solving the hard-exploration problem, the proposed method significantly outperforms other state-of-the-art MADRL algorithms in learning efficiency and in escaping local optima. Moreover, the effect of different value-function inputs on algorithm performance is investigated under the centralized training and decentralized execution (CTDE) framework, and an individual-based joint-observation encoding method is developed on this basis. By encouraging each agent to focus on the local observations of the agents relevant to it and to discard the global state information provided by the environment, this encoding method mitigates the impact of excessive value-function input dimensionality and redundant features on algorithm performance.
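
To make the two components concrete, the following is a minimal sketch, not the authors' implementation: an RND bonus that scores novelty as the prediction error of a trained predictor against a fixed random target network, and a SIL-style loss that imitates stored transitions only when their return exceeds the current value estimate. The PyTorch framework, the network sizes, and the names RNDBonus and sil_loss are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class RNDBonus(nn.Module):
    """Novelty bonus: prediction error of a trained predictor network
    against a fixed, randomly initialised target network (assumed sizes)."""

    def __init__(self, obs_dim: int, feat_dim: int = 64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                    nn.Linear(128, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                       nn.Linear(128, feat_dim))
        for p in self.target.parameters():      # the target is never trained
            p.requires_grad_(False)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            tgt = self.target(obs)
        # per-sample squared error; serves both as the intrinsic reward and,
        # averaged over the batch, as the predictor's training loss
        return (self.predictor(obs) - tgt).pow(2).mean(dim=-1)


def sil_loss(log_probs: torch.Tensor,
             values: torch.Tensor,
             returns: torch.Tensor) -> torch.Tensor:
    """Self-imitation term: imitate a stored action only if its return
    exceeded the current value estimate (clipped advantage)."""
    adv = torch.clamp(returns - values, min=0.0)
    policy_term = -(log_probs * adv.detach()).mean()   # policy imitates good returns
    value_term = 0.5 * adv.pow(2).mean()               # value pulled up toward them
    return policy_term + value_term
```

In an actor-critic loop of this kind, the predictor error would typically be minimized alongside the PPO objective while the extrinsic reward is augmented with the (optionally normalized) bonus, and the SIL term would be computed on samples drawn from a buffer that retains only high-return trajectories.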

Metadata
Title
Multi-agent cooperation policy gradient method based on enhanced exploration for cooperative tasks
Authors
Li-yang Zhao
Tian-qing Chang
Lei Zhang
Xin-lu Zhang
Jiang-feng Wang
Publication date
30.09.2023
Publisher
Springer Berlin Heidelberg
Published in
International Journal of Machine Learning and Cybernetics / Issue 4/2024
Print ISSN: 1868-8071
Electronic ISSN: 1868-808X
DOI
https://doi.org/10.1007/s13042-023-01976-6
