
25.05.2021 | S.I.: Adaptive and Learning Agents 2020

Lucid dreaming for experience replay: refreshing past states with the current policy

Authors: Yunshu Du, Garrett Warnell, Assefaw Gebremedhin, Peter Stone, Matthew E. Taylor

Published in: Neural Computing and Applications | Issue 3/2022


Abstract

Experience replay (ER) improves the data efficiency of off-policy reinforcement learning (RL) algorithms by allowing an agent to store and reuse its past experiences in a replay buffer. While many techniques have been proposed to enhance ER by biasing how experiences are sampled from the buffer, thus far they have not considered strategies for refreshing experiences inside the buffer. In this work, we introduce Lucid Dreaming for Experience Replay (LiDER), a conceptually new framework that allows replay experiences to be refreshed by leveraging the agent’s current policy. LiDER consists of three steps: First, LiDER moves an agent back to a past state. Second, from that state, LiDER then lets the agent execute a sequence of actions by following its current policy—as if the agent were “dreaming” about the past and can try out different behaviors to encounter new experiences in the dream. Third, LiDER stores and reuses the new experience if it turned out better than what the agent previously experienced, i.e., to refresh its memories. LiDER is designed to be easily incorporated into off-policy, multi-worker RL algorithms that use ER; we present in this work a case study of applying LiDER to an actor–critic-based algorithm. Results show LiDER consistently improves performance over the baseline in six Atari 2600 games. Our open-source implementation of LiDER and the data used to generate all plots in this work are available at https://github.com/duyunshu/lucid-dreaming-for-exp-replay.
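To make the three steps above concrete, here is a minimal Python sketch of one LiDER-style refreshing step. It is not the authors' implementation: the state-restoring reset (`env.reset_to`), the policy interface (`policy.act`), the buffer helpers (`sample_one`, `add`), and the parameters `gamma` and `max_steps` are hypothetical placeholders, and a gym-style `env.step` returning (observation, reward, done, info) is assumed.

```python
def lider_refresh(env, policy, replay_buffer, gamma=0.99, max_steps=100):
    """One LiDER-style refreshing step (illustrative sketch only)."""
    # Step 1: move the agent back to a past state sampled from the buffer.
    past = replay_buffer.sample_one()        # hypothetical helper
    obs = env.reset_to(past.state)           # hypothetical simulator reset

    # Step 2: "dream" forward from that state by following the current policy.
    transitions, rewards = [], []
    for _ in range(max_steps):
        action = policy.act(obs)             # hypothetical policy interface
        next_obs, reward, done, _ = env.step(action)
        transitions.append((obs, action))
        rewards.append(reward)
        obs = next_obs
        if done:
            break

    # Compute the Monte-Carlo return G of the new rollout (backwards).
    g, returns = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()

    # Step 3: keep the new experience only if it turned out better than
    # what was previously stored, i.e., refresh the memory.
    if returns and returns[0] > past.G:
        for (o, a), G in zip(transitions, returns):
            replay_buffer.add(o, a, G)       # hypothetical buffer API
```

The filter in step 3 mirrors the abstract's description: a new rollout replaces the old experience only when it outperforms what the agent previously experienced from that state.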


Footnotes
1
A one-step reward r is usually stored instead of the cumulative return (e.g., Mnih et al. [27]). In this work, we follow Oh et al. [32] and store the Monte-Carlo return G; we fully describe the buffer structure in Sect. 3.
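As a small illustration of this distinction (a hypothetical helper, not the paper's buffer code; see Sect. 3 for the actual structure), the stored return G can be computed backwards from an episode's reward sequence, with the discount factor gamma assumed here:

```python
def monte_carlo_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of an episode."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

# One-step storage keeps (s_t, a_t, r_t); a Monte-Carlo buffer keeps
# (s_t, a_t, G_t). E.g., rewards [0, 0, 1] with gamma=0.99 give
# G = [0.9801, 0.99, 1.0].
```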
 
2
The implementation of A3CTBSIL is open-sourced at https://github.com/gabrieledcjr/DeepRL. In de la Cruz Jr et al. [6], we also considered using demonstrations to improve A3CTBSIL; that demonstration-based variant is not the baseline used in this work.
 
3
Note that while the A3C algorithm is on-policy, integrating A3C with SIL makes it an off-policy algorithm (as in Oh et al. [32]).
 
4
Note that the performance in Montezuma’s Revenge differs between A3CTBSIL [6] and the original SIL algorithm [32]; see the discussion in “Appendix 4.”
 
5
Note that the baseline A3CTBSIL represents the SampleD scenario, i.e., it always samples from buffer D.
 
7
The policy-based Go-Explore algorithm is an extension of the “Go-Explore without a policy” framework, which was presented in an earlier pre-print [9]. The “Go-Explore without a policy” framework also leverages the simulator reset feature.
 
8
Ecoffet et al. [10] made a detailed comparison between the policy-based Go-Explore and DTSIL; we refer interested readers to Ecoffet et al. [10] for further details.
 
References
3. Bellemare MG, Naddaf Y, Veness J, Bowling M (2013) The arcade learning environment: an evaluation platform for general agents. J Artif Intell Res 47(1):253–279
4. Chan H, Wu Y, Kiros J, Fidler S, Ba J (2019) ACTRCE: augmenting experience via teacher’s advice for multi-goal reinforcement learning. arXiv preprint arXiv:1902.04546
6. de la Cruz Jr GV, Du Y, Taylor ME (2019) Jointly pre-training with supervised, autoencoder, and value losses for deep reinforcement learning. In: Adaptive and learning agents workshop, AAMAS
8. De Bruin T, Kober J, Tuyls K, Babuška R (2015) The importance of experience replay database composition in deep reinforcement learning. In: Deep reinforcement learning workshop, NIPS
9. Ecoffet A, Huizinga J, Lehman J, Stanley KO, Clune J (2019) Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995
10.
11. Espeholt L, Soyer H, Munos R, Simonyan K, Mnih V, Ward T, Doron Y, Firoiu V, Harley T, Dunning I, Legg S, Kavukcuoglu K (2018) IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In: Proceedings of Machine Learning Research, vol 80, pp 1407–1416. http://proceedings.mlr.press/v80/espeholt18a.html
18. Hester T, Vecerik M, Pietquin O, Lanctot M, Schaul T, Piot B, Horgan D, Quan J, Sendonaris A, Osband I, Dulac-Arnold G, Agapiou J, Leibo JZ, Gruslys A (2018) Deep Q-learning from demonstrations. In: Annual meeting of the Association for the Advancement of Artificial Intelligence (AAAI), New Orleans (USA)
20. Hosu IA, Rebedea T (2016) Playing Atari games with deep reinforcement learning and human checkpoint replay. arXiv preprint arXiv:1607.05077
24. Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach Learn 8(3–4):293–321
25. Liu R, Zou J (2018) The effects of memory replay in reinforcement learning. In: The 56th annual Allerton conference on communication, control, and computing, pp 478–485
27. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529
28. Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. In: Balcan MF, Weinberger KQ (eds) Proceedings of Machine Learning Research, PMLR, New York, NY, USA, vol 48, pp 1928–1937. http://proceedings.mlr.press/v48/mniha16.html
33. Pohlen T, Piot B, Hester T, Azar MG, Horgan D, Budden D, Barth-Maron G, van Hasselt H, Quan J, Večerík M et al (2018) Observe and look further: achieving consistent performance on Atari. arXiv preprint arXiv:1805.11593
34. Resnick C, Raileanu R, Kapoor S, Peysakhovich A, Cho K, Bruna J (2018) Backplay: “Man muss immer umkehren”. In: Workshop on reinforcement learning in games, AAAI
35.
36.
37. Schaul T, Quan J, Antonoglou I, Silver D (2016) Prioritized experience replay. In: International conference on learning representations. arXiv:1511.05952
38. Schrittwieser J, Antonoglou I, Hubert T, Simonyan K, Sifre L, Schmitt S, Guez A, Lockhart E, Hassabis D, Graepel T et al (2019) Mastering Atari, Go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265
39. Sinha S, Song J, Garg A, Ermon S (2020) Experience replay with likelihood-free importance weights. arXiv preprint arXiv:2006.13169
41. Stumbrys T, Erlacher D, Schredl M (2016) Effectiveness of motor practice in lucid dreams: a comparison with physical and mental practice. J Sports Sci 34:27–34
42. Sutton RS, Barto AG (2018) Reinforcement learning: an introduction. MIT Press, Cambridge
44.
46. Wawrzyński P (2009) Real-time reinforcement learning by sequential actor-critics and experience replay. Neural Netw 22(10):1484–1497
49.
Metadata
Title
Lucid dreaming for experience replay: refreshing past states with the current policy
Authors
Yunshu Du
Garrett Warnell
Assefaw Gebremedhin
Peter Stone
Matthew E. Taylor
Publication date
25.05.2021
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 3/2022
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-021-06104-5
