
25.05.2021 | S.I.: Adaptive and Learning Agents 2020

Lucid dreaming for experience replay: refreshing past states with the current policy

Authors: Yunshu Du, Garrett Warnell, Assefaw Gebremedhin, Peter Stone, Matthew E. Taylor

Published in: Neural Computing and Applications | Issue 3/2022


Abstract

Experience replay (ER) improves the data efficiency of off-policy reinforcement learning (RL) algorithms by allowing an agent to store and reuse its past experiences in a replay buffer. While many techniques have been proposed to enhance ER by biasing how experiences are sampled from the buffer, thus far they have not considered strategies for refreshing experiences inside the buffer. In this work, we introduce Lucid Dreaming for Experience Replay (LiDER), a conceptually new framework that allows replay experiences to be refreshed by leveraging the agent’s current policy. LiDER consists of three steps: First, LiDER moves an agent back to a past state. Second, from that state, LiDER then lets the agent execute a sequence of actions by following its current policy—as if the agent were “dreaming” about the past and can try out different behaviors to encounter new experiences in the dream. Third, LiDER stores and reuses the new experience if it turned out better than what the agent previously experienced, i.e., to refresh its memories. LiDER is designed to be easily incorporated into off-policy, multi-worker RL algorithms that use ER; we present in this work a case study of applying LiDER to an actor–critic-based algorithm. Results show LiDER consistently improves performance over the baseline in six Atari 2600 games. Our open-source implementation of LiDER and the data used to generate all plots in this work are available at https://github.com/duyunshu/lucid-dreaming-for-exp-replay.
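To make the three steps above concrete, here is a minimal Python sketch of one LiDER-style refreshing step. It is not the authors' implementation: the state-restoring reset (`env.reset_to`), the policy interface (`policy.act`), the buffer helpers (`sample_one`, `add`), and the parameters `gamma` and `max_steps` are hypothetical placeholders, and a gym-style `env.step` returning (observation, reward, done, info) is assumed.

```python
def lider_refresh(env, policy, replay_buffer, gamma=0.99, max_steps=100):
    """One LiDER-style refreshing step (illustrative sketch only)."""
    # Step 1: move the agent back to a past state sampled from the buffer.
    past = replay_buffer.sample_one()        # hypothetical helper
    obs = env.reset_to(past.state)           # hypothetical simulator reset

    # Step 2: "dream" forward from that state by following the current policy.
    transitions, rewards = [], []
    for _ in range(max_steps):
        action = policy.act(obs)             # hypothetical policy interface
        next_obs, reward, done, _ = env.step(action)
        transitions.append((obs, action))
        rewards.append(reward)
        obs = next_obs
        if done:
            break

    # Compute the Monte-Carlo return G of the new rollout (backwards).
    g, returns = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()

    # Step 3: keep the new experience only if it turned out better than
    # what was previously stored, i.e., refresh the memory.
    if returns and returns[0] > past.G:
        for (o, a), G in zip(transitions, returns):
            replay_buffer.add(o, a, G)       # hypothetical buffer API
```

The filter in step 3 mirrors the abstract's description: a new rollout replaces the old experience only when it outperforms what the agent previously experienced from that state.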


Footnotes
1
A one-step reward r is usually stored instead of the cumulative return (e.g., Mnih et al. [27]). In this work, we follow Oh et al. [32] and store the Monte-Carlo return G; we fully describe the buffer structure in Sect. 3.
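As a small illustration of this distinction (a hypothetical helper, not the paper's buffer code; see Sect. 3 for the actual structure), the stored return G can be computed backwards from an episode's reward sequence, with the discount factor gamma assumed here:

```python
def monte_carlo_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of an episode."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

# One-step storage keeps (s_t, a_t, r_t); a Monte-Carlo buffer keeps
# (s_t, a_t, G_t). E.g., rewards [0, 0, 1] with gamma=0.99 give
# G = [0.9801, 0.99, 1.0].
```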
 
2
The implementation of A3CTBSIL is open-sourced at https://github.com/gabrieledcjr/DeepRL. In de la Cruz Jr et al. [6], we also considered using demonstrations to improve A3CTBSIL; that demonstration-based variant is not the baseline used in this work.
 
3
Note that while the A3C algorithm is on-policy, integrating A3C with SIL makes it an off-policy algorithm (as in Oh et al. [32]).
 
4
Note that the performance in Montezuma’s Revenge differs between A3CTBSIL [6] and the original SIL algorithm [32]; see the discussion in “Appendix 4.”
 
5
Note that the baseline A3CTBSIL represents the SampleD scenario, i.e., it always samples from buffer D.
 
7
The policy-based Go-Explore algorithm is an extension of the “Go-Explore without a policy” framework, which was presented in an earlier pre-print [9]. The “Go-Explore without a policy” framework also leverages the simulator reset feature.
 
8
Ecoffet et al. [10] made a detailed comparison between the policy-based Go-Explore and DTSIL; we refer interested readers to Ecoffet et al. [10] for further details.
 
References
3. Bellemare MG, Naddaf Y, Veness J, Bowling M (2013) The arcade learning environment: an evaluation platform for general agents. J Artif Intell Res 47(1):253–279
4. Chan H, Wu Y, Kiros J, Fidler S, Ba J (2019) ACTRCE: augmenting experience via teacher’s advice for multi-goal reinforcement learning. arXiv preprint arXiv:1902.04546
6. de la Cruz Jr GV, Du Y, Taylor ME (2019) Jointly pre-training with supervised, autoencoder, and value losses for deep reinforcement learning. In: Adaptive and learning agents workshop, AAMAS
8. De Bruin T, Kober J, Tuyls K, Babuška R (2015) The importance of experience replay database composition in deep reinforcement learning. In: Deep reinforcement learning workshop, NIPS
9. Ecoffet A, Huizinga J, Lehman J, Stanley KO, Clune J (2019) Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995
10.
11. Espeholt L, Soyer H, Munos R, Simonyan K, Mnih V, Ward T, Doron Y, Firoiu V, Harley T, Dunning I, Legg S, Kavukcuoglu K (2018) IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In: Proceedings of Machine Learning Research, vol 80, pp 1407–1416. http://proceedings.mlr.press/v80/espeholt18a.html
18. Hester T, Vecerik M, Pietquin O, Lanctot M, Schaul T, Piot B, Horgan D, Quan J, Sendonaris A, Osband I, Dulac-Arnold G, Agapiou J, Leibo JZ, Gruslys A (2018) Deep Q-learning from demonstrations. In: Annual meeting of the Association for the Advancement of Artificial Intelligence (AAAI), New Orleans (USA)
20. Hosu IA, Rebedea T (2016) Playing Atari games with deep reinforcement learning and human checkpoint replay. arXiv preprint arXiv:1607.05077
24. Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach Learn 8(3–4):293–321
25. Liu R, Zou J (2018) The effects of memory replay in reinforcement learning. In: The 56th annual Allerton conference on communication, control, and computing, pp 478–485
27. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529
28. Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. In: Balcan MF, Weinberger KQ (eds) Proceedings of Machine Learning Research, PMLR, New York, NY, USA, vol 48, pp 1928–1937. http://proceedings.mlr.press/v48/mniha16.html
33. Pohlen T, Piot B, Hester T, Azar MG, Horgan D, Budden D, Barth-Maron G, van Hasselt H, Quan J, Večerík M et al (2018) Observe and look further: achieving consistent performance on Atari. arXiv preprint arXiv:1805.11593
34. Resnick C, Raileanu R, Kapoor S, Peysakhovich A, Cho K, Bruna J (2018) Backplay: “Man muss immer umkehren”. In: Workshop on reinforcement learning in games, AAAI
35.
36.
37. Schaul T, Quan J, Antonoglou I, Silver D (2016) Prioritized experience replay. In: International conference on learning representations. arXiv:1511.05952
38. Schrittwieser J, Antonoglou I, Hubert T, Simonyan K, Sifre L, Schmitt S, Guez A, Lockhart E, Hassabis D, Graepel T et al (2019) Mastering Atari, Go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265
39. Sinha S, Song J, Garg A, Ermon S (2020) Experience replay with likelihood-free importance weights. arXiv preprint arXiv:2006.13169
41. Stumbrys T, Erlacher D, Schredl M (2016) Effectiveness of motor practice in lucid dreams: a comparison with physical and mental practice. J Sports Sci 34:27–34
42. Sutton RS, Barto AG (2018) Reinforcement learning: an introduction. MIT Press, Cambridge
44.
46. Wawrzyński P (2009) Real-time reinforcement learning by sequential actor-critics and experience replay. Neural Netw 22(10):1484–1497
49.
Metadata
Title
Lucid dreaming for experience replay: refreshing past states with the current policy
Authors
Yunshu Du
Garrett Warnell
Assefaw Gebremedhin
Peter Stone
Matthew E. Taylor
Publication date
25.05.2021
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 3/2022
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-021-06104-5
