A Novel Heuristic Exploration Method Based on Action Effectiveness Constraints to Relieve Loop Enhancement Effect in Reinforcement Learning with Sparse Rewards

Authors: Zhenghongyuan Ni, Ye Jin, Peng Liu, Wei Zhao

Published in: Cognitive Computation | Issue 2/2024 | 07-12-2023

Abstract

In realistic sparse reward tasks, existing theoretical methods cannot be effectively applied due to the low sampling probability of rewarded episodes. Profound research on methods based on intrinsic rewards has been conducted to address this issue, but exploration with sparse rewards remains a great challenge. This paper describes the loop enhancement effect in exploration processes with sparse rewards: after each fully trained iteration, the execution probability of ineffective actions is higher than that of other suboptimal actions, which violates biological habitual behavior principles and is not conducive to effective training. This paper proposes corresponding theorems for relieving the loop enhancement effect in the exploration process with sparse rewards, as well as a heuristic exploration method based on action effectiveness constraints (AEC), which improves policy training efficiency by relieving the loop enhancement effect. The method is inspired by the fact that animals form habitual behaviors and goal-directed behaviors through the dorsolateral striatum and dorsomedial striatum, respectively. The function of the dorsolateral striatum is simulated by an action effectiveness evaluation mechanism (A2EM), which aims to reduce the rate of ineffective samples and improve episode reward expectations. The function of the dorsomedial striatum is simulated by an agent policy network, which aims to achieve task goals. The iterative training of A2EM and the policy forms the AEC model structure: A2EM provides effective samples for the agent policy, and the agent policy provides training constraints for A2EM. The experimental results show that A2EM can relieve the loop enhancement effect and has good interpretability and generalizability. AEC enables agents to effectively reduce the loop rate in samples, collect more effective samples, and improve the efficiency of policy training. The performance of AEC demonstrates the effectiveness of a biological heuristic approach that simulates the function of the dorsal striatum. This approach can be used to improve the robustness of agent exploration with sparse rewards.
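
To make the AEC structure described above concrete, the following is a minimal, hypothetical sketch of how an action effectiveness model could constrain a policy's exploration. It is not taken from the paper, whose losses and theorems are not reproduced on this page; all names (`A2EM`, `Policy`, `constrained_action`, `eff_threshold`), the thresholded masking rule, and the fallback behavior are illustrative assumptions.

```python
# Hypothetical sketch only: an effectiveness model masks actions the agent has
# learned to consider ineffective (e.g., actions that loop back to the same
# state), so the policy samples from the remaining actions. Shapes, threshold,
# and the fallback rule are assumptions, not the paper's specification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class A2EM(nn.Module):
    """Estimates P(action is effective | state); plays the role the paper
    assigns to the dorsolateral striatum (habitual filtering)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions), nn.Sigmoid())

    def forward(self, obs):
        return self.net(obs)

class Policy(nn.Module):
    """Goal-directed policy; plays the role assigned to the dorsomedial striatum."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, obs):
        return F.softmax(self.net(obs), dim=-1)

@torch.no_grad()
def constrained_action(policy, a2em, obs, eff_threshold=0.5):
    """Sample an action whose estimated effectiveness exceeds the threshold;
    fall back to the unconstrained policy if every action is masked out."""
    probs = policy(obs)
    mask = (a2em(obs) > eff_threshold).float()
    masked = probs * mask
    if masked.sum() == 0:
        masked = probs  # no action deemed effective: do not block exploration
    return torch.multinomial(masked / masked.sum(), num_samples=1).item()

# Usage with made-up dimensions (4-dimensional observation, 3 actions):
obs = torch.randn(4)
policy, a2em = Policy(4, 3), A2EM(4, 3)
action = constrained_action(policy, a2em, obs)
```

In the scheme the abstract describes, A2EM and the policy are trained iteratively: A2EM would additionally be fit to labels of action effectiveness (for example, whether an action changed the state, with looping actions marked ineffective), while the policy's behavior supplies constraints on that update; the exact losses and theorems are given in the paper itself.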

Metadata
Title
A Novel Heuristic Exploration Method Based on Action Effectiveness Constraints to Relieve Loop Enhancement Effect in Reinforcement Learning with Sparse Rewards
Authors
Zhenghongyuan Ni
Ye Jin
Peng Liu
Wei Zhao
Publication date
07-12-2023
Publisher
Springer US
Published in
Cognitive Computation / Issue 2/2024
Print ISSN: 1866-9956
Electronic ISSN: 1866-9964
DOI
https://doi.org/10.1007/s12559-023-10226-4
