
2019 | OriginalPaper | Chapter

Safe Policy Learning with Constrained Return Variance

Author: Arushi Jain

Published in: Advances in Artificial Intelligence

Publisher: Springer International Publishing


Abstract

In safety-critical applications, it is desirable that the agent behave in a reliable and repeatable manner, which the conventional reinforcement learning (RL) setting often fails to provide. In this work, we derive a novel algorithm for learning a safe hierarchical policy by constraining a direct estimate of the variance of the return in the Option-Critic framework [1]. We first present a novel theorem for safe control in policy gradient methods and then extend the derivation to the Option-Critic framework.
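The core idea of constraining return variance in a policy gradient method can be illustrated with a minimal sketch. The example below is not the paper's algorithm; it is a hypothetical REINFORCE-style update on the mean-variance objective J = E[G] - λ·Var[G], using the likelihood-ratio form of the variance gradient (in the spirit of Tamar et al. [6]): ∇Var[G] = E[G²∇log π] - 2·E[G]·E[G∇log π]. The bandit environment, step counts, and learning rates are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-armed bandit: arm 0 has a slightly higher mean return
# but far higher variance than arm 1, so a variance-penalised ("safe")
# policy should prefer arm 1.
MEANS = np.array([1.0, 0.9])
STDS = np.array([2.0, 0.1])

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def train(lam, steps=5000, lr=0.05):
    """REINFORCE on the mean-variance objective J = E[G] - lam * Var[G].

    Per-sample update direction: (G - lam * (G^2 - 2 * J_hat * G)) * grad log pi,
    where J_hat is a running estimate of the mean return E[G].
    """
    theta = np.zeros(2)
    j_hat = 0.0
    for _ in range(steps):
        pi = softmax(theta)
        a = rng.choice(2, p=pi)
        g = rng.normal(MEANS[a], STDS[a])   # sampled return
        grad_logpi = -pi
        grad_logpi[a] += 1.0                # grad of log pi(a | theta)
        adv = g - lam * (g * g - 2.0 * j_hat * g)
        theta += lr * adv * grad_logpi
        j_hat += 0.01 * (g - j_hat)         # track E[G]
    return softmax(theta)

pi_risky = train(lam=0.0)   # risk-neutral objective
pi_safe = train(lam=1.0)    # variance-penalised objective favours arm 1
```

With λ = 1 the penalised value of arm 0 is roughly 1.0 - 4.0 = -3.0 versus 0.9 - 0.01 ≈ 0.89 for arm 1, so the constrained policy concentrates on the low-variance arm even though its mean return is lower.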


Literature
1. Bacon, P.L., Harb, J., Precup, D.: The option-critic architecture. In: AAAI, pp. 1726–1734 (2017)
2. Jain, A., Khetarpal, K., Precup, D.: Safe option-critic: learning safety in the option-critic architecture. arXiv preprint arXiv:1807.08060 (2018)
3. Prashanth, L., Ghavamzadeh, M.: Actor-critic algorithms for risk-sensitive MDPs. In: Advances in Neural Information Processing Systems, pp. 252–260 (2013)
4. Sato, M., Kimura, H., Kobayashi, S.: TD algorithm for the variance of return and mean-variance reinforcement learning. Trans. Jpn. Soc. Artif. Intell. 16(3), 353–362 (2001)
5. Sherstan, C., et al.: Directly estimating the variance of the \(\lambda\)-return using temporal-difference methods. arXiv preprint arXiv:1801.08287 (2018)
6. Tamar, A., Di Castro, D., Mannor, S.: Policy gradients with variance related risk criteria. In: Proceedings of the Twenty-Ninth International Conference on Machine Learning, pp. 387–396 (2012)
7. Tamar, A., Di Castro, D., Mannor, S.: Learning the variance of the reward-to-go. J. Mach. Learn. Res. 17(13), 1–36 (2016)
Metadata
Title
Safe Policy Learning with Constrained Return Variance
Author
Arushi Jain
Copyright Year
2019
DOI
https://doi.org/10.1007/978-3-030-18305-9_68
