
2020 | Original Paper | Book Chapter

Faithful and Effective Reward Schemes for Model-Free Reinforcement Learning of Omega-Regular Objectives

Authors: Ernst Moritz Hahn, Mateo Perez, Sven Schewe, Fabio Somenzi, Ashutosh Trivedi, Dominik Wojtczak

Published in: Automated Technology for Verification and Analysis

Publisher: Springer International Publishing


Abstract

Omega-regular properties—specified using linear time temporal logic or various forms of omega-automata—find increasing use in specifying the objectives of reinforcement learning (RL). The key problem that arises is that of faithful and effective translation of the objective into a scalar reward for model-free RL. A recent approach exploits Büchi automata with restricted nondeterminism to reduce the search for an optimal policy for an omega-regular property to that for a simple reachability objective. A possible drawback of this translation is that reachability rewards are sparse, being reaped only at the end of each episode. Another approach reduces the search for an optimal policy to an optimization problem with two interdependent discount parameters. While this approach provides denser rewards than the reduction to reachability, it is not easily mapped to off-the-shelf RL algorithms. We propose a reward scheme that reduces the search for an optimal policy to an optimization problem with a single discount parameter that produces dense rewards and is compatible with off-the-shelf RL algorithms. Finally, we report an experimental comparison of these and other reward schemes for model-free RL with omega-regular objectives.
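To make the general idea of such reward schemes concrete, the following is a minimal, hypothetical sketch of tabular Q-learning on the product of a toy MDP with a small automaton, where a dense reward is paid on every accepting transition and a single discount parameter is used throughout. This is not the authors' exact construction or benchmarks; the MDP, the automaton, and all names (MDP, automaton_step, ACCEPT_REWARD, run_episode) are illustrative assumptions only.

```python
import random
from collections import defaultdict

# Hypothetical toy MDP: states 0..3, actions 'a'/'b', each entry maps
# (state, action) to a list of (probability, successor) pairs.
MDP = {
    (0, 'a'): [(0.9, 1), (0.1, 0)],
    (0, 'b'): [(1.0, 2)],
    (1, 'a'): [(1.0, 3)],
    (1, 'b'): [(1.0, 0)],
    (2, 'a'): [(1.0, 0)],
    (2, 'b'): [(0.5, 3), (0.5, 2)],
    (3, 'a'): [(1.0, 0)],
    (3, 'b'): [(1.0, 3)],
}
ACTIONS = ['a', 'b']

# Hypothetical two-state automaton: moving into automaton state 1
# (which happens whenever MDP state 3 is visited) counts as "accepting".
def automaton_step(q, mdp_state):
    return 1 if mdp_state == 3 else 0

GAMMA = 0.99                # single discount parameter
ACCEPT_REWARD = 1 - GAMMA   # dense reward paid on each accepting transition

def sample(dist):
    # Draw a successor state from a (probability, state) distribution.
    r, acc = random.random(), 0.0
    for p, s in dist:
        acc += p
        if r <= acc:
            return s
    return dist[-1][1]

Q = defaultdict(float)

def run_episode(steps=200, eps=0.1, alpha=0.1):
    s, q = 0, 0  # product state: (MDP state, automaton state)
    for _ in range(steps):
        # Epsilon-greedy action selection on the product state.
        a = random.choice(ACTIONS) if random.random() < eps else \
            max(ACTIONS, key=lambda act: Q[(s, q, act)])
        s2 = sample(MDP[(s, a)])
        q2 = automaton_step(q, s2)
        r = ACCEPT_REWARD if q2 == 1 else 0.0  # reward only on accepting moves
        best_next = max(Q[(s2, q2, act)] for act in ACTIONS)
        Q[(s, q, a)] += alpha * (r + GAMMA * best_next - Q[(s, q, a)])
        s, q = s2, q2

for _ in range(500):
    run_episode()

# Greedy policy extracted from the learned Q-table (automaton state 0 shown).
policy = {state: max(ACTIONS, key=lambda act: Q[(state, 0, act)]) for state in range(4)}
print(policy)
```

Under this sketch, the learner is rewarded each time an accepting transition of the product is taken rather than only at episode end, which is the kind of density the abstract contrasts with sparse reachability rewards; the exact reward values and discounting used in the paper may differ.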


Metadata
Title
Faithful and Effective Reward Schemes for Model-Free Reinforcement Learning of Omega-Regular Objectives
Authors
Ernst Moritz Hahn
Mateo Perez
Sven Schewe
Fabio Somenzi
Ashutosh Trivedi
Dominik Wojtczak
Copyright Year
2020
DOI
https://doi.org/10.1007/978-3-030-59152-6_6
