2020 | Original Paper | Book Chapter

PBCS: Efficient Exploration and Exploitation Using a Synergy Between Reinforcement Learning and Motion Planning

Authors: Guillaume Matheron, Nicolas Perrin, Olivier Sigaud

Published in: Artificial Neural Networks and Machine Learning – ICANN 2020

Publisher: Springer International Publishing

Abstract

The exploration-exploitation trade-off is at the heart of reinforcement learning (RL). However, most continuous control benchmarks used in recent RL research only require local exploration. This has led to the development of algorithms with only basic exploration capabilities, which behave poorly in benchmarks that require more versatile exploration. For instance, as demonstrated in our empirical study, state-of-the-art RL algorithms such as DDPG and TD3 are unable to steer a point mass in even small 2D mazes. In this paper, we propose a new algorithm called “Plan, Backplay, Chain Skills” (PBCS) that combines motion planning and reinforcement learning to solve hard-exploration environments. In a first phase, a motion planning algorithm is used to find a single good trajectory; an RL algorithm is then trained on a curriculum derived from this trajectory, combining a variant of the Backplay algorithm with skill chaining. We show that this method outperforms state-of-the-art RL algorithms in 2D maze environments of various sizes, and is able to improve on the trajectory obtained by the motion planning phase.
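To make the two-phase structure concrete, here is a minimal Python sketch of the control flow the abstract describes: a motion planner first produces one feasible trajectory, then skills are trained backwards along it, Backplay-style, and chained together. All names in it (plan_trajectory, train_skill, the point-mass stubs) are hypothetical stand-ins invented for illustration, not the authors' implementation.

```python
# Hypothetical sketch of PBCS's two phases, based only on the abstract.
# plan_trajectory and train_skill are illustrative stubs, not the paper's code.
from typing import Callable, List, Tuple

State = Tuple[float, float]                      # a point mass in a 2D maze
Policy = Callable[[State], Tuple[float, float]]  # maps state -> action

def plan_trajectory(start: State, goal: State, n: int = 10) -> List[State]:
    """Phase 1: a motion planner finds a single good trajectory, returned
    here as a sequence of waypoints. Stubbed as a straight line."""
    return [(start[0] + (goal[0] - start[0]) * i / n,
             start[1] + (goal[1] - start[1]) * i / n) for i in range(n + 1)]

def train_skill(reset_state: State, target: State) -> Policy:
    """Train one RL policy to reach `target` from episodes reset at
    `reset_state`. Stubbed as a proportional controller standing in for
    an actual learner such as TD3."""
    def policy(state: State) -> Tuple[float, float]:
        return (target[0] - state[0], target[1] - state[1])
    return policy

def pbcs(start: State, goal: State) -> List[Policy]:
    """Phase 2: walk the planned trajectory backwards (Backplay-style, so
    early training episodes start near the goal) and chain one skill per
    segment; executing the chain in order reproduces the trajectory."""
    waypoints = plan_trajectory(start, goal)
    skills: List[Policy] = []
    for i in reversed(range(len(waypoints) - 1)):
        skills.insert(0, train_skill(waypoints[i], waypoints[i + 1]))
    return skills

if __name__ == "__main__":
    chain = pbcs(start=(0.0, 0.0), goal=(1.0, 1.0))
    print(f"trained a chain of {len(chain)} skills")
```

Note that the sketch elides the stopping criteria of the combined methods: in Backplay proper, the reset point moves backwards only once the policy succeeds from its current start, and skill chaining would spawn a new skill only when a single policy stops improving.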

References
2. Benureau, F.C.Y., Oudeyer, P.Y.: Behavioral diversity generation in autonomous exploration through reuse of past experience. Front. Robot. AI 3, 8 (2016)
4. Chiang, H.T.L., Hsu, J., Fiser, M., Tapia, L., Faust, A.: RL-RRT: kinodynamic motion planning via learning reachability estimators from RL policies. arXiv:1907.04799 (2019)
6. Colas, C., Sigaud, O., Oudeyer, P.Y.: GEP-PG: decoupling exploration and exploitation in deep reinforcement learning algorithms. arXiv:1802.05054 (2018)
7. Cully, A., Demiris, Y.: Quality and diversity optimization: a unifying modular framework. IEEE Trans. Evol. Comput. 1 (2017)
8. Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K.O., Clune, J.: Go-Explore: a new approach for hard-exploration problems. arXiv:1901.10995 (2019)
9. Erickson, L.H., LaValle, S.M.: Survivability: measuring and ensuring path diversity. In: 2009 IEEE International Conference on Robotics and Automation, pp. 2068–2073 (2009)
10. Eysenbach, B., Gupta, A., Ibarz, J., Levine, S.: Diversity is all you need: learning skills without a reward function. arXiv:1802.06070 (2018)
11. Faust, A., et al.: PRM-RL: long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. arXiv:1710.03937 (2018)
12. Florensa, C., Held, D., Wulfmeier, M., Zhang, M., Abbeel, P.: Reverse curriculum generation for reinforcement learning. arXiv:1707.05300 (2018)
13. Fournier, P., Sigaud, O., Colas, C., Chetouani, M.: CLIC: curriculum learning and imitation for object control in non-rewarding environments. arXiv:1901.09720 (2019)
14. Fujimoto, S., van Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: ICML (2018)
17. van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., Modayil, J.: Deep reinforcement learning and the deadly triad. arXiv:1812.02648 (2018)
18. Hosu, I.A., Rebedea, T.: Playing Atari games with deep reinforcement learning and human checkpoint replay. arXiv:1607.05077 (2016)
19. Knepper, R.A., Mason, M.T.: Path diversity is only part of the problem. In: 2009 IEEE International Conference on Robotics and Automation, pp. 3224–3229 (2009)
20. Konidaris, G., Barto, A.G.: Skill discovery in continuous reinforcement learning domains using skill chaining. In: Bengio, Y., et al. (eds.) Advances in Neural Information Processing Systems, vol. 22, pp. 1015–1023 (2009)
21. Konidaris, G., Kuindersma, S., Grupen, R., Barto, A.G.: Constructing skill trees for reinforcement learning agents from demonstration trajectories. In: Lafferty, J.D., et al. (eds.) Advances in Neural Information Processing Systems, vol. 23, pp. 1162–1170 (2010)
22. LaValle, S.M.: Rapidly-exploring random trees: a new tool for path planning. Technical report, Iowa State University (1998)
24. Matheron, G., Perrin, N., Sigaud, O.: The problem with DDPG: understanding failures in deterministic environments with sparse rewards. arXiv:1911.11679 (2019)
26. Morere, P., Francis, G., Blau, T., Ramos, F.: Reinforcement learning with probabilistically complete exploration. arXiv:2001.06940 (2020)
27. Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., Abbeel, P.: Overcoming exploration in reinforcement learning with demonstrations. arXiv:1709.10089 (2018)
28. Ng, A.Y., Harada, D., Russell, S.J.: Policy invariance under reward transformations: theory and application to reward shaping. In: Proceedings of the Sixteenth International Conference on Machine Learning, ICML 1999, pp. 278–287 (1999)
31. Pathak, D., Agrawal, P., Efros, A.A., Darrell, T.: Curiosity-driven exploration by self-supervised prediction. arXiv:1705.05363 (2017)
32. Penedones, H., Vincent, D., Maennel, H., Gelly, S., Mann, T., Barreto, A.: Temporal difference learning with neural networks - study of the leakage propagation problem. arXiv:1807.03064 (2018)
33. Pugh, J.K., Soros, L.B., Szerlip, P.A., Stanley, K.O.: Confronting the challenge of quality diversity. In: Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, GECCO 2015, pp. 967–974. ACM, New York (2015)
34. Pugh, J.K., Soros, L.B., Stanley, K.O.: Quality diversity: a new frontier for evolutionary computation. Front. Robot. AI 3, 40 (2016)
35. Resnick, C., Raileanu, R., Kapoor, S., Peysakhovich, A., Cho, K., Bruna, J.: Backplay: “Man muss immer umkehren”. arXiv:1807.06919 (2018)
40. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv:1707.06347 (2017)
41. Stadie, B.C., Levine, S., Abbeel, P.: Incentivizing exploration in reinforcement learning with deep predictive models. arXiv:1507.00814 (2015)
42. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (2018)
Metadata
Title
PBCS: Efficient Exploration and Exploitation Using a Synergy Between Reinforcement Learning and Motion Planning
Authors
Guillaume Matheron
Nicolas Perrin
Olivier Sigaud
Copyright Year
2020
DOI
https://doi.org/10.1007/978-3-030-61616-8_24