Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning

https://doi.org/10.1016/S0921-8890(01)00113-0

Abstract

In this paper, we propose a hierarchical reinforcement learning architecture that realizes practical learning speed in real hardware control tasks. In order to enable learning in a practical number of trials, we introduce a low-dimensional representation of the state of the robot for higher-level planning. The upper level learns a discrete sequence of sub-goals in a low-dimensional state space for achieving the main goal of the task. The lower-level modules learn local trajectories in the original high-dimensional state space to achieve the sub-goal specified by the upper level.

We applied the hierarchical architecture to a three-link, two-joint robot for the task of learning to stand up by trial and error. The upper-level learning was implemented by Q-learning, while the lower-level learning was implemented by a continuous actor–critic method. The robot successfully learned to stand up within 750 trials in simulation and then in an additional 170 trials using real hardware. The effects of the upper-level search step size and of a supplementary reward for achieving sub-goals were also tested in simulation.

Introduction

Recently, there have been many attempts to apply reinforcement learning (RL) algorithms to the acquisition of goal-directed behaviors in autonomous robots. However, a crucial issue in applying RL to real-world robot control is the curse of dimensionality. For example, control of a humanoid robot easily involves a forty- or higher-dimensional state space. Thus, the usual way of quantizing the state space with grids easily breaks down. We have recently developed RL algorithms for dealing with continuous-time, continuous-state control tasks without explicit quantization of state and time [6]. However, there is still a need to develop methods for high-dimensional function approximation and for global exploration. The speed of learning is crucial in applying RL to real hardware control because, unlike in idealized simulations, such non-stationary effects as sensor drift and mechanical aging are not negligible and learning has to be quick enough to keep track of such changes in the environment.

In this paper, we propose a hierarchical RL architecture that realizes a practical learning speed in high-dimensional control tasks. Hierarchical RL methods have been developed for creating reusable behavioral modules [4], [21], [25], solving partially observable Markov decision problems (POMDPs) [26], and for improving learning speed [3], [10].

Many hierarchical RL methods use coarse and fine grain quantization of the state space. However, in a high-dimensional state space, even the coarsest quantization into two bins in each dimension would create a prohibitive number of states. Thus, in designing a hierarchical RL architecture in high-dimensional space, it is essential to reduce the dimensions of the state space [16].

In this study, we propose a hierarchical RL architecture in which the upper-level learner globally explores sequences of sub-goals in a low-dimensional state space, while the lower-level learners optimize local trajectories in the high-dimensional state space.

As a concrete example, we consider a “stand-up” task for a two-joint, three-link robot (see Fig. 1). The goal of the task is to find a path in a high-dimensional state space that links a lying state to an upright state under the constraints of the system dynamics. The robot is a non-holonomic system, as there is no actuator linking the robot to the ground, and thus trajectory planning is non-trivial. The geometry of the robot is such that there is no static solution; the robot has to stand up dynamically by utilizing the momentum of its body.
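
For a concrete picture of the two state spaces involved, the following sketch contrasts a full continuous state (angles plus velocities, used by the lower level) with a coarse, velocity-free posture representation for upper-level sub-goal planning. The variable names, the choice of pitch and joint angles, and the 30° bin width are illustrative assumptions, not the exact representation used in the paper.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class FullState:
    """Full continuous state of the three-link robot (placeholder names).

    The lower-level learner works in this space, which keeps the velocities
    needed for a dynamic, momentum-based stand-up.
    """
    pitch: float      # body pitch angle with respect to the ground [rad]
    joint1: float     # hip joint angle [rad]
    joint2: float     # knee joint angle [rad]
    d_pitch: float    # pitch angular velocity [rad/s]
    d_joint1: float   # hip angular velocity [rad/s]
    d_joint2: float   # knee angular velocity [rad/s]

    def to_vector(self) -> np.ndarray:
        return np.array([self.pitch, self.joint1, self.joint2,
                         self.d_pitch, self.d_joint1, self.d_joint2])


def reduce_state(x: FullState, bin_width: float = np.pi / 6) -> tuple:
    """Project the full state onto a coarse posture representation for
    upper-level sub-goal planning: velocities are dropped and the angles
    are discretized. The 30-degree bin width is an illustrative assumption."""
    return tuple(int(round(a / bin_width))
                 for a in (x.pitch, x.joint1, x.joint2))
```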

This paper is organized as follows. In Section 2, we explain the proposed hierarchical RL method. In Section 3, we show simulation results of the stand-up task using the proposed method and compare the performance with non-hierarchical RL. In Section 4, we describe our real robot and system configuration and show results of the stand-up task with a real robot using the proposed method. In Section 5, we discuss the difference between our method and previous methods in terms of hierarchical RL, RL using real robots, and the stand-up task. Finally, we conclude this paper in Section 6.

Section snippets

Hierarchical reinforcement learning

In this section, we propose a hierarchical RL architecture for non-linear control problems. The basic idea is to decompose a non-linear problem in a high-dimensional state space into two levels: a non-linear problem in a lower-dimensional space and nearly-linear problems in the high-dimensional space (see Fig. 2).
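
As a minimal sketch of how such a decomposition can be wired up, the code below implements a tabular Q-learner over discretized postures whose action is a displacement to the next sub-goal posture; the lower level (a continuous actor–critic in the paper, omitted here) is then responsible for driving the robot to that sub-goal in the full state space. Class and method names are ours, and the hyperparameter values are placeholders.

```python
import random
from collections import defaultdict


class UpperLevelQ:
    """Tabular Q-learning over discretized postures; an action is a small
    displacement to the next sub-goal posture (a minimal, hypothetical sketch)."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)      # (posture, action) -> value estimate
        self.actions = actions           # candidate sub-goal displacements
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose_subgoal(self, posture):
        """Epsilon-greedy choice of the displacement to the next sub-goal."""
        if random.random() < self.epsilon:
            action = random.choice(self.actions)
        else:
            action = max(self.actions, key=lambda a: self.q[(posture, a)])
        subgoal = tuple(p + d for p, d in zip(posture, action))
        return action, subgoal

    def update(self, posture, action, reward, next_posture):
        """One-step Q-learning backup on the completed sub-goal transition."""
        best_next = max(self.q[(next_posture, a)] for a in self.actions)
        td_error = reward + self.gamma * best_next - self.q[(posture, action)]
        self.q[(posture, action)] += self.alpha * td_error
```

In this reading, `choose_subgoal` hands a target posture to the lower-level controller, and the posture actually reached, together with the upper-level reward, is fed back through `update`.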

Simulations

First, we show simulation results of the stand-up task with a two-joint, three-link robot using the hierarchical RL architecture. We then investigate the basic properties of the hierarchical architecture in a simplified stand-up task with one joint. We show how the performance changes with the action step size in the upper level. We also compare the performance of the hierarchical and non-hierarchical RL architectures. Finally, we show the role of the upper-level reward R_sub for achieving sub-goals.
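
The upper-level reward structure being compared can be summarized roughly as follows; the sketch assumes a simple additive form and placeholder magnitudes, which are not the values used in the experiments.

```python
def upper_level_reward(reached_subgoal: bool, stood_up: bool,
                       r_sub: float = 0.1, r_goal: float = 1.0) -> float:
    """Sketch of an additive upper-level reward: a supplementary reward R_sub
    for achieving the current sub-goal plus the main reward for reaching the
    upright posture. The magnitudes here are placeholders, not the paper's."""
    reward = 0.0
    if reached_subgoal:
        reward += r_sub
    if stood_up:
        reward += r_goal
    return reward
```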

Real robot experiments

Next, we applied the hierarchical RL architecture to a real robot (see the configuration in Fig. 11). As the initial condition for the real-robot learning, we used the sub-goal sequence and non-linear controllers acquired in the simulation of Section 3.1.

We used a PC/AT with a Pentium 233 MHz CPU and RT-Linux as the operating system for controlling the robot (see Fig. 12). The time step of the lower-level learning was Δt=0.01 [s], and that of the servo control
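
A generic fixed-rate loop at the lower-level time step Δt = 0.01 [s] might look like the sketch below; `read_sensors`, `compute_torque`, and `send_command` stand in for the robot's actual I/O, and a plain sleep-based loop only approximates the timing guarantees that RT-Linux provided in the real system.

```python
import time

DT = 0.01  # lower-level learning time step from the paper [s]


def control_loop(read_sensors, compute_torque, send_command, duration=5.0):
    """Run a fixed-rate (100 Hz) control loop for `duration` seconds.

    The three callables are placeholders for the real robot's I/O; the
    sleep-based scheduling here only approximates the hard real-time
    behaviour that RT-Linux provided in the actual system.
    """
    next_tick = time.monotonic()
    end_time = next_tick + duration
    while time.monotonic() < end_time:
        state = read_sensors()           # full continuous state of the robot
        torque = compute_torque(state)   # lower-level policy output
        send_command(torque)             # send torque commands to the joints
        next_tick += DT
        time.sleep(max(0.0, next_tick - time.monotonic()))
```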

Discussion

In this section, we summarize the achievements of this study in relation to previous studies of hierarchical RL, RL using real robots, and the stand-up task for robots.

Conclusions

We proposed a hierarchical RL architecture that uses a low-dimensional state representation in the upper level. The stand-up task was accomplished by the hierarchical RL architecture using a real, two-joint, three-link robot. We showed that the hierarchical RL architecture achieved the task much faster and more robustly than a plain RL architecture. We also showed that successful stand-up was not so sensitive to the choice of the upper-level step size and that the upper-level reward R_sub was

Acknowledgements

We would like to thank Mitsuo Kawato, Stefan Schaal, Christopher G. Atkeson, Tsukasa Ogasawara, Kazuyuki Samejima, Andrew G. Barto, and the anonymous reviewers for their helpful comments.

References (28)

  • F. Kanehiro, M. Inaba, H. Inoue, Development of a two-armed bipedal robot that can walk and carry objects, in:...
  • H. Kimura, S. Kobayashi, Efficient non-linear control by combining Q-learning with local linear controllers, in:...
  • F. Kirchner, Q-learning of complex behaviours on a six-legged walking machine, in: Proceedings of the Second EUROMICRO...
  • Y. Kuniyoshi, A. Nagakubo, Humanoid as a research vehicle into flexible complex interaction, in: Proceedings of the...

Jun Morimoto received his B.E. in Computer-Controlled Mechanical Systems from Osaka University in 1996, his M.E. in Information Science from the Nara Institute of Science and Technology in 1998, and his Ph.D. in Information Science from the Nara Institute of Science and Technology in 2001. He was a Research Assistant at the Kawato Dynamic Brain Project, ERATO, JST in 1999. He is now a postdoctoral fellow at the Robotics Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania. He is a member of the Japanese Neural Network Society and the Robotics Society of Japan. He received the Young Investigator Award from the Japanese Neural Network Society in 2000. His research interests include reinforcement learning and robotics.

Kenji Doya received his B.S., M.S., and Ph.D. in Mathematical Engineering from the University of Tokyo in 1984, 1986, and 1991, respectively. He was a Research Associate at the University of Tokyo in 1986, a post-graduate researcher at the Department of Biology, UCSD in 1991, and a Research Associate of the Howard Hughes Medical Institute at the Computational Neurobiology Laboratory, Salk Institute in 1993. He took the positions of Senior Researcher at ATR Human Information Processing Research Laboratories in 1994, leader of the Computational Neurobiology Group at the Kawato Dynamic Brain Project, ERATO, JST in 1996, and leader of the Neuroinformatics Project at the Information Sciences Division, ATR International in 2000. He has been a visiting Associate Professor at the Nara Institute of Science and Technology since 1995, and the Director of Metalearning, Neuromodulation, and Emotion Research, CREST, JST since 1999. He is an Action Editor of Neural Networks and Neural Computation, a board member of the Japanese Neural Network Society, and a member of the Society for Neuroscience and the International Neural Network Society. His research interests include non-linear dynamics, reinforcement learning, the functions of the basal ganglia and the cerebellum, and the roles of neuromodulators in metalearning.
