Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning
Introduction
Recently, there have been many attempts to apply reinforcement learning (RL) algorithms to the acquisition of goal-directed behaviors in autonomous robots. However, a crucial issue in applying RL to real-world robot control is the curse of dimensionality. For example, control of a humanoid robot easily involves a forty- or higher-dimensional state space. Thus, the usual way of quantizing the state space with grids easily breaks down. We have recently developed RL algorithms for dealing with continuous-time, continuous-state control tasks without explicit quantization of state and time [6]. However, there is still a need to develop methods for high-dimensional function approximation and for global exploration. The speed of learning is crucial in applying RL to real hardware control because, unlike in idealized simulations, such non-stationary effects as sensor drift and mechanical aging are not negligible and learning has to be quick enough to keep track of such changes in the environment.
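The continuous-time, continuous-state formulation cited as [6] replaces the discrete TD error with one built from the reward, a value discount time constant, and the time derivative of the value function. A minimal Euler-discretized sketch of that TD error (the function name, parameter names, and step size are our own, for illustration):

```python
def continuous_td_error(r, v, v_next, tau=1.0, dt=0.01):
    """TD error for one Euler step of the continuous-time formulation:
    delta(t) = r(t) - V(x(t)) / tau + dV/dt."""
    v_dot = (v_next - v) / dt  # finite-difference estimate of dV/dt
    return r - v / tau + v_dot
```

When the value function already satisfies the consistency condition, the error is near zero; the learner adjusts its value estimate in proportion to this quantity.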
In this paper, we propose a hierarchical RL architecture that realizes a practical learning speed in high-dimensional control tasks. Hierarchical RL methods have been developed for creating reusable behavioral modules [4], [21], [25], solving partially observable Markov decision problems (POMDPs) [26], and for improving learning speed [3], [10].
Many hierarchical RL methods use coarse and fine grain quantization of the state space. However, in a high-dimensional state space, even the coarsest quantization into two bins in each dimension would create a prohibitive number of states. Thus, in designing a hierarchical RL architecture in high-dimensional space, it is essential to reduce the dimensions of the state space [16].
In this study, we propose a hierarchical RL architecture in which the upper-level learner globally explores sequences of sub-goals in a low-dimensional state space, while the lower-level learners optimize local trajectories in the high-dimensional state space.
As a concrete example, we consider a “stand-up” task for a two-joint, three-link robot (see Fig. 1). The goal of the task is to find a path in a high-dimensional state space that links a lying state to an upright state under the constraints of the system dynamics. The robot is a non-holonomic system, as there is no actuator linking the robot to the ground, and thus trajectory planning is non-trivial. The geometry of the robot is such that there is no static solution; the robot has to stand up dynamically by utilizing the momentum of its body.
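To make the goal condition concrete, a hypothetical success test for "reaching the upright state" might look like the following; the state layout and the tolerance values are illustrative assumptions, not the paper's specification.

```python
def is_upright(pitch, joint_velocities, angle_tol=0.1, vel_tol=0.5):
    """True if the posture is approximately an upright equilibrium:
    the body pitch is near vertical (pitch ~ 0) and the joint
    velocities are small. Thresholds here are placeholders."""
    return abs(pitch) < angle_tol and all(
        abs(w) < vel_tol for w in joint_velocities
    )
```

Because no static path of such postures connects lying to standing, the robot must pass through states that violate this test at speed, which is why momentum is essential.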
This paper is organized as follows. In Section 2, we explain the proposed hierarchical RL method. In Section 3, we show simulation results of the stand-up task using the proposed method and compare the performance with non-hierarchical RL. In Section 4, we describe our real robot and system configuration and show results of the stand-up task with a real robot using the proposed method. In Section 5, we discuss the difference between our method and previous methods in terms of hierarchical RL, RL using real robots, and the stand-up task. Finally, we conclude this paper in Section 6.
Section snippets
Hierarchical reinforcement learning
In this section, we propose a hierarchical RL architecture for non-linear control problems. The basic idea is to decompose a non-linear problem in a high-dimensional state space into two levels: a non-linear problem in a lower-dimensional space and nearly-linear problems in the high-dimensional space (see Fig. 2).
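The two-level decomposition can be sketched as the following loop; the interface (`project`, `upper_policy`, `lower_controller`, and the environment methods) is an illustrative assumption, not the paper's implementation:

```python
def run_episode(env, upper_policy, lower_controller, project, max_subgoals=10):
    """One episode of the two-level scheme (illustrative interface).

    project          : maps the full state to the low-dimensional state.
    upper_policy     : picks the next sub-goal from the low-dim state.
    lower_controller : returns an action driving the state to a sub-goal.
    """
    x = env.reset()
    for _ in range(max_subgoals):
        # Upper level: global exploration in the low-dimensional space.
        subgoal = upper_policy(project(x))
        # Lower level: local trajectory in the high-dimensional space.
        while not env.reached(x, subgoal):
            x, done = env.step(lower_controller(x, subgoal))
            if done:
                return x
    return x
```

The point of the split is that the upper level's search space stays small regardless of the full state dimension, while each lower-level problem is local and nearly linear.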
Simulations
First, we show simulation results of the stand-up task with a two-joint, three-link robot using the hierarchical RL architecture. We then investigate the basic properties of the hierarchical architecture in a simplified stand-up task with one joint. We show how the performance changes with the action step size in the upper level. We also compare the performance of the hierarchical RL architectures with that of non-hierarchical RL architectures. Finally, we show the role of the upper-level reward Rsub
Real robot experiments
Next, we applied the hierarchical RL architecture to a real robot (see the configuration in Fig. 11). As the initial condition for the real-robot learning, we used the sub-goal sequence and non-linear controllers acquired in the simulation of Section 3.1.
We used a PC/AT with a Pentium 233 MHz CPU and RT-Linux as the operating system for controlling the robot (see Fig. 12). The time step of the lower-level learning was Δt=0.01 [s], and that of the servo control
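A fixed-period control loop like the one described (lower-level learning step of Δt = 0.01 s) can be sketched as follows; the sensor and actuator callbacks are placeholders, and a real RT-Linux controller would rely on the kernel's periodic scheduling rather than sleep():

```python
import time

def control_loop(read_sensors, compute_action, send_command, dt=0.01, steps=100):
    """Run a sense-compute-act cycle at a fixed period of dt seconds."""
    next_tick = time.monotonic()
    for _ in range(steps):
        x = read_sensors()                 # sample the robot state
        send_command(compute_action(x))    # apply the control output
        next_tick += dt
        # Sleep until the next tick; track absolute time so that
        # computation delays do not accumulate into period drift.
        time.sleep(max(0.0, next_tick - time.monotonic()))
```

Keeping the period stable matters here because the lower-level learner's finite-difference estimates assume a constant Δt.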
Discussion
In this section, we summarize the achievements of this study in relation to previous work on hierarchical RL, RL with real robots, and the stand-up task for robots.
Conclusions
We proposed a hierarchical RL architecture that uses a low-dimensional state representation in the upper level. The stand-up task was accomplished by the hierarchical RL architecture using a real, two-joint, three-link robot. We showed that the hierarchical RL architecture achieved the task much faster and more robustly than a plain RL architecture. We also showed that successful stand-up was not so sensitive to the choice of the upper-level step size and that upper-level reward Rsub was
Acknowledgements
We would like to thank Mitsuo Kawato, Stefan Schaal, Christopher G. Atkeson, Tsukasa Ogasawara, Kazuyuki Samejima, Andrew G. Barto, and the anonymous reviewers for their helpful comments.
References (28)
- et al., RoboCup: Today and tomorrow — What we have learned, Artificial Intelligence (1999)
- et al., Cooperative behavior acquisition for mobile robots in dynamically changing real worlds via vision-based reinforcement learning and development, Artificial Intelligence (1999)
- Learning social behaviors, Robotics and Autonomous Systems (1997)
- Reinforcement learning of multiple tasks using a hierarchical CMAC architecture, Robotics and Autonomous Systems (1995)
- P. Dayan, G.E. Hinton, Feudal reinforcement learning, in: Advances in Neural Information Processing Systems, Vol. 5, ...
- B.L. Digney, Learning hierarchical control structures for multiple tasks and changing environments, in: Proceedings of ...
- K. Doya, Efficient nonlinear control with actor–tutor architecture, in: M.C. Mozer, M.I. Jordan, T. Petsche (Eds.), ...
- Reinforcement learning in continuous time and space, Neural Computation (2000)
- M. Inaba, I. Igarashi, K. Kagami, I. Hirochika, A 35 DOF humanoid that can coordinate arms and legs in standing up, ...
- et al., Hierarchical mixtures of experts and the EM algorithm, Neural Computation (1994)
Cited by (184)
- A survey on control of humanoid fall over (2023, Robotics and Autonomous Systems)
- Social impact and governance of AI and neurotechnologies (2022, Neural Networks)
- Deep learning, reinforcement learning, and world models (2022, Neural Networks)
- Layered Relative Entropy Policy Search (2021, Knowledge-Based Systems). Citation excerpt: "The subtasks have also been used in continuous problems. In [5], a hierarchical RL approach was proposed in which the subtasks specified the desired configurations of the robot joints. Another method is hierarchical policy gradient [6] in which individual subtasks were learned using policy gradient, whereas subtask selection was learned using value function-based methods."
- 50 Years Since the Marr, Ito, and Albus Models of the Cerebellum (2021, Neuroscience). Citation excerpt: "In most successful engineering applications of hierarchical reinforcement learning algorithms in robotics, higher-level abstract representations and/or sub-goals of the bottom layer were determined by researchers (Atkeson et al., 2000). So far, intermediate goal postures for a standing robot were manually selected as representations in the top layer by Morimoto and Doya (2001). As another example, Bentivegna et al. (2003) selected right-bank shots as higher-level actions in air-hockey by a humanoid robot DB."
Jun Morimoto received his B.E. in Computer-Controlled Mechanical Systems from Osaka University in 1996, his M.E. in Information Science from the Nara Institute of Science and Technology in 1998, and his Ph.D. in Information Science from the Nara Institute of Science and Technology in 2001. He was a Research Assistant at the Kawato Dynamic Brain Project, ERATO, JST in 1999. He is now a postdoctoral fellow at the Robotics Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania. He is a member of the Japanese Neural Network Society and the Robotics Society of Japan. He received the Young Investigator Award from the Japanese Neural Network Society in 2000. His research interests include reinforcement learning and robotics.
Kenji Doya received his B.S., M.S., and Ph.D. in Mathematical Engineering from the University of Tokyo in 1984, 1986, and 1991, respectively. He was a Research Associate at the University of Tokyo in 1986, a post-graduate researcher at the Department of Biology, UCSD in 1991, and a Research Associate of the Howard Hughes Medical Institute at the Computational Neurobiology Laboratory, Salk Institute in 1993. He took the positions of Senior Researcher at ATR Human Information Processing Research Laboratories in 1994, leader of the Computational Neurobiology Group at the Kawato Dynamic Brain Project, ERATO, JST in 1996, and leader of the Neuroinformatics Project at the Information Sciences Division, ATR International in 2000. He has been a visiting Associate Professor at the Nara Institute of Science and Technology since 1995, and the Director of Metalearning, Neuromodulation, and Emotion Research, CREST, JST since 1999. He is an Action Editor of Neural Networks and Neural Computation, a board member of the Japanese Neural Network Society, and a member of the Society for Neuroscience and the International Neural Network Society. His research interests include non-linear dynamics, reinforcement learning, the functions of the basal ganglia and the cerebellum, and the roles of neuromodulators in metalearning.