A survey of robot learning from demonstration

https://doi.org/10.1016/j.robot.2008.10.024

Abstract

We present a comprehensive survey of robot Learning from Demonstration (LfD), a technique that develops policies from example state-to-action mappings. We introduce the LfD design choices in terms of demonstrator, problem space, policy derivation and performance, and contribute the foundations for a structure in which to categorize LfD research. Specifically, we analyze and categorize the multiple ways in which examples are gathered, ranging from teleoperation to imitation, as well as the various techniques for policy derivation, including mapping functions, dynamics models and plans. To conclude, we discuss LfD limitations and related promising areas for future research.

Introduction

The problem of learning a mapping between world state and actions lies at the heart of many robotics applications. This mapping, also called a policy, enables a robot to select an action based upon its current world state. Developing such policies by hand is often very challenging, and as a result machine learning techniques have been applied to policy development. In this survey, we examine a particular approach to policy learning, Learning from Demonstration (LfD).

Within LfD, a policy is learned from examples, or demonstrations, provided by a teacher. We define examples as sequences of state–action pairs that are recorded during the teacher’s demonstration of the desired robot behavior. LfD algorithms utilize this dataset of examples to derive a policy that reproduces the demonstrated behavior. This approach to obtaining a policy is in contrast to other techniques in which a policy is learned from experience, for example building a policy based on data acquired through exploration, as in Reinforcement Learning [1]. We note that a policy derived under LfD is necessarily defined only in those states encountered, and for those corresponding actions taken, during the example executions.

In this article, we present a survey of recent work within the LfD community, focusing specifically on robotic applications. We segment the LfD learning problem into two fundamental phases: gathering the examples, and deriving a policy from those examples. Based on our identification of the defining features of these techniques, we contribute a comprehensive survey and categorization of existing LfD approaches. Though LfD has been applied to a variety of robotics problems, to our knowledge there exists no established structure for concretely placing work within the larger community. In general, approaches are appropriately contrasted with similar or seminal research, but their relation to the remainder of the field remains largely unaddressed. Establishing these relations is further complicated by the use of real-world robotic platforms, whose physical details may vary greatly between implementations that nonetheless employ fundamentally identical learning techniques, or vice versa. A categorical structure therefore aids in comparative assessments among applications, as well as in identifying open areas for future research. In contributing our categorization of current approaches, we aim to lay the foundations for such a structure.

For the remainder of this section we motivate the application of LfD to robotics, and present a formal definition of the LfD problem. Section 2 presents the key design decisions for an LfD system. Methods for gathering demonstration examples are the focus of Section 3, where the various approaches to teacher demonstration and data recording are discussed. Section 4 examines the core techniques for policy derivation within LfD, followed in Section 5 by methods for improving robot performance beyond the capabilities of the teacher examples. To conclude, we identify and discuss open areas of research for future work in Section 6 and summarize the article in Section 7.

The presence of robots within society is becoming ever more prevalent. Whether for an exploration rover in space, a robot soccer player or a recreational robot for the home, successful autonomous robot operation requires robust control algorithms. Non-robotics-experts may be increasingly presented with opportunities to interact with robots, and it is reasonable to expect that they have ideas about what a robot should do, and therefore what sort of behaviors these control algorithms should produce. A natural, and practical, extension of having this knowledge is to actually develop the desired control algorithm. Currently, however, policy development is a complex process restricted to experts within the field.

Traditional approaches to robot control model the domain dynamics and derive mathematically-based policies. Though theoretically well-founded, these approaches depend heavily upon the accuracy of the world model. Not only does this model require considerable expertise to develop, but approximations such as linearization are often introduced for computational tractability, thereby degrading performance. Other approaches, such as Reinforcement Learning, guide policy learning by providing reward feedback about the desirability of visiting particular states. Defining a suitable reward function, however, is known to be difficult and itself requires considerable expertise. Furthermore, building the policy requires gathering information by visiting states to receive rewards, which is non-trivial for a robot learner executing actual actions in the real world.

Considering these challenges, LfD has many attractive points for both learner and teacher. LfD formulations typically do not require expert knowledge of the domain dynamics, which removes performance brittleness resulting from model simplifications. The absence of this expert domain knowledge requirement also opens policy development to non-robotics-experts, satisfying a need that increases as robots become more commonplace. Furthermore, demonstration has the attractive feature of being an intuitive medium for communication from humans, who already use demonstration to teach other humans. Demonstration also has the practical feature of focusing the dataset to areas of the state–space actually encountered during task execution.

LfD can be seen as a subset of Supervised Learning. In Supervised Learning the agent is presented with labeled training data and learns an approximation to the function that produced the data. Within LfD, this training dataset is composed of example executions of the task by a demonstration teacher (Fig. 1, top).

We formally construct the LfD problem as follows. The world consists of states S and actions A, with the mapping between states by way of actions defined by a probabilistic transition function T(s′|s, a) : S × A × S → [0, 1]. We do not assume that the state is fully observable; the learner instead has access to an observed state Z, through the mapping M : S → Z. A policy π : Z → A selects actions based on observations of the world state. A single cycle of policy execution at time t is shown in Fig. 1 (bottom).
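
To make this formalism concrete, the following minimal Python sketch (hypothetical names and structures, not drawn from the surveyed work) represents one execution cycle: the learner observes z = M(s), the policy selects a = π(z), and the world state transitions according to T.

```python
import random

def sample_transition(T, s, a):
    """Sample a next state s' from the probabilistic transition function
    T(s'|s, a), here represented as a dict mapping (s, a) -> {s': prob}."""
    next_states = list(T[(s, a)].keys())
    probs = list(T[(s, a)].values())
    return random.choices(next_states, weights=probs, k=1)[0]

def execute_cycle(s, M, pi, T):
    """One cycle of policy execution at time t (cf. Fig. 1, bottom)."""
    z = M(s)                              # observation:       M : S -> Z
    a = pi(z)                             # action selection:  pi : Z -> A
    s_next = sample_transition(T, s, a)   # world transition via T(s'|s, a)
    return z, a, s_next
```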

The set A ranges from containing low-level motions to high-level behaviors. For some simulated world applications, state may be fully transparent, in which case M=I, the identity mapping. For all other applications state is not fully transparent and must be observed, for example through sensors in the real world. For succinctness, throughout the text we will use “state” interchangeably with “observed state.” It should be assumed, however, that state is always the observed state, unless explicitly noted otherwise. This assumption will be reinforced by use of the Z notation throughout the text.

Throughout the teacher execution, states and selected actions are recorded. We represent a demonstration d_j ∈ D formally as k_j pairs of observations and actions: d_j = {(z_j^i, a_j^i)}, with z_j^i ∈ Z, a_j^i ∈ A, i = 0…k_j. These demonstrations set LfD apart from other learning approaches. The set D of the demonstrations is made available to the learner. The policy derived from this dataset enables the learner to select an action based on the current state.
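
As an illustration only, a demonstration d_j and the dataset D might be represented as in the sketch below; the concrete observation and action types are hypothetical placeholders, not a representation taken from the surveyed systems.

```python
from dataclasses import dataclass
from typing import List, Tuple

Observation = Tuple[float, ...]   # an element of Z, e.g. processed sensor readings
Action = str                      # an element of A, e.g. a discrete action label

@dataclass
class Demonstration:
    """One teacher execution d_j: the recorded observation-action pairs."""
    pairs: List[Tuple[Observation, Action]]

# The dataset D of demonstrations made available to the learner.
D: List[Demonstration] = [
    Demonstration(pairs=[((0.1, 0.0), "forward"), ((0.4, 0.1), "turn-left")]),
    Demonstration(pairs=[((0.0, 0.2), "forward"), ((0.3, 0.3), "stop")]),
]
```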

Before continuing, we pause to place the intent of this survey within the context of previous LfD literature. The aim of this survey is to review the broad topic of LfD, to provide a categorization that highlights differences between approaches, and to identify research areas within LfD that have not yet been explored.

We begin with a comment on terminology. Demonstration-based learning techniques are described by a variety of terms within the published literature, including Learning by Demonstration (LbD), Learning from Demonstration (LfD), Programming by Demonstration (PbD), Learning by Experienced Demonstrations, Assembly Plan from Observation, Learning by Showing, Learning by Watching, Learning from Observation, behavioral cloning, imitation and mimicry. While the definitions for some of these terms, such as imitation, have been loosely borrowed from other sciences, the overall use of these terms is often inconsistent or contradictory across articles.

Within this article, we refer to the general category of algorithms in which a policy is derived based on demonstrated data as Learning from Demonstration (LfD). Within this category, we further distinguish between approaches by their various characteristics, as outlined in Section 2, such as the source of the demonstrations and the learning techniques applied. Subsequent sections introduce terms used to characterize algorithmic differences. Due to the already contradictory use of terms in the existing literature, our definitions will not always agree with those of other publications. Our intent, however, is not for others in the field to adopt the terminology presented here, but rather to provide a consistent set of definitions that highlight distinctions between techniques.

Regarding a categorization for approaches, we note that many legitimate criteria could be used to subdivide LfD research. For example, one proposed categorization considers the broad spectrum of who, what, when and how to imitate, or subsets thereof [2], [3]. Our review aims to focus on the specifics of implementation. We therefore categorize approaches according to the computational formulations and techniques required to implement an LfD system.

To conclude, readers may also find useful other related surveys of the LfD research area. In particular, the book Imitation in Animals and Artifacts [4] provides an interdisciplinary overview of research in imitation learning, presenting leading work from neuroscience, psychology and linguistics as well as computer science. A narrower focus is presented in the chapter “Robot Programming by Demonstration” [2] within the book Handbook of Robotics. This work particularly highlights techniques that may augment or combine with traditional LfD, such as giving the teacher an active role during learning. By contrast, our focus is to provide a categorical structure for LfD approaches, in addition to presenting the specifics of implementation. We do refer the reader to this chapter for a more comprehensive historical overview of LfD, as the scope of our survey is restricted to recently published literature. Additional reviews that cover specific sub-areas of LfD research in detail are highlighted throughout the article.

Section snippets

Design choices

There are certain aspects of LfD that are common to all applications to date. One is the fact that a teacher demonstrates execution of a desired behavior. Another is that the learner is provided with a set of these demonstrations, and from them derives a policy able to reproduce the demonstrated behavior.

However, the developer still faces many design choices when developing a new LfD system. Some of these decisions, such as the choice of a discrete or continuous action representation, may…

Gathering examples: How the dataset is built

In this section, we discuss various techniques for executing and recording demonstrations. The LfD dataset is composed of state–action pairs recorded during teacher executions of the desired behavior. Exactly how they are recorded, and what the teacher uses as a platform for the execution, varies greatly across approaches. Examples range from sensors on the robot learner recording its own actions as it is passively teleoperated by the teacher, to a camera recording a human teacher as she…
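
As a hedged illustration of the first of these options, a robot that records its own observed state while being passively teleoperated might log a demonstration along the lines of the sketch below; robot and teacher_interface are hypothetical platform-specific objects, not an API from the surveyed systems.

```python
import time

def record_teleoperated_demonstration(robot, teacher_interface,
                                      duration_s=10.0, period_s=0.1):
    """Record one demonstration d_j while the teacher teleoperates the robot.

    Returns a list of (observation, action) pairs: the robot passively
    executes the teacher's commands and logs what it observes and does."""
    pairs = []
    t_end = time.time() + duration_s
    while time.time() < t_end:
        z = robot.observe()                      # observed state, z in Z
        a = teacher_interface.current_command()  # teacher-selected action, a in A
        robot.execute(a)                         # robot carries out the command
        pairs.append((z, a))
        time.sleep(period_s)
    return pairs
```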

Deriving a policy: The source of the state-to-action mapping

Given a dataset of state–action examples acquired using one of the methods described in the previous section, we now discuss methods for deriving a policy from these data. LfD has seen the development of three core approaches to deriving policies from demonstration data, as summarized in Fig. 2. Learning a policy can involve simply learning an approximation to the state–action mapping (mapping function), or learning a model of the world dynamics and deriving a policy from this…
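
To give a sense of the simplest of these, the mapping-function approach, the sketch below derives a policy as a supervised classifier over the pooled observation-action pairs. The use of scikit-learn's k-nearest-neighbors classifier is only one possible function approximator chosen for illustration, not a method prescribed by the survey.

```python
from sklearn.neighbors import KNeighborsClassifier

def derive_mapping_policy(demonstrations, k=3):
    """Derive a policy pi: Z -> A by approximating the state-action mapping.

    `demonstrations` is a list of demonstrations, each a sequence of
    (observation, action) pairs, with observations as numeric feature
    vectors and actions as discrete labels."""
    Z = [list(z) for demo in demonstrations for (z, a) in demo]
    A = [a for demo in demonstrations for (z, a) in demo]
    classifier = KNeighborsClassifier(n_neighbors=min(k, len(Z))).fit(Z, A)
    return lambda z: classifier.predict([list(z)])[0]

# Example usage with toy data:
# policy = derive_mapping_policy([[((0.1, 0.0), "forward"), ((0.4, 0.1), "turn-left")]])
# action = policy((0.2, 0.05))
```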

Limitations of the demonstration dataset

LfD systems are inherently linked to the information provided in the demonstration dataset. As a result, learner performance is heavily limited by the quality of this information. In this article we identify two distinct causes for poor learner performance within LfD frameworks and survey the techniques that have been developed to address each limitation. The first cause, discussed in Section 5.1, is due to dataset sparsity, or the existence of areas of the state space that have not been…

Future directions

As highlighted by the discussion in the previous sections, current approaches to LfD address a wide variety of problems under many different conditions and assumptions. In this section, we aim to highlight several promising areas of LfD research that have received limited attention, ranging from data representation to issues of system robustness and evaluation metrics.

Conclusion

In this article we have presented a comprehensive survey of Learning from Demonstration (LfD) techniques employed to address the robotics control problem. LfD has the attractive characteristics of being an intuitive communication medium for human teachers and of opening control algorithm development to non-robotics-experts. Additionally, LfD complements many traditional policy learning techniques, offering a solution to some of the weaknesses in traditional approaches. Consequently, LfD has…

Acknowledgements

This research is partly sponsored by the Boeing Corporation under Grant No. CMU-BA-GTA-1, BBNT Solutions under subcontract No. 950008572, via prime Air Force contract No. SA-8650-06-C-7606, and the Qatar Foundation for Education, Science and Community Development. The views and conclusions contained in this document are solely those of the authors.

The authors would like to thank J. Andrew Bagnell and Darrin Bentivegna for feedback on the content and scope of this article.

References (105)

  • J. Steil et al., Situated robot learning for multi-modal instruction and imitation of grasping, Robot Learning by Demonstration, Robotics and Autonomous Systems (2004)
  • Y. Demiris et al., Hierarchical attentive multiple models for execution and recognition of actions, The Social Mechanisms of Robot Programming by Demonstration, Robotics and Autonomous Systems (2006)
  • B. Jansen et al., A computational model of intention reading in imitation, The Social Mechanisms of Robot Programming by Demonstration, Robotics and Autonomous Systems (2006)
  • J. Peters et al., Natural actor-critic, Neurocomputing (2008)
  • R.S. Sutton et al., Reinforcement learning: An introduction (1998)
  • A. Billard et al., Robot programming by demonstration
  • S. Schaal et al., Computational approaches to motor learning by imitation, Philosophical Transactions: Biological Sciences (2003)
  • C.L. Nehaniv et al.
  • A. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger, E. Liang, Inverted autonomous helicopter flight...
  • B. Browning, L. Xu, M. Veloso, Skill acquisition and use for a dynamically-balancing soccer robot, in: Proceeding of...
  • P.K. Pook, D.H. Ballard, Recognizing teleoperated manipulations, in: Proceedings of the IEEE International Conference...
  • J.D. Sweeney, R.A. Grupen, A model of shared grasp affordances from demonstration, in: Proceedings of the IEEE-RAS...
  • J. Chen et al., Programing by demonstration: Coping with suboptimal teaching actions, The International Journal of Robotics Research (2003)
  • T. Inamura, M. Inaba, H. Inoue, Acquisition of probabilistic behavior decision model based on the interactive teaching...
  • W.D. Smart, Making Reinforcement Learning Work on Real Robots, Ph.D. Thesis, Department of Computer Science, Brown...
  • J.A. Clouse, On integrating apprentice learning and reinforcement learning, Ph.D. Thesis, University of Massachusetts,...
  • R.P.N. Rao et al., A Bayesian model of imitation in infants and robots
  • P. Abbeel, A.Y. Ng, Apprenticeship learning via inverse reinforcement learning, in: Proceedings of the 21st...
  • S. Chernova, M. Veloso, Multi-thresholded approach to demonstration selection for interactive robot learning, in:...
  • R. Aler et al., Correcting and improving imitation models of humans for robosoccer agents, Evolutionary Computation (2005)
  • P.E. Rybski, K. Yoon, J. Stolarz, M.M. Veloso, Interactive robot task training through dialog and demonstration, in:...
  • B. Argall, B. Browning, M. Veloso, Learning from demonstration with the critique of a human teacher, in: Proceedings of...
  • D.H. Grollman, O.C. Jenkins, Dogged learning for robots, in: Proceedings of the IEEE International Conference on...
  • M.T. Rosenstein et al., Supervised actor-critic reinforcement learning
  • G.Z. Grudic, P.D. Lawrence, Human-to-robot skill transfer using the spore approximation, in: Proceedings of the IEEE...
  • Y. Demiris et al.
  • M.N. Nicolescu, M.J. Matarić, Experience-based representation construction: Learning from human and robot teachers, in:...
  • U. Nehmzow, O. Akanyeti, C. Weinrich, T. Kyriacou, S. Billings, Robot programming by demonstration through system...
  • A.J. Ijspeert, J. Nakanishi, S. Schaal, Movement imitation with nonlinear dynamical systems in humanoid robots, in:...
  • S. Calinon, A. Billard, Incremental learning of gestures by imitation in a humanoid robot, in: Proceedings of the 2nd...
  • C.G. Atkeson, S. Schaal, Robot learning from demonstration, in: Proceedings of the Fourteenth International Conference...
  • D.C. Bentivegna, A. Ude, C.G. Atkeson, G. Cheng, Humanoid robot learning and game playing using PC-based vision, in:...
  • R. Amit, M. Matarić, Learning movement sequences from demonstration, in: Proceedings of the 2nd International...
  • N. Pollard, J.K. Hodgins, Generalizing demonstrated manipulation tasks, in: Workshop on the Algorithmic Foundations of...
  • I. Infantino et al., A posture sequence learning system for an anthropomorphic robotic hand, Robotics and Autonomous Systems (2004)
  • M. Lopes et al., Visual learning by imitation with motor representations, IEEE Transactions on Systems, Man, and Cybernetics, Part B (2005)
  • R.M. Voyles et al., A multi-agent system for programming robots by human demonstration, Integrated Computer-Aided Engineering (2001)
  • J. Lieberman, C. Breazeal, Improvements on action parsing and action interpolation for learning through demonstration,...
  • M.J. Matarić

Brenna D. Argall is a Ph.D. candidate in the Robotics Institute at Carnegie Mellon University. Prior to graduate school, she held a Computational Biology position in the Laboratory of Brain and Cognition, at the National Institutes of Health, while investigating visualization techniques for neural fMRI data. Argall received an M.S. in Robotics in 2006, and in 2002 a B.S. in Mathematics, both from Carnegie Mellon. Her research interests focus upon machine learning techniques to develop and improve robot control systems, under the guidance of a human teacher.

Sonia Chernova is a Ph.D. student in the Computer Science Department at Carnegie Mellon University. She received her undergraduate degree in Computer Science and Robotics from Carnegie Mellon University in 2003. Her research interests include learning and interaction in robotic systems.

Manuela Veloso is Herbert A. Simon Professor of Computer Science at Carnegie Mellon University. She received a licenciatura in Electrical Engineering in 1980, and an M.Sc. in Electrical and Computer Engineering in 1984 from the Instituto Superior Tecnico in Lisbon. She earned her Ph.D. in Computer Science from Carnegie Mellon in 1992. Veloso researches in planning, control learning, and execution algorithms, in particular for multi-robot teams. With her students, Veloso has developed teams of robot soccer agents, which have been RoboCup world champions several times. She is a Fellow of AAAI, the Association for the Advancement of Artificial Intelligence, an IEEE Senior Member, and the President Elect (2008) of the International RoboCup Federation.

Brett Browning is a Senior Systems Scientist in Carnegie Mellon University’s School of Computer Science, where he has been a faculty member of the Robotics Institute since 2002. Prior to that, he was a postdoctoral fellow at Carnegie Mellon working with Manuela Veloso. Browning received his Ph.D. from the University of Queensland in 2000, and a B.E. in Electrical Engineering and a B.Sc. in Mathematics from the same institution in 1996. His research interests are in robot autonomy, and in particular real-time robot perception, applied machine learning, and teamwork.
