A survey of robot learning from demonstration
Introduction
The problem of learning a mapping between world state and actions lies at the heart of many robotics applications. This mapping, also called a policy, enables a robot to select an action based upon its current world state. The development of policies by hand is often very challenging and as a result machine learning techniques have been applied to policy development. In this survey, we examine a particular approach to policy learning, Learning from Demonstration (LfD).
Within LfD, a policy is learned from examples, or demonstrations, provided by a teacher. We define examples as sequences of state–action pairs that are recorded during the teacher’s demonstration of the desired robot behavior. LfD algorithms utilize this dataset of examples to derive a policy that reproduces the demonstrated behavior. This approach to obtaining a policy is in contrast to other techniques in which a policy is learned from experience, for example building a policy based on data acquired through exploration, as in Reinforcement Learning [1]. We note that a policy derived under LfD is necessarily defined only in those states encountered, and for those corresponding actions taken, during the example executions.
In this article, we present a survey of recent work within the LfD community, focusing specifically on robotic applications. We segment the LfD learning problem into two fundamental phases: gathering the examples, and deriving a policy from such examples. Based on our identification of the defining features of these techniques, we contribute a comprehensive survey and categorization of existing LfD approaches. Though LfD has been applied to a variety of robotics problems, to our knowledge there exists no established structure for concretely placing work within the larger community. In general, approaches are appropriately contrasted with similar or seminal research, but their relation to the remainder of the field remains largely unaddressed. Establishing these relations is further complicated by the use of real world robotic platforms, whose physical details may vary greatly between implementations that nevertheless employ fundamentally identical learning techniques, or vice versa. A categorical structure therefore aids in comparative assessments among applications, as well as in identifying open areas for future research. In contributing our categorization of current approaches, we aim to lay the foundations for such a structure.
For the remainder of this section we motivate the application of LfD to robotics, and present a formal definition of the LfD problem. Section 2 presents the key design decisions for an LfD system. Methods for gathering demonstration examples are the focus of Section 3, where the various approaches to teacher demonstration and data recording are discussed. Section 4 examines the core techniques for policy derivation within LfD, followed in Section 5 by methods for improving robot performance beyond the capabilities of the teacher examples. To conclude, we identify and discuss open areas of research for future work in Section 6 and summarize the article with Section 7.
The presence of robots within society is becoming ever more prevalent. Whether an exploration rover in space, a robot soccer player or a recreational robot for the home, successful autonomous robot operation requires robust control algorithms. Non-robotics-experts may be increasingly presented with opportunities to interact with robots, and it is reasonable to expect that they have ideas about what a robot should do, and therefore what sort of behaviors these control algorithms should produce. A natural, and practical, extension of having this knowledge is to actually develop the desired control algorithm. Currently, however, policy development is a complex process restricted to experts within the field.
Traditional approaches to robot control model domain dynamics and derive mathematically-based policies. Though theoretically well-founded, these approaches depend heavily upon the accuracy of the world model. Not only does this model require considerable expertise to develop, but approximations such as linearization are often introduced for computational tractability, thereby degrading performance. Other approaches, such as Reinforcement Learning, guide policy learning by providing reward feedback about the desirability of visiting particular states. Defining a function to provide this reward, however, is known to be difficult and itself requires considerable expertise. Furthermore, building the policy requires gathering information by visiting states to receive rewards, which is non-trivial for a robot learner executing actual actions in the real world.
Considering these challenges, LfD has many attractive points for both learner and teacher. LfD formulations typically do not require expert knowledge of the domain dynamics, which removes performance brittleness resulting from model simplifications. The absence of this expert domain knowledge requirement also opens policy development to non-robotics-experts, satisfying a need that increases as robots become more commonplace. Furthermore, demonstration has the attractive feature of being an intuitive medium for communication from humans, who already use demonstration to teach other humans. Demonstration also has the practical feature of focusing the dataset to areas of the state–space actually encountered during task execution.
LfD can be seen as a subset of Supervised Learning. In Supervised Learning the agent is presented with labeled training data and learns an approximation to the function which produced the data. Within LfD, this training dataset is composed of example executions of the task by a demonstration teacher (Fig. 1, top).
We formally construct the LfD problem as follows. The world consists of states S and actions A, with the mapping between states by way of actions being defined by a probabilistic transition function T(s′|s, a) : S × A × S → [0, 1]. We assume that the state is not fully observable. The learner instead has access to observed state z ∈ Z, through the mapping M : S → Z. A policy π : Z → A selects actions based on observations of the world state. A single cycle of policy execution at time t is shown in Fig. 1 (bottom).
The set A ranges from containing low-level motions to high-level behaviors. For some simulated world applications, state may be fully transparent, in which case M is the identity mapping and Z = S. For all other applications state is not fully transparent and must be observed, for example through sensors in the real world. For succinctness, throughout the text we will use “state” interchangeably with “observed state.” It should be assumed, however, that state is always the observed state, unless explicitly noted otherwise. This assumption will be reinforced by use of the notation z ∈ Z throughout the text.
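The single policy-execution cycle just described (observe z = M(s), select a = π(z), transition under T) can be sketched in code. Every concrete function below is a toy stand-in chosen for illustration, not part of the survey's formalism:

```python
import random

def M(s):
    """Observation mapping S -> Z; here a noisy reading of a 1-D state."""
    return s + random.gauss(0.0, 0.01)

def pi(z):
    """Policy Z -> A; here a trivial threshold rule on the observation."""
    return 1.0 if z < 5.0 else -1.0

def T(s, a):
    """Probabilistic transition function; here additive with small noise."""
    return s + a + random.gauss(0.0, 0.01)

random.seed(0)
s = 0.0
for _ in range(3):     # three cycles of policy execution
    z = M(s)           # observe the (hidden) world state
    a = pi(z)          # select an action from the observation
    s = T(s, a)        # world transitions to the next state
print(s)               # close to 3.0 after three +1 steps
```

The learner never touches s directly; it acts only through the observation z, which is the distinction the observability assumption above encodes.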
Throughout the teacher execution, states and selected actions are recorded. We represent a demonstration d_j ∈ D formally as k_j pairs of observations and actions: d_j = {(z_i^j, a_i^j)}, i = 0 … k_j. These demonstrations set LfD apart from other learning approaches. The set D of demonstrations is made available to the learner. The policy π derived from this dataset enables the learner to select an action a based on the current observed state z.
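Concretely, a demonstration can be stored as a sequence of (observation, action) pairs, and the dataset D as a collection of such sequences. A minimal sketch (all names here are hypothetical, chosen only for illustration):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Observation = Tuple[float, ...]   # an observed state z in Z
Action = str                      # a discrete action a in A

@dataclass
class Demonstration:
    """One teacher execution: the recorded (z, a) pairs."""
    pairs: List[Tuple[Observation, Action]] = field(default_factory=list)

    def record(self, z: Observation, a: Action) -> None:
        self.pairs.append((z, a))

# The dataset D is simply the set of recorded demonstrations.
d1 = Demonstration()
d1.record((0.0, 0.0), "forward")
d1.record((1.0, 0.0), "turn_left")
D = [d1]

# Flattening D yields supervised training examples mapping z -> a.
examples = [(z, a) for d in D for (z, a) in d.pairs]
print(len(examples))  # 2
```

The flattened examples are exactly the labeled training data of the Supervised Learning view above: observations play the role of inputs, demonstrated actions the role of labels.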
Before continuing, we pause to place the intents of this survey within the context of previous LfD literature. The aim of this survey is to review the broad topic of LfD, to provide a categorization that highlights differences between approaches, and to identify research areas within LfD that have not yet been explored.
We begin with a comment on terminology. Demonstration-based learning techniques are described by a variety of terms within the published literature, including Learning by Demonstration (LbD), Learning from Demonstration (LfD), Programming by Demonstration (PbD), Learning by Experienced Demonstrations, Assembly Plan from Observation, Learning by Showing, Learning by Watching, Learning from Observation, behavioral cloning, imitation and mimicry. While the definitions for some of these terms, such as imitation, have been loosely borrowed from other sciences, the overall use of these terms is often inconsistent or contradictory across articles.
Within this article, we refer to the general category of algorithms in which a policy is derived based on demonstrated data as Learning from Demonstration (LfD). Within this category, we further distinguish between approaches by their various characteristics, as outlined in Section 2, such as the source of the demonstrations and the learning techniques applied. Subsequent sections introduce terms used to characterize algorithmic differences. Due to the already contradictory use of terms in the existing literature, our definitions will not always agree with those of other publications. Our intent, however, is not for others in the field to adopt the terminology presented here, but rather to provide a consistent set of definitions that highlight distinctions between techniques.
Regarding a categorization for approaches, we note that many legitimate criteria could be used to subdivide LfD research. For example, one proposed categorization considers the broad spectrum of who, what, when and how to imitate, or subsets thereof [2], [3]. Our review aims to focus on the specifics of implementation. We therefore categorize approaches according to the computational formulations and techniques required to implement an LfD system.
To conclude, readers may also find useful other related surveys of the LfD research area. In particular, the book Imitation in Animals and Artifacts [4] provides an interdisciplinary overview of research in imitation learning, presenting leading work from neuroscience, psychology and linguistics as well as computer science. A narrower focus is presented in the chapter “Robot Programming by Demonstration” [2] within the book Handbook of Robotics. This work particularly highlights techniques which may augment or combine with traditional LfD, such as giving the teacher an active role during learning. By contrast, our focus is to provide a categorical structure for LfD approaches, in addition to presenting the specifics of implementation. We do refer the reader to this chapter for a more comprehensive historical overview of LfD, as the scope of our survey is restricted to recently published literature. Additional reviews that cover specific sub-areas of LfD research in detail are highlighted throughout the article.
Design choices
There are certain aspects of LfD which are common among all applications to date. One is the fact that a teacher demonstrates execution of a desired behavior. Another is that the learner is provided with a set of these demonstrations, and from them derives a policy able to reproduce the demonstrated behavior.
However, the developer still faces many design choices when developing a new LfD system. Some of these decisions, such as the choice of a discrete or continuous action representation, may be dictated by the task domain, while others are left to the discretion of the developer.
Gathering examples: How the dataset is built
In this section, we discuss various techniques for executing and recording demonstrations. The LfD dataset is composed of state–action pairs recorded during teacher executions of the desired behavior. Exactly how they are recorded, and what the teacher uses as a platform for the execution, varies greatly across approaches. Examples range from sensors on the robot learner recording its own actions as it is passively teleoperated by the teacher, to a camera recording a human teacher as she performs the task with her own body.
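The teleoperation-style recording mentioned here amounts to a simple loop: at each timestep the robot's own sensors supply the observed state while the human teacher supplies the action. A minimal sketch, in which sense and teacher_command are hypothetical stand-ins for a sensor interface and a joystick input:

```python
def sense(t):
    """Stand-in for the robot's onboard sensor reading at timestep t."""
    return (float(t), 0.0)

def teacher_command(t):
    """Stand-in for the teleoperating teacher's command at timestep t."""
    return "forward" if t < 3 else "stop"

# Record one demonstration as a sequence of (observation, action) pairs.
demonstration = []
for t in range(5):
    z = sense(t)              # observed state, from the learner's sensors
    a = teacher_command(t)    # action, chosen by the human teacher
    demonstration.append((z, a))

print(demonstration[-1])  # ((4.0, 0.0), 'stop')
```

Because the learner records through its own sensors while being teleoperated, the resulting pairs are already expressed in the learner's own state and action spaces, which is one reason this recording method is attractive.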
Deriving a policy: The source of the state to action mapping
Given a dataset of state–action examples that have been acquired using one of the methods described in the previous section, we now discuss methods for deriving a policy using this data. LfD has seen the development of three core approaches to deriving policies from demonstration data, as summarized in Fig. 2. Learning a policy can involve simply learning an approximation to the state–action mapping (mapping function), learning a model of the world dynamics and deriving a policy from this model (system model), or learning plans that associate preconditions and postconditions with each action (plans).
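The mapping-function approach is the most direct of the three: approximate z → a straight from the demonstrated pairs. As one minimal, hedged illustration (the dataset values are invented, and nearest-neighbor is just one of many function approximators used in the literature):

```python
import math

# Demonstration dataset of (observed state, action) pairs (toy values).
dataset = [
    ((0.0, 0.0), "forward"),
    ((1.0, 0.0), "forward"),
    ((2.0, 2.0), "turn_left"),
]

def policy(z):
    """Mapping-function approach: approximate z -> a directly, here with
    1-nearest-neighbor classification over the demonstrated states."""
    _, action = min(dataset, key=lambda pair: math.dist(pair[0], z))
    return action

print(policy((0.2, 0.1)))  # forward
print(policy((1.9, 1.8)))  # turn_left
```

A system-model approach would instead fit a transition model to the same pairs and plan through it, and a plan-based approach would recover action pre- and postconditions; all three consume the same demonstration dataset.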
Limitations of the demonstration dataset
LfD systems are inherently linked to the information provided in the demonstration dataset. As a result, learner performance is heavily limited by the quality of this information. In this article we identify two distinct causes for poor learner performance within LfD frameworks and survey the techniques that have been developed to address each limitation. The first cause, discussed in Section 5.1, is due to dataset sparsity, or the existence of areas of the state space that have not been demonstrated. The second, discussed in Section 5.2, is poor quality of the demonstration data itself, resulting for example from suboptimal teacher performance.
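Dataset sparsity can be made concrete with a simple distance test: a query state far from every demonstrated state lies in an undemonstrated region, where any action the derived policy selects is unsupported by data. A minimal sketch (the states and threshold are invented for illustration):

```python
import math

# Observed states that appeared somewhere in the demonstration dataset.
demonstrated_states = [(0.0, 0.0), (1.0, 0.0), (2.0, 2.0)]

def is_sparse(z, threshold=0.5):
    """Flag states far from every demonstrated state: the policy has no
    demonstration data there, so its action choice is unsupported."""
    nearest = min(math.dist(z, s) for s in demonstrated_states)
    return nearest > threshold

print(is_sparse((0.1, 0.1)))   # False: near demonstrated data
print(is_sparse((5.0, 5.0)))   # True: undemonstrated region
```

Checks of this flavor are one way a learner can detect when it has left the demonstrated portion of the state space and, for instance, request further demonstration.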
Future directions
As highlighted by the discussion in the previous sections, current approaches to LfD address a wide variety of problems under many different conditions and assumptions. In this section, we aim to highlight several promising areas of LfD research that have received limited attention, ranging from data representation to issues of system robustness and evaluation metrics.
Conclusion
In this article we have presented a comprehensive survey of Learning from Demonstration (LfD) techniques employed to address the robotics control problem. LfD has the attractive characteristics of being an intuitive communication medium for human teachers and of opening control algorithm development to non-robotics-experts. Additionally, LfD complements many traditional policy learning techniques, offering a solution to some of the weaknesses in traditional approaches. Consequently, LfD has grown into a distinct and active area of robotics research.
Acknowledgements
This research is partly sponsored by the Boeing Corporation under Grant No. CMU-BA-GTA-1, BBNT Solutions under subcontract No. 950008572, via prime Air Force contract No. SA-8650-06-C-7606, and the Qatar Foundation for Education, Science and Community Development. The views and conclusions contained in this document are solely those of the authors.
The authors would like to thank J. Andrew Bagnell and Darrin Bentivegna for feedback on the content and scope of this article.
References (105)
- et al., Discriminative and adaptive imitation in uni-manual and bi-manual tasks, Robotics and Autonomous Systems (special issue: The Social Mechanisms of Robot Programming by Demonstration), 2006
- et al., Using perspective taking to learn from ambiguous demonstrations, Robotics and Autonomous Systems (special issue: The Social Mechanisms of Robot Programming by Demonstration), 2006
- et al., Mobile robot programming using natural language, Robotics and Autonomous Systems, 2002
- et al., Interaction rule learning with a human partner based on an imitation faculty with a simple visuo-motor mapping, Robotics and Autonomous Systems (special issue: The Social Mechanisms of Robot Programming by Demonstration), 2006
- et al., Robots that imitate humans, Trends in Cognitive Sciences, 2002
- et al., Learning from demonstration and adaptation of biped locomotion, Robotics and Autonomous Systems, 2004
- et al., Robust trajectory learning and approximation for robot programming by demonstration, Robotics and Autonomous Systems (special issue: The Social Mechanisms of Robot Programming by Demonstration), 2006
- et al., Learning human arm movements by imitation: Evaluation of biologically inspired connectionist architecture, Robotics and Autonomous Systems, 2001
- et al., Programming full-body movements for humanoid robots by observation, Robotics and Autonomous Systems, 2004
- et al., A cognitive framework for imitation learning, Robotics and Autonomous Systems (special issue: The Social Mechanisms of Robot Programming by Demonstration), 2006
- Situated robot learning for multi-modal instruction and imitation of grasping, Robotics and Autonomous Systems (special issue: Robot Learning by Demonstration)
- Hierarchical attentive multiple models for execution and recognition of actions, Robotics and Autonomous Systems (special issue: The Social Mechanisms of Robot Programming by Demonstration)
- A computational model of intention reading in imitation, Robotics and Autonomous Systems (special issue: The Social Mechanisms of Robot Programming by Demonstration)
- Natural actor-critic, Neurocomputing
- Reinforcement learning: An introduction
- Robot programming by demonstration
- Computational approaches to motor learning by imitation, Philosophical Transactions: Biological Sciences
- Programing by demonstration: Coping with suboptimal teaching actions, The International Journal of Robotics Research
- A Bayesian model of imitation in infants and robots
- Correcting and improving imitation models of humans for robosoccer agents, Evolutionary Computation
- Supervised actor-critic reinforcement learning
- A posture sequence learning system for an anthropomorphic robotic hand, Robotics and Autonomous Systems
- Visual learning by imitation with motor representations, IEEE Transactions on Systems, Man, and Cybernetics, Part B
- A multi-agent system for programming robots by human demonstration, Integrated Computer-Aided Engineering
Brenna D. Argall is a Ph.D. candidate in the Robotics Institute at Carnegie Mellon University. Prior to graduate school, she held a Computational Biology position in the Laboratory of Brain and Cognition, at the National Institutes of Health, while investigating visualization techniques for neural fMRI data. Argall received an M.S. in Robotics in 2006, and in 2002 a B.S. in Mathematics, both from Carnegie Mellon. Her research interests focus upon machine learning techniques to develop and improve robot control systems, under the guidance of a human teacher.
Sonia Chernova is a Ph.D. student in the Computer Science Department at Carnegie Mellon University. She received her undergraduate degree in Computer Science and robotics from Carnegie Mellon University in 2003. Her research interests include learning and interaction in robotic systems.
Manuela Veloso is Herbert A. Simon Professor of Computer Science at Carnegie Mellon University. She received a licenciatura in Electrical Engineering in 1980, and an M.Sc. in Electrical and Computer Engineering in 1984 from the Instituto Superior Tecnico in Lisbon. She earned her Ph.D. in Computer Science from Carnegie Mellon in 1992. Veloso researches in planning, control learning, and execution algorithms, in particular for multi-robot teams. With her students, Veloso has developed teams of robot soccer agents, which have been RoboCup world champions several times. She is a Fellow of AAAI, the Association for the Advancement of Artificial Intelligence, an IEEE Senior member, and the President Elect (2008) of the International RoboCup Federation.
Brett Browning is a Senior Systems Scientist in Carnegie Mellon University’s School of Computer Science, where he has been a faculty member of the Robotics Institute since 2002. Prior to that, he was a postdoctoral fellow at Carnegie Mellon working with Manuela Veloso. Browning received his Ph.D. from the University of Queensland in 2000, and a B.Electrical Engineer and B.Sc (Math) from the same institution in 1996. His research interests are on robot autonomy and in particular real-time robot perception, applied machine learning, and teamwork.