Reinforcement learning of motor skills with policy gradients
Introduction
In order to ever leave the well-structured environments of factory floors and research labs, future robots will require the ability to acquire novel behaviors and motor skills as well as to improve existing ones based on rewards and costs. Similarly, our understanding of human motor control would benefit significantly if we could synthesize simulated human behavior and its underlying cost functions based on insights from machine learning and biological inspiration. Reinforcement learning is probably the most general framework in which such learning problems of computational motor control can be phrased. However, in order to bring reinforcement learning into the domain of human movement learning, two decisive components need to be added to the standard framework of reinforcement learning: first, we need a domain-specific policy representation for motor skills, and, second, we need reinforcement learning algorithms which work efficiently with this representation while scaling into the domain of high-dimensional mechanical systems such as humanoid robots.
Traditional representations of motor behaviors in robotics are mostly based on desired trajectories generated from spline interpolations between points, i.e., spline nodes, which are part of a longer sequence of intermediate target points on the way to a final movement goal. While such a representation is easy to understand, the resulting control policies, generated from a tracking controller of the spline trajectories, have a variety of significant disadvantages: they are time-indexed and thus not robust towards unforeseen disturbances, they do not easily generalize to new behavioral situations without complete recomputation of the spline, and they cannot easily be coordinated with other events in the environment, e.g., synchronized with other sensory variables like visual perception during catching a ball. In the literature, a variety of other approaches for parameterizing movement have been suggested to overcome these problems (see Ijspeert et al., 2002, Ijspeert et al., 2003 for more information). One of these approaches proposed using parameterized nonlinear dynamical systems as motor primitives, where the attractor properties of these dynamical systems define the desired behavior (Ijspeert et al., 2002, Ijspeert et al., 2003). The resulting framework is particularly well suited for supervised imitation learning in robotics, as demonstrated by examples from humanoid robotics where a full-body humanoid learned tennis swings or complex polyrhythmic drumming patterns. One goal of this paper is the application of reinforcement learning to both the traditional spline-based representation and the more novel dynamical-systems-based approach.
However, despite the fact that reinforcement learning is the most general framework for discussing the learning of movement in general, and of motor primitives for robotics in particular, most of the methods proposed in the reinforcement learning community are not applicable to high-dimensional systems such as humanoid robots. Among the main problems are that these methods do not scale to systems with more than three or four degrees of freedom and/or cannot deal with parameterized policies. Policy gradient methods are a notable exception to this statement. Starting with the pioneering work of Gullapalli and colleagues (Benbrahim and Franklin, 1997, Gullapalli et al., 1994) in the early 1990s, these methods have been applied to a variety of robot learning problems, ranging from simple control tasks (e.g., balancing a ball on a beam (Benbrahim, Doleac, Franklin, & Selfridge, 1992), and pole balancing (Kimura & Kobayashi, 1998)) to complex learning tasks involving many degrees of freedom, such as the learning of complex motor skills (Gullapalli et al., 1994, Mitsunaga et al., 2005, Miyamoto et al., 1995, Miyamoto et al., 1996, Peters and Schaal, 2006, Peters et al., 2005a) and locomotion (Endo et al., 2005, Kimura and Kobayashi, 1997, Kohl and Stone, 2004, Mori et al., 2004, Nakamura et al., 2004, Sato et al., 2002, Tedrake et al., 2005).
The advantages of policy gradient methods for parameterized motor primitives are numerous. Among the most important ones are that the policy representation can be chosen such that it is meaningful for the task, i.e., we can use a suitable motor primitive representation, and that domain knowledge can be incorporated, which often leads to fewer parameters in the learning process in comparison to traditional value-function-based approaches. Moreover, a variety of different algorithms for policy gradient estimation exist in the literature, most with rather strong theoretical foundations. Additionally, policy gradient methods can be used model-free and can therefore be applied to problems without analytically known task and reward models.
Nevertheless, many recent publications on applications of policy gradient methods in robotics have overlooked the newest developments in policy gradient theory and their original roots in the literature. Thus, a large number of heuristic applications of policy gradients can be found, where the success of the projects relies mainly on ingenious initializations and manual parameter tuning of the algorithms. A closer inspection often reveals that the chosen methods might be statistically biased, or might even generate infeasible policies under less fortunate parameter settings, which could lead to unsafe operation of a robot. The main goal of this paper is to discuss which policy gradient methods are applicable to robotics and which issues matter, while also introducing some new policy gradient learning algorithms that exhibit superior performance compared to previously suggested methods. The remainder of this paper proceeds as follows: firstly, we introduce the general assumptions of reinforcement learning, discuss motor primitives in this framework, and pose the problem statement of this paper. Secondly, we analyze the different approaches to policy gradient estimation and discuss their applicability to reinforcement learning of motor primitives; we focus on the most useful methods and examine several algorithms in depth. The algorithms presented in this paper are highly optimized versions of both novel and previously published policy gradient algorithms. Thirdly, we show how these methods can be applied to motor skill learning in humanoid robotics, with learning results on a seven degree-of-freedom anthropomorphic SARCOS Master Arm.
Most robotics domains require the state space and the action space to be continuous and high dimensional, such that learning methods based on discretizations are not applicable. However, as the policy is usually implemented on a digital computer, we assume that we can model the control system in a discrete-time manner, and we denote the current time step by $k$. In order to take possible stochasticity of the plant into account, we model it with a probability distribution $\mathbf{x}_{k+1} \sim p(\mathbf{x}_{k+1} \mid \mathbf{x}_k, \mathbf{u}_k)$, where $\mathbf{u}_k \in \mathbb{R}^M$ denotes the current action, and $\mathbf{x}_k, \mathbf{x}_{k+1} \in \mathbb{R}^N$ denote the current and the next state, respectively. We furthermore assume that actions are generated by a policy $\mathbf{u}_k \sim \pi_{\boldsymbol{\theta}}(\mathbf{u}_k \mid \mathbf{x}_k)$ which is modeled as a probability distribution in order to incorporate exploratory actions; for some special problems, the optimal solution to a control problem is actually a stochastic controller, see e.g., Sutton, McAllester, Singh, and Mansour (2000). The policy is parameterized by some policy parameters $\boldsymbol{\theta} \in \mathbb{R}^K$ and assumed to be continuously differentiable with respect to these parameters. The sequence of states and actions forms a trajectory (also called history or roll-out) denoted by $\boldsymbol{\tau} = [\mathbf{x}_{0:H}, \mathbf{u}_{0:H}]$, where $H$ denotes the horizon, which can be infinite. At each instant of time, the learning system receives a reward denoted by $r_k = r(\mathbf{x}_k, \mathbf{u}_k) \in \mathbb{R}$.
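To make this notation concrete, the following minimal sketch samples a single roll-out $\boldsymbol{\tau} = [\mathbf{x}_{0:H}, \mathbf{u}_{0:H}]$ together with its rewards; the scalar linear plant, linear-Gaussian policy, and quadratic reward are purely illustrative assumptions, not the systems used later in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def plant(x, u):
    """Toy stochastic plant p(x_{k+1} | x_k, u_k): linear dynamics plus Gaussian noise."""
    return 0.9 * x + 0.5 * u + 0.01 * rng.standard_normal()

def policy(x, theta, sigma=0.1):
    """Stochastic linear-Gaussian policy pi_theta(u | x) = N(theta_0 * x + theta_1, sigma^2)."""
    return theta[0] * x + theta[1] + sigma * rng.standard_normal()

def reward(x, u):
    """Illustrative reward r(x_k, u_k): penalize distance from the origin and control effort."""
    return -(x ** 2) - 0.1 * (u ** 2)

def rollout(theta, H=50, x0=1.0):
    """Sample one trajectory tau = [x_{0:H}, u_{0:H}] and the rewards r_k."""
    x, states, actions, rewards = x0, [], [], []
    for k in range(H + 1):
        u = policy(x, theta)
        states.append(x)
        actions.append(u)
        rewards.append(reward(x, u))
        x = plant(x, u)
    return np.array(states), np.array(actions), np.array(rewards)

states, actions, rewards = rollout(theta=np.array([-0.5, 0.0]))
print(states.shape, actions.shape, rewards.shape)
```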
The general goal of policy gradient reinforcement learning is to optimize the policy parameters $\boldsymbol{\theta} \in \mathbb{R}^K$ so that the expected return
$$J(\boldsymbol{\theta}) = a_{\Sigma}^{-1} E\Bigg\{\sum_{k=0}^{H} a_k r_k\Bigg\}$$
is optimized, where the $a_k$ denote time-step-dependent weighting factors and $a_{\Sigma}$ is a normalization factor in order to ensure that the normalized weights $a_k / a_{\Sigma}$ sum up to one. We require that the weighting factors fulfill $a_{k+l} = a_k a_l$ in order to be able to connect to the previous policy gradient literature; examples are the weights $a_k = \gamma^k$ for discounted reinforcement learning (where $\gamma$ is in $[0,1)$), where $a_{\Sigma} = 1/(1-\gamma)$; alternatively, they are set to $a_k = 1$ for the average reward case, where $a_{\Sigma} = H$. In these cases, we can rewrite a normalized expected return in the form
$$J(\boldsymbol{\theta}) = \int_{\mathbb{X}} d^{\pi}(\mathbf{x}) \int_{\mathbb{U}} \pi_{\boldsymbol{\theta}}(\mathbf{u} \mid \mathbf{x})\, r(\mathbf{x}, \mathbf{u})\, \mathrm{d}\mathbf{u}\, \mathrm{d}\mathbf{x}$$
as used in Sutton et al. (2000), where $d^{\pi}(\mathbf{x}) = a_{\Sigma}^{-1} \sum_{k=0}^{H} a_k\, p(\mathbf{x}_k = \mathbf{x})$ is the weighted state distribution.
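As a sketch of how this quantity can be estimated in practice, the snippet below computes a Monte-Carlo estimate of the normalized expected return from a set of sampled reward sequences using the discounted weights $a_k = \gamma^k$; normalizing each roll-out by its own sum of weights is an illustrative choice for finite, possibly unequal horizons.

```python
import numpy as np

def expected_return(reward_sequences, gamma=0.99):
    """Monte-Carlo estimate of J(theta) = a_Sigma^{-1} E{ sum_k a_k r_k } with a_k = gamma^k.

    reward_sequences: list of 1-D arrays, one array of rewards r_0..r_H per roll-out."""
    returns = []
    for r in reward_sequences:
        a = gamma ** np.arange(len(r))             # time-step-dependent weights a_k
        returns.append(np.sum(a * r) / np.sum(a))  # normalize by a_Sigma = sum_k a_k
    return np.mean(returns)

# Example with made-up reward sequences from three roll-outs of different lengths.
rng = np.random.default_rng(1)
rollouts = [rng.uniform(-1.0, 0.0, size=H) for H in (50, 60, 55)]
print(expected_return(rollouts))
```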
In general, we assume that for each considered policy $\pi_{\boldsymbol{\theta}}$, a state-value function $V^{\pi}(\mathbf{x}, k)$ and a state-action value function $Q^{\pi}(\mathbf{x}, \mathbf{u}, k)$ exist and are given by
$$V^{\pi}(\mathbf{x}, k) = E\Bigg\{\sum_{l=k}^{H} a_l r_l \,\Big|\, \mathbf{x}_k = \mathbf{x}\Bigg\}, \qquad Q^{\pi}(\mathbf{x}, \mathbf{u}, k) = E\Bigg\{\sum_{l=k}^{H} a_l r_l \,\Big|\, \mathbf{x}_k = \mathbf{x}, \mathbf{u}_k = \mathbf{u}\Bigg\}.$$
In the infinite horizon case, i.e., for $H \to \infty$, we write $V^{\pi}(\mathbf{x})$ and $Q^{\pi}(\mathbf{x}, \mathbf{u})$, as these functions have become time-invariant. Note that we can define the expected return in terms of the state-value function by
$$J(\boldsymbol{\theta}) = a_{\Sigma}^{-1} \int_{\mathbb{X}} p(\mathbf{x}_0)\, V^{\pi}(\mathbf{x}_0, 0)\, \mathrm{d}\mathbf{x}_0,$$
where $p(\mathbf{x}_0)$ is the probability of $\mathbf{x}_0$ being the start-state. Whenever we make practical use of the value function, we assume that we are given "good" basis functions $\boldsymbol{\phi}(\mathbf{x})$ so that the state-value function can be approximated with linear function approximation $V^{\pi}(\mathbf{x}) = \boldsymbol{\phi}(\mathbf{x})^{\top} \mathbf{v}$ with parameters $\mathbf{v}$ in an approximately unbiased way.
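The following sketch illustrates the linear value-function approximation assumption: given hypothetical basis functions $\boldsymbol{\phi}(\mathbf{x})$ (here simple polynomial features of a scalar state, an assumption for illustration) and Monte-Carlo returns observed at sampled states, the parameters $\mathbf{v}$ are obtained by least squares.

```python
import numpy as np

def features(x):
    """Assumed basis functions phi(x); here polynomial features of a scalar state."""
    return np.array([1.0, x, x ** 2])

def fit_value_function(states, mc_returns):
    """Least-squares fit of V(x) ~= phi(x)^T v to Monte-Carlo returns observed at the states."""
    Phi = np.array([features(x) for x in states])
    v, *_ = np.linalg.lstsq(Phi, np.asarray(mc_returns), rcond=None)
    return v

# Toy data: sampled states and (noisy) returns pretended to be measured from roll-outs.
rng = np.random.default_rng(2)
xs = rng.uniform(-1, 1, size=200)
returns = -(xs ** 2) + 0.05 * rng.standard_normal(200)
v = fit_value_function(xs, returns)
print("approximate value at x=0.5:", features(0.5) @ v)
```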
In this section, we first discuss how motor plans can be represented and then how we can bring these into the standard reinforcement learning framework. For this purpose, we consider two forms of motor plans, i.e., (1) spline-based trajectory plans and (2) nonlinear dynamic motor primitives introduced in Ijspeert et al. (2002). Spline-based trajectory planning is well known in the robotics literature, see e.g., Miyamoto et al. (1996) and Sciavicco and Siciliano (2007). A desired trajectory is represented as connected pieces of simple polynomials, e.g., for third-order splines, we have
$$q_{d,i}(t) = \theta_{0i} + \theta_{1i} t + \theta_{2i} t^{2} + \theta_{3i} t^{3} \quad \text{in } t \in [t_i, t_{i+1}] \qquad (8)$$
under the boundary conditions of $q_{d,i}(t_{i+1}) = q_{d,i+1}(t_{i+1})$ and $\dot{q}_{d,i}(t_{i+1}) = \dot{q}_{d,i+1}(t_{i+1})$. A given tracking controller, e.g., a PD control law or an inverse dynamics controller, ensures that the trajectory is realized accurately. Thus, a desired movement is parameterized by its spline nodes and the duration of each spline segment. These parameters can be learned by fitting a given trajectory with a spline approximation algorithm (Wada & Kawato, 1994), or by means of optimization or reinforcement learning (Miyamoto et al., 1996). We call such parameterized movement plans motor primitives.
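A minimal sketch of one such spline segment is given below: it computes the coefficients of a third-order polynomial between two spline nodes from the boundary conditions on position and velocity, and evaluates the desired position, velocity, and acceleration along it; the node values used are arbitrary examples, and a tracking controller would be responsible for realizing the resulting trajectory.

```python
import numpy as np

def cubic_segment(t0, t1, q0, q1, qd0, qd1):
    """Coefficients (theta_0..theta_3) of q_d(t) = theta_0 + theta_1 s + theta_2 s^2 + theta_3 s^3,
    with s = t - t0, matching position and velocity at both spline nodes."""
    T = t1 - t0
    A = np.array([[1, 0, 0,      0],        # q(t0)  = q0
                  [0, 1, 0,      0],        # dq(t0) = qd0
                  [1, T, T**2,   T**3],     # q(t1)  = q1
                  [0, 1, 2 * T, 3 * T**2]]) # dq(t1) = qd1
    return np.linalg.solve(A, np.array([q0, qd0, q1, qd1]))

def evaluate(theta, t0, t):
    """Desired position, velocity, and acceleration of the segment at time t."""
    s = t - t0
    q   = theta @ np.array([1, s, s**2, s**3])
    qd  = theta @ np.array([0, 1, 2 * s, 3 * s**2])
    qdd = theta @ np.array([0, 0, 2, 6 * s])
    return q, qd, qdd

# One segment from (t=0, q=0, qd=0) to (t=1, q=1, qd=0).
theta = cubic_segment(0.0, 1.0, 0.0, 1.0, 0.0, 0.0)
print(evaluate(theta, 0.0, 0.5))
```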
For nonlinear dynamic motor primitives, we use the approach developed in Ijspeert et al. (2002). These dynamic motor primitives can be seen as a type of central pattern generator which is particularly well suited for learning, as it is linear in the parameters and invariant under rescaling. In this approach, movement plans $(q_{d,k}, \dot{q}_{d,k})$ for each degree of freedom (DOF) $k$ of the robot are represented in terms of the time evolution of the nonlinear dynamical systems
$$\ddot{q}_{d,k} = h(q_{d,k}, z_k, g_k, \tau, \boldsymbol{\theta}_k), \qquad (9)$$
where $(q_{d,k}, \dot{q}_{d,k})$ denote the desired position and velocity of a joint, $z_k$ the internal state of the dynamic system, which evolves in accordance with a canonical system, $g_k$ the goal (or point attractor) state of each DOF, $\tau$ the movement duration shared by all DOFs, and $\boldsymbol{\theta}_k$ the open parameters of the function $h$. In contrast to splines, formulating movement plans as dynamic systems offers useful invariance properties of a movement plan under temporal and spatial scaling, as well as natural stability properties — see Ijspeert et al. (2002) for a discussion. Adjustment of the primitives using sensory input can be incorporated by modifying the internal state of the system, as shown in the context of drumming (Pongas, Billard, & Schaal, 2005) and biped locomotion (Nakanishi et al., 2004, Schaal et al., 2004). The equations used in order to create Eq. (9) are given in the Appendix. The original work in Ijspeert et al. (2002) demonstrated how the parameters $\boldsymbol{\theta}_k$ can be learned to match a template trajectory by means of supervised learning — this scenario is, for instance, useful as the first step of an imitation learning system. Here, we will add the ability of self-improvement of the movement primitives in Eq. (9) by means of reinforcement learning, which is the crucial second step in imitation learning.
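The exact equations behind Eq. (9) are given in the Appendix; the sketch below is only a simplified, discrete-time point-attractor primitive in the spirit of Ijspeert et al. (2002), with a first-order canonical system and Gaussian basis functions, written to illustrate that the open parameters $\boldsymbol{\theta}$ enter linearly through the forcing term. The gains, time step, and basis placement are illustrative assumptions.

```python
import numpy as np

def dmp_rollout(theta, g=1.0, y0=0.0, tau=1.0, dt=0.002,
                alpha_z=25.0, beta_z=25.0 / 4, alpha_x=8.0):
    """Euler-integrate a simplified point-attractor primitive.

    theta: weights of Gaussian basis functions; the forcing term is linear in theta."""
    n = len(theta)
    centers = np.exp(-alpha_x * np.linspace(0, 1, n))     # basis centers along the canonical state
    widths = 1.0 / (np.diff(centers, append=centers[-1]) ** 2 + 1e-6)
    y, z, x = y0, 0.0, 1.0                                 # desired position, scaled velocity, canonical state
    ys = []
    for _ in range(int(tau / dt)):
        psi = np.exp(-widths * (x - centers) ** 2)
        forcing = (psi @ theta) / (psi.sum() + 1e-10) * x  # vanishes as the canonical state decays
        zdot = (alpha_z * (beta_z * (g - y) - z) + forcing) / tau
        ydot = z / tau
        xdot = -alpha_x * x / tau                          # first-order canonical system
        z, y, x = z + dt * zdot, y + dt * ydot, x + dt * xdot
        ys.append(y)
    return np.array(ys)

traj = dmp_rollout(theta=np.zeros(10))  # with zero weights the system simply converges to the goal g
print(traj[-1])
```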
The systems in Eqs. (8) and (9) describe point-to-point movements, i.e., such tasks are rather well suited for the introduced episodic reinforcement learning methods. In both systems, we have access to at least the second derivatives in time, i.e., the desired accelerations, which are needed for model-based feedforward controllers. In order to make the reinforcement learning framework feasible for learning with motor primitives, we need to add exploration to the respective motor primitive framework, i.e., we need to add a small perturbation $\epsilon \sim N(0, \sigma^2)$ to the desired accelerations, such that the nominal target output $\bar{u} = \ddot{q}_d$ becomes the perturbed target output $u = \ddot{q}_d + \epsilon$. By doing so, we obtain a stochastic policy
$$\pi(u \mid \mathbf{x}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(u - \ddot{q}_d)^2}{2\sigma^2}\right).$$
This policy will be used throughout the paper. It is particularly practical as the exploration can be easily controlled through only one variable $\sigma$.
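A sketch of this exploration policy, together with the log-likelihood derivative that the likelihood-ratio gradient estimators of the next section require, is given below; the derivative of the desired acceleration with respect to the primitive parameters is assumed to be supplied by the motor primitive (e.g., from its basis-function activations) and is passed in as a plain array here.

```python
import numpy as np

def sample_action(qdd_desired, sigma):
    """Perturbed target output u = qdd_desired + eps with eps ~ N(0, sigma^2)."""
    return qdd_desired + sigma * np.random.default_rng().standard_normal()

def log_prob(u, qdd_desired, sigma):
    """log pi(u | x) for the Gaussian exploration policy around the desired acceleration."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (u - qdd_desired) ** 2 / (2 * sigma**2)

def dlogpi_dtheta(u, qdd_desired, dqdd_dtheta, sigma):
    """Chain rule: d log pi / d theta = (u - qdd_d) / sigma^2 * d qdd_d / d theta.
    dqdd_dtheta must come from the motor primitive itself."""
    return (u - qdd_desired) / sigma**2 * np.asarray(dqdd_dtheta)

u = sample_action(qdd_desired=0.3, sigma=0.1)
print(log_prob(u, 0.3, 0.1), dlogpi_dtheta(u, 0.3, dqdd_dtheta=[0.2, 0.5, 0.1], sigma=0.1))
```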
Policy gradient approaches for parameterized motor primitives
The general goal of policy optimization in reinforcement learning is to optimize the policy parameters so that the expected return is maximal. For motor primitive learning in robotics, we require that any change to the policy parameterization has to be smooth, as drastic changes can be hazardous for the robot and its environment. Also, it would render initializations of the policy based on domain knowledge or imitation learning useless, as these would otherwise vanish after a single update step.
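As a minimal sketch of the resulting update rule, assuming some routine that returns a gradient estimate of the expected return from roll-outs, the parameters are changed by a small step along that estimate so that the policy changes only smoothly:

```python
import numpy as np

def gradient_ascent(theta0, estimate_gradient, alpha=0.01, n_updates=100):
    """Generic policy-gradient ascent: theta_{h+1} = theta_h + alpha * grad J(theta_h).
    estimate_gradient(theta) must return a gradient estimate, e.g., obtained from roll-outs."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(n_updates):
        theta += alpha * estimate_gradient(theta)  # a small alpha keeps the policy change smooth
    return theta

# Illustrative use on a known objective J(theta) = -||theta - 1||^2, whose gradient is -2 (theta - 1).
print(gradient_ascent(np.zeros(3), lambda th: -2.0 * (th - 1.0)))
```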
‘Vanilla’ policy gradient approaches
Despite the fast asymptotic convergence speed of the gradient estimate, the variance of the likelihood-ratio gradient estimator can be problematic in theory as well as in practice. This can be illustrated straightforwardly with an example.
Example 1. When using a REINFORCE estimator with a baseline in a scenario where there is only a single reward of always the same magnitude, e.g., $r(\mathbf{x}_k, \mathbf{u}_k) = c$ for all $k$, then the variance of the gradient estimate will grow at least cubically with the length of the planning horizon $H$.
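The following toy experiment, in the spirit of this example, uses a one-parameter Gaussian policy, a constant per-step reward, and a fixed constant baseline (all illustrative assumptions) and prints the empirical variance of the episodic REINFORCE gradient estimate for increasing horizons; with a fixed baseline the variance indeed grows roughly cubically in the horizon.

```python
import numpy as np

def reinforce_variance(H, c=1.0, baseline=0.0, sigma=0.5, n_rollouts=2000, seed=0):
    """Empirical variance of the episodic REINFORCE gradient estimate
    g = (sum_k d log pi(u_k) / d mu) * (sum_k r_k - baseline)
    for a one-parameter Gaussian policy u_k ~ N(mu, sigma^2) and a constant per-step reward c."""
    rng = np.random.default_rng(seed)
    eps = sigma * rng.standard_normal((n_rollouts, H))  # u_k - mu for every step of every roll-out
    dlogpi = eps.sum(axis=1) / sigma**2                 # sum_k d log pi / d mu per roll-out
    returns = c * H                                     # the same reward c at every time step
    g = dlogpi * (returns - baseline)
    return g.var()

for H in (10, 20, 40, 80):
    print(H, reinforce_variance(H))  # grows roughly like H^3 for a fixed baseline
```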
Natural Actor-Critic
Despite all the advances in the variance reduction of policy gradient methods, partially summarized above, these methods still tend to perform surprisingly poorly. Even when applied to simple examples with rather few states, where the gradient can be determined very accurately, they turn out to be quite inefficient — thus, the underlying reason cannot solely be the variance in the gradient estimate but rather must be caused by the large plateaus in the expected return landscape where the gradients are small and often do not point directly towards the optimal solution.
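For orientation, the sketch below shows the core preconditioning step of natural-gradient methods, $\tilde{\nabla}_{\boldsymbol{\theta}} J = \mathbf{F}^{-1} \nabla_{\boldsymbol{\theta}} J$, with the Fisher information matrix estimated from per-roll-out log-policy gradients; this is a generic illustration under these assumptions, not the episodic Natural Actor-Critic algorithm derived in the paper.

```python
import numpy as np

def natural_gradient(dlogpi_per_rollout, returns, reg=1e-6):
    """Precondition the likelihood-ratio gradient with the inverse Fisher information matrix.

    dlogpi_per_rollout: (m, K) array, sum_k d log pi(u_k|x_k)/d theta for each of m roll-outs.
    returns: (m,) array of (baseline-subtracted) returns."""
    G = np.asarray(dlogpi_per_rollout)
    vanilla = (G * np.asarray(returns)[:, None]).mean(axis=0)  # likelihood-ratio gradient estimate
    fisher = (G[:, :, None] * G[:, None, :]).mean(axis=0)      # F ~= E[ g g^T ] over roll-outs
    fisher += reg * np.eye(G.shape[1])                         # regularize for invertibility
    return np.linalg.solve(fisher, vanilla)                    # F^{-1} * grad J

# Toy usage with random log-gradient sums and returns.
rng = np.random.default_rng(3)
print(natural_gradient(rng.standard_normal((100, 4)), rng.standard_normal(100)))
```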
Empirical evaluations
In the previous section, we outlined five model-free policy gradient algorithms. From our assessment, these are among the most relevant for learning motor primitives in robotics, as they can be applied without requiring additional function approximation methods and as they are suitable for episodic settings. These algorithms were (i) finite-difference gradient estimators, the vanilla policy gradient estimators (ii) with a constant baseline and (iii) with a time-variant baseline, as well as the episodic Natural Actor-Critic in two variants, (iv) and (v).
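As an illustration of the first of these, a finite-difference gradient estimator perturbs the policy parameters, measures the resulting changes in the roll-out return, and regresses the gradient by least squares; the objective, perturbation scale, and number of perturbations below are illustrative assumptions.

```python
import numpy as np

def finite_difference_gradient(J_hat, theta, n_perturbations=50, delta=0.05, seed=0):
    """Finite-difference policy gradient: regress return differences on small random
    parameter perturbations (least squares over the perturbations).

    J_hat(theta) must return a (possibly noisy) roll-out estimate of the expected return."""
    rng = np.random.default_rng(seed)
    J_ref = J_hat(theta)
    dTheta = delta * rng.standard_normal((n_perturbations, len(theta)))
    dJ = np.array([J_hat(theta + d) - J_ref for d in dTheta])
    grad, *_ = np.linalg.lstsq(dTheta, dJ, rcond=None)
    return grad

# Illustrative use on a known objective J(theta) = -||theta||^2 with additive noise.
rng = np.random.default_rng(4)
J = lambda th: -np.sum(th ** 2) + 0.01 * rng.standard_normal()
print(finite_difference_gradient(J, np.array([0.5, -0.3, 0.2])))  # should be near -2 * theta
```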
Conclusion & discussion
We have presented an extensive survey of policy gradient methods. While some developments had to be omitted as they are only applicable to very low-dimensional state-spaces, this paper largely summarized the state of the art in policy gradient methods as applicable in robotics with high degree-of-freedom movement systems. All three major ways of estimating first-order gradients, i.e., finite-difference gradients, vanilla policy gradients and natural policy gradients, are discussed in this paper.
References (76)

- Benbrahim, H., & Franklin, J. A. (1997). Biped dynamic walking using reinforcement learning. Robotics and Autonomous Systems.
- Flash, T., & Hochner, B. (2005). Motor primitives in vertebrates and invertebrates. Current Opinion in Neurobiology.
- Gullapalli, V. (1990). A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks.
- Nakanishi, J., et al. (2004). Learning from demonstration and adaptation of biped locomotion. Robotics and Autonomous Systems.
- Aberdeen, D. (2006). POMDPs and policy gradients. Presentation at the Machine Learning Summer School...
- Stochastic optimization. Engineering Cybernetics (1968).
- Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation.
- Atkeson, C. G. Using local trajectory optimizers to speed up global optimization in dynamic programming. In Advances in Neural Information Processing Systems.
- Bagnell, J., & Schneider, J. (2003). Covariant policy search. In Proceedings of the international joint conference on...
- Baird, L. (1993). Advantage updating. Technical Report WL-TR-93-1146. Wright Laboratory, Wright–Patterson Air Force...