Neurocomputing

Volume 71, Issues 7–9, March 2008, Pages 1180–1190

Natural Actor-Critic

https://doi.org/10.1016/j.neucom.2007.11.026

Abstract

In this paper, we suggest a novel reinforcement learning architecture, the Natural Actor-Critic. The actor updates are achieved using stochastic policy gradients employing Amari's natural gradient approach, while the critic obtains both the natural policy gradient and additional parameters of a value function simultaneously by linear regression. We show that actor improvements with natural policy gradients are particularly appealing as these are independent of the coordinate frame of the chosen policy representation, and can be estimated more efficiently than regular policy gradients. The critic makes use of a special basis function parameterization motivated by the policy-gradient compatible function approximation. We show that several well-known reinforcement learning methods such as the original Actor-Critic and Bradtke's Linear Quadratic Q-Learning are in fact Natural Actor-Critic algorithms. Empirical evaluations illustrate the effectiveness of our techniques in comparison to previous methods, and also demonstrate their applicability for learning control on an anthropomorphic robot arm.
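As a compact summary of the architecture described in the abstract, the two coupled updates can be written as follows; the notation (policy parameters θ, Fisher information matrix F_θ, compatible critic weights w, learning rate α) is standard and is made precise in the body of the paper.

```latex
% Critic: compatible function approximation of the advantage,
% with the policy's score function as basis features
A^{\pi}(x,u) \;\approx\; f_w(x,u) \;=\; \nabla_\theta \log \pi_\theta(u \mid x)^{\top} w

% Actor: natural-gradient ascent; with the compatible critic above,
% the natural policy gradient coincides with the critic weights w
\theta_{k+1} \;=\; \theta_k + \alpha\, F_{\theta}^{-1} \nabla_\theta J(\theta_k) \;=\; \theta_k + \alpha\, w
```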

Introduction

Reinforcement learning algorithms based on value function approximation have been highly successful with discrete lookup-table parameterization. However, when applied with continuous function approximation, many of these algorithms failed to generalize, and few convergence guarantees could be obtained [24]. The reason for this problem can largely be traced back to the greedy or ε-greedy policy updates of most techniques, as these do not ensure a policy improvement when applied with an approximate value function [8]. During a greedy update, small errors in the value function can cause large changes in the policy, which in turn can cause large changes in the value function. This process, when applied repeatedly, can result in oscillations or divergence of the algorithms. Even in simple toy systems, such unfortunate behavior can be found in many well-known greedy reinforcement learning algorithms [6], [8].

As an alternative to greedy reinforcement learning, policy-gradient methods have been suggested. Policy gradients have rather strong convergence guarantees, even when used in conjunction with approximate value functions, and recent results have created a theoretically solid framework for policy-gradient estimation from sampled data [25], [15]. However, even when applied to simple examples with rather few states, policy-gradient methods often turn out to be quite inefficient [14], partly because of the large plateaus in the expected-return landscape, where the gradients are small and often do not point directly towards the optimal solution. A simple example that demonstrates this behavior is given in Fig. 1.
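The "policy-gradient estimation from sampled data" referred to above is most commonly stated in likelihood-ratio form; a generic version (not specific to this paper's derivation) is, for a sampled trajectory τ with discounted return R(τ) and a variance-reducing baseline b:

```latex
\nabla_\theta J(\theta)
  \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\,\Big(\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t \mid x_t)\Big)\,\big(R(\tau) - b\big)\right],
\qquad R(\tau) \;=\; \sum_{t=0}^{T-1} \gamma^{t} r_t .
```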

As in supervised learning, steepest ascent with respect to the Fisher information metric [3], called the 'natural' policy gradient, turns out to be significantly more efficient than the ordinary gradient. Such an approach was first suggested for reinforcement learning as the 'average natural policy gradient' in [14], and subsequently shown in preliminary work to be the true natural policy gradient [21], [4]. In this paper, we take this line of reasoning one step further in Section 2.2 by introducing the 'Natural Actor-Critic' (NAC), which inherits the convergence guarantees from gradient methods. Furthermore, in Section 3, we show that several successful previous reinforcement learning methods can be seen as special cases of this more general architecture. The paper concludes with empirical evaluations in Section 4 that demonstrate the effectiveness of the suggested methods.
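To make the distinction concrete, the sketch below contrasts a vanilla likelihood-ratio gradient estimate with its natural-gradient counterpart obtained by preconditioning with an empirical Fisher matrix. It is a minimal illustration under the assumption of a Gaussian policy with mean W x; it is not the estimator derived in this paper, which instead obtains the natural gradient from the critic's regression.

```python
import numpy as np

def score_gaussian_policy(W, x, u, sigma=1.0):
    """Score d/dW log pi(u | x) for a Gaussian policy u ~ N(W x, sigma^2 I), flattened to a vector."""
    return np.outer((u - W @ x) / sigma**2, x).ravel()

def vanilla_and_natural_gradient(episodes, W, sigma=1.0, reg=1e-6):
    """episodes: list of (states, actions, rewards) arrays, one triple per sampled episode."""
    scores, returns = [], []
    for xs, us, rs in episodes:
        # Trajectory score: sum of per-step score vectors along the episode
        psi = sum(score_gaussian_policy(W, x, u, sigma) for x, u in zip(xs, us))
        scores.append(psi)
        returns.append(np.sum(rs))
    Psi = np.stack(scores)                  # (num_episodes, num_params)
    R = np.asarray(returns)
    b = R.mean()                            # simple variance-reducing baseline
    g = Psi.T @ (R - b) / len(R)            # likelihood-ratio ("vanilla") gradient estimate
    F = Psi.T @ Psi / len(R)                # empirical Fisher matrix (outer products of scores)
    nat_g = np.linalg.solve(F + reg * np.eye(F.shape[0]), g)   # natural gradient: F^{-1} g
    return g, nat_g
```

In flat regions of the expected-return landscape the raw gradient g is small, whereas rescaling by the Fisher metric yields a direction that is invariant to how the policy is parameterized.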

Section snippets

Markov decision process notation and assumptions

For this paper, we assume that the underlying control problem is a Markov decision process (MDP) in discrete time with a continuous state set X = R^n and a continuous action set U = R^m [8]. The assumption of an MDP comes with the limitation that accurate state information and a Markovian environment are assumed. However, similar to [1], the results presented in this paper might extend to problems with partial state information.

The system is at an initial state x_0 ∈ X at time t = 0, drawn from the start-state distribution p(x_0).
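In standard notation, the setting sketched in this snippet is: states x_t ∈ X = R^n, actions u_t ∈ U = R^m, and a parameterized stochastic policy π_θ. The discounted, normalized return below is one common convention; the exact formulation used by the paper is given in the full text.

```latex
x_0 \sim p(x_0), \qquad
u_t \sim \pi_\theta(u_t \mid x_t), \qquad
x_{t+1} \sim p(x_{t+1} \mid x_t, u_t), \qquad
r_t = r(x_t, u_t),

J(\theta) \;=\; \mathbb{E}\Big[\,(1-\gamma)\sum_{t=0}^{\infty} \gamma^{t} r_t\,\Big], \qquad \gamma \in [0,1).
```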

Properties of NAC

In this section, we will emphasize certain properties of the NAC. In particular, we want to give a simple proof of covariance of the natural policy gradient, and discuss Kakade's observation [14] that in his experimental settings the natural policy gradient was non-covariant. Furthermore, we will discuss another surprising aspect of the NAC, namely its relation to previous algorithms. We briefly demonstrate that established algorithms like the classic Actor-Critic [24] and Bradtke's Q-Learning [10] can be seen as special cases of the NAC.
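The covariance property referred to here can be summarized in a few lines: under a smooth reparameterization θ = φ(h) with invertible Jacobian J_h = ∂φ/∂h, the gradient and the Fisher information transform in a way that leaves the natural-gradient update unchanged. This is the standard argument; the paper's own proof appears in this section of the full text.

```latex
\nabla_h J = J_h^{\top} \nabla_\theta J, \qquad F_h = J_h^{\top} F_\theta\, J_h
\;\;\Longrightarrow\;\;
F_h^{-1} \nabla_h J
  = J_h^{-1} F_\theta^{-1} J_h^{-\top} J_h^{\top} \nabla_\theta J
  = J_h^{-1} F_\theta^{-1} \nabla_\theta J .
```

The induced parameter change δθ ≈ J_h δh = F_θ^{-1} ∇_θ J is therefore identical in either parameterization, whereas the ordinary gradient transforms only by J_h^T and lacks the compensating metric, which is why it is not covariant.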

Evaluations and applications

In this section, we present several evaluations comparing the episodic NAC architecture with previous algorithms. We compare them on optimization tasks such as Cart-Pole Balancing and on simple motor primitive evaluations, where only the episodic NAC is applied. Furthermore, we apply the combination of episodic NAC and the motor primitive framework to a task on a real robot, i.e., 'hitting a T-ball with a baseball bat'.

Conclusion

In this paper, we have summarized novel developments in policy-gradient reinforcement learning and, based on these, we have designed a novel reinforcement learning architecture, the NAC algorithm. This algorithm comes in (at least) two forms, i.e., the LSTD-Q(λ) form, which depends on sufficiently rich basis functions, and the episodic form, which only requires a constant as an additional basis function. We compare both algorithms and apply the latter on several evaluative benchmarks as well as on an anthropomorphic robot arm.
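To illustrate the episodic form mentioned above, the sketch below shows one way to recover the natural gradient by linear regression over whole episodes, with a single constant basis function appended to the summed, discounted score features. This is a minimal reading of the episodic scheme under stated assumptions; the function and parameter names are illustrative, not the paper's exact implementation.

```python
import numpy as np

def episodic_nac_gradient(episodes, score_fn, gamma=0.99, reg=1e-6):
    """Estimate the natural policy gradient from full episodes by linear regression.

    episodes : list of (states, actions, rewards), one triple per episode
    score_fn : function (x, u) -> d/dtheta log pi_theta(u | x), returned as a flat vector
    Returns the natural-gradient estimate w and the fitted constant term J0.
    """
    features, targets = [], []
    for xs, us, rs in episodes:
        # Discounted sum of score vectors along the episode
        psi = sum(gamma**t * score_fn(x, u) for t, (x, u) in enumerate(zip(xs, us)))
        R = sum(gamma**t * r for t, r in enumerate(rs))   # discounted episode return
        features.append(np.append(psi, 1.0))              # constant as the extra basis function
        targets.append(R)
    Phi = np.stack(features)
    R = np.asarray(targets)
    # Least squares: R_e ~ psi_e^T w + J0; the weight vector w estimates the natural gradient
    sol = np.linalg.solve(Phi.T @ Phi + reg * np.eye(Phi.shape[1]), Phi.T @ R)
    return sol[:-1], sol[-1]

# Usage sketch (hypothetical actor update):
#   w, J0 = episodic_nac_gradient(episodes, score_fn)
#   theta += alpha * w
```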

References (28)

  • [1] D. Aberdeen, Policy-gradient algorithms for partially observable Markov decision processes, Ph.D. Thesis, Australian...
  • [2] D. Aberdeen, POMDPs and policy gradients
  • [3] S. Amari, Natural gradient works efficiently in learning, Neural Comput. (1998)
  • [4] J. Bagnell et al., Covariant policy search
  • [5] L.C. Baird, Advantage updating, Technical Report WL-TR-93-1146, Wright Lab.,...
  • [6] L.C. Baird, A.W. Moore, Gradient descent for general reinforcement learning, in: Advances in Neural Information...
  • [7] P. Bartlett, An introduction to reinforcement learning theory: value function methods, in: Machine Learning Summer...
  • [8] D.P. Bertsekas et al., Neuro-Dynamic Programming (1996)
  • [9] J. Boyan, Least-squares temporal difference learning, in: Machine Learning: Proceedings of the Sixteenth International...
  • [10] S. Bradtke et al., Adaptive Linear Quadratic Control Using Policy Iteration (1994)
  • [11] O. Buffet et al., Shaping multi-agent systems with gradient reinforcement learning, Autonomous Agents Multi-Agent Syst. (October 2007)
  • [12] F. Guenter, M. Hersch, S. Calinon, A. Billard, Reinforcement learning for imitating constrained reaching movements, RSJ...
  • [13] A. Ijspeert, J. Nakanishi, S. Schaal, Learning rhythmic movements by demonstration using nonlinear oscillators, in:...
  • [14] S.A. Kakade, Natural policy gradient, in: Advances in Neural Information Processing Systems, vol. 14,...

Jan Peters heads the Robot Learning Lab (RoLL) at the Max-Planck Institute for Biological Cybernetics (MPI) while being an invited researcher at the Computational Learning and Motor Control Lab at the University of Southern California (USC). Before joining MPI, he graduated from University of Southern California with a Ph.D. in Computer Science in March 2007. Jan Peters studied Electrical Engineering, Computer Science and Mechanical Engineering. He holds two German M.Sc. degrees in Informatics and in Electrical Engineering (Dipl-Informatiker from Hagen University and Diplom-Ingenieur from Munich University of Technology/TUM) and two M.Sc. degrees in Computer Science and Mechanical Engineering from University of Southern California (USC). During his graduate studies, Jan Peters has been a visiting researcher at the Department of Robotics at the German Aerospace Research Center (DLR) in Oberpfaffenhofen, Germany, at Siemens Advanced Engineering (SAE) in Singapore, at the National University of Singapore (NUS), and at the Department of Humanoid Robotics and Computational Neuroscience at the Advanced Telecommunication Research (ATR) Center in Kyoto, Japan. His research interests include robotics, nonlinear control, machine learning, and motor skill learning.

Stefan Schaal is an Associate Professor at the Department of Computer Science and the Neuroscience Program at the University of Southern California, and an Invited Researcher at the ATR Human Information Sciences Laboratory in Japan, where he held an appointment as Head of the Computational Learning Group during an international ERATO project, the Kawato Dynamic Brain Project (ERATO/JST). Before joining USC, Dr. Schaal was a postdoctoral fellow at the Department of Brain and Cognitive Sciences and the Artificial Intelligence Laboratory at MIT, an Invited Researcher at the ATR Human Information Processing Research Laboratories in Japan, and an Adjunct Assistant Professor at the Georgia Institute of Technology and at the Department of Kinesiology of the Pennsylvania State University. Dr. Schaal's research interests include topics of statistical and machine learning, neural networks, computational neuroscience, functional brain imaging, nonlinear dynamics, nonlinear control theory, and biomimetic robotics. He applies his research to problems of artificial and biological motor control and motor learning, focusing on both theoretical investigations and experiments with human subjects and anthropomorphic robot equipment.
