Apprenticeship learning via inverse reinforcement learning

ABSTRACT
We consider learning in a Markov decision process where we are not explicitly given a reward function, but where instead we can observe an expert demonstrating the task that we want to learn to perform. This setting is useful in applications (such as the task of driving) where it may be difficult to write down an explicit reward function specifying exactly how different desiderata should be traded off. We think of the expert as trying to maximize a reward function that is expressible as a linear combination of known features, and give an algorithm for learning the task demonstrated by the expert. Our algorithm is based on using "inverse reinforcement learning" to try to recover the unknown reward function. We show that our algorithm terminates in a small number of iterations, and that even though we may never recover the expert's reward function, the policy output by the algorithm will attain performance close to that of the expert, where performance is measured with respect to the expert's unknown reward function.
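The abstract only summarizes the method; the full paper casts it as matching "feature expectations" μ(π) = E[Σ_t γ^t φ(s_t)], with the reward modeled as R(s) = w·φ(s) for unknown weights w. Below is a minimal, illustrative Python/NumPy sketch of the simpler "projection" variant of the algorithm. Every name here (feature_expectations, solve_mdp, the parameters) is our own choice for exposition, and solve_mdp stands in for whatever RL solver the reader supplies; this is a sketch of the idea, not the authors' code.

```python
import numpy as np

def feature_expectations(P_pi, phi, s0, gamma, horizon=1000):
    """Estimate mu(pi) = E[sum_t gamma^t phi(s_t) | s_0] by rolling the state
    distribution forward. P_pi: (S, S) transition matrix induced by the policy;
    phi: (S, k) feature map; s0: start-state index."""
    d = np.zeros(P_pi.shape[0])
    d[s0] = 1.0                              # point mass on the start state
    mu, g = np.zeros(phi.shape[1]), 1.0
    for _ in range(horizon):
        mu += g * (d @ phi)                  # accumulate discounted expected features
        d, g = d @ P_pi, g * gamma
    return mu

def apprenticeship_learning(mu_expert, solve_mdp, k, eps=1e-4, max_iter=100):
    """Projection variant. `solve_mdp(w)` is an assumed, caller-supplied
    subroutine returning the feature expectations mu(pi) of a policy that is
    optimal for the reward R(s) = w . phi(s); k is the number of features."""
    mu_bar = solve_mdp(np.random.randn(k))   # mu of an arbitrary first policy
    for _ in range(max_iter):
        w = mu_expert - mu_bar               # candidate reward weights
        t = np.linalg.norm(w)                # margin: remaining distance to mu_E
        if t <= eps:                         # expert matched to within eps
            break
        mu = solve_mdp(w)                    # RL step under the current reward guess
        d = mu - mu_bar
        if d @ d == 0.0:                     # degenerate step: no progress possible
            break
        mu_bar = mu_bar + (d @ w) / (d @ d) * d   # project mu_E onto segment [mu_bar, mu]
    return w, t
```

Each iteration moves μ̄ closer to μ_E, so the margin t decreases monotonically; once t ≤ ε, a policy (in the paper, a convex mixture of the iterates) whose feature expectations lie within ε of μ_E is within ε of the expert's value under every reward with ||w||₂ ≤ 1, which is how the performance guarantee stated in the abstract can hold even if the true reward is never recovered.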