nach oben

1992 | Buch

Kapitel lesen Erstes Kapitel lesen

Reinforcement Learning

herausgegeben von: Richard S. Sutton

Verlag: Springer US

Buchreihe : The International Series in Engineering and Computer Science

Enthalten in: Professional Book Archive

Einloggen, um Zugang zu erhalten

Über dieses Buch

Reinforcement learning is the learning of a mapping from situations to actions so as to maximize a scalar reward or reinforcement signal. The learner is not told which action to take, as in most forms of machine learning, but instead must discover which actions yield the highest reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward, but also the next situation, and through that all subsequent rewards. These two characteristics -- trial-and-error search and delayed reward -- are the most important distinguishing features of reinforcement learning.
Reinforcement learning is both a new and a very old topic in AI. The term appears to have been coined by Minsk (1961), and independently in control theory by Walz and Fu (1965). The earliest machine learning research now viewed as directly relevant was Samuel's (1959) checker player, which used temporal-difference learning to manage delayed reward much as it is used today. Of course learning and reinforcement have been studied in psychology for almost a century, and that work has had a very strong impact on the AI/engineering work. One could in fact consider all of reinforcement learning to be simply the reverse engineering of certain psychological learning processes (e.g. operant conditioning and secondary reinforcement).
Reinforcement Learning is an edited volume of original research, comprising seven invited contributions by leading researchers.

Inhaltsverzeichnis

Frontmatter

Introduction: The Challenge of Reinforcement Learning

Abstract

Richard S. Sutton

Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning

Abstract

This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. These algorithms, called REINFORCE algorithms, are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates or even storing information from which such estimates could be computed. Specific examples of such algorithms are presented, some of which bear a close relationship to certain existing algorithms while others are novel but potentially interesting in their own right. Also given are results that show how such algorithms can be naturally integrated with backpropagation. We close with a brief discussion of a number of additional issues surrounding the use of such algorithms, including what is known about their limiting behaviors as well as further considerations that might be used to help develop similar but potentially more powerful reinforcement learning algorithms.

Ronald J. Williams

Practical Issues in Temporal Difference Learning

Abstract

This paper examines whether temporal difference methods for training connectionist networks, such as Sutton’s TD(λ) algorithm, can be successfully applied to complex real-world problems. A number of important practical issues are identified and discussed from a general theoretical perspective. These practical issues are then examined in the context of a case study in which TD(λ) is applied to learning the game of backgammon from the outcome of self-play. This is apparently the first application of this algorithm to a complex non-trivial task. It is found that, with zero knowledge built in, the network is able to learn from scratch to play the entire game at a fairly strong intermediate level of performance, which is clearly better than conventional commercial programs, and which in fact surpasses comparable networks trained on a massive human expert data set. This indicates that TD learning may work better in practice than one would expect based on current theory, and it suggests that further analysis of TD methods, as well as applications in other complex domains, may be worth investigating.

Gerald Tesauro

Technical Note

Q-Learning

Abstract

Q-learning (Watkins, 1989) is a simple way for agents to learn how to act optimally in controlled Markovian domains. It amounts to an incremental method for dynamic programming which imposes limited computational demands. It works by successively improving its evaluations of the quality of particular actions at particular states.

This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989). We show that Q,-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely. We also sketch extensions to the cases of non-discounted, but absorbing, Markov environments, and where many Q values can be changed each iteration, rather than just one.

Christopher J. C. H. Watkins, Peter Dayan

Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching

Abstract

To date, reinforcement learning has mostly been studied solving simple learning tasks. Reinforcement learning methods that have been studied so far typically converge slowly. The purpose of this work is thus twofold: 1) to investigate the utility of reinforcement learning in solving much more complicated learning tasks than previously studied, and 2) to investigate methods that will speed up reinforcement learning.

This paper compares eight reinforcement learning frameworks:adaptive heuristic critic (AHC)learning due to Sutton,Q-learningdue to Watkins, and three extensions to both basic methods for speeding up learning. The three extensions are experience replay, learning action models for planning, and teaching. The frameworks were investigated using connectionism as an approach to generalization. To evaluate the performance of different frameworks, a dynamic environment was used as a testbed. The environment is moderately complex and nondeterministic. This paper describes these frameworks and algorithms in detail and presents empirical evaluation of the frameworks.

Long-Ji Lin

Transfer of Learning by Composing Solutions of Elemental Sequential Tasks

Abstract

Although building sophisticated learning agents that operate in complex environments will require learning to perform multiple tasks, most applications of reinforcement learning have focused on single tasks. In this paper I consider a class of sequential decision tasks (SDTs), called composite sequential decision tasks, formed by temporally concatenating a number of elemental sequential decision tasks. Elemental SIYI’s cannot be decomposed into simpler SDTs. I consider a learning agent that has to learn to solve a set of elemental and composite SDTs. I assume that the structure of the composite tasks is unknown to the learning agent. The straightforward application of reinforcement learning to multiple tasks requires learning the tasks separately, which can waste computational resources, both memory and time. I present a new learning algorithm and a modular architecture that learns the decomposition of composite SDTs, and achieves transfer of learning by sharing the solutions of elemental SDTs across multiple composite SDTs. The solution of a composite SDT is constructed by computationally inexpensive modifications of the solutions of its constituent elemental SDTs. I provide a proof of one aspect of the learning algorithm.

Satinder Pal Singh

The Convergence of TD(λ) for General λ

Abstract

The method of temporal differences (TD) is one way of making consistent predictions about the future. This paper uses some analysis of Watkins (1989) to extend a convergence theorem due to Sutton (1988) from the case which only uses information from adjacent time steps to that involving information from arbitrary ones.

It also considers how this version of TD behaves in the face of linearly dependent representations for states—demonstrating that it still converges, but to a different answer from the least mean squares algorithm. Finally it adapts Watkins’ theorem that Q-learning, his closely related prediction and action learning method, converges with probability one, to demonstrate this strong form of convergence for a slightly modified version of TD.

Peter Dayan

A Reinforcement Connectionist Approach to Robot Path Finding in Non-Maze-Like Environments

Abstract

This paper presents a reinforcement connectionist system which finds and learns the suitable situation-action rules so as to generate feasible paths for a point robot in a 2D environment with circular obstacles. The basic reinforcement algorithm is extended with a strategy for discovering stable solution paths. Equipped with this strategy and a powerful codification scheme, the path-finder (i) learns quickly, (ii) deals with continuous-valued inputs and outputs, (iii) exhibits good noise-tolerance and generalization capabilities, (iv) copes with dynamic environments, and (v) solves an instance of the path finding problem with strong performance demands.

José R. Del Millán, Carme Torras

Backmatter

Titel: Reinforcement Learning
herausgegeben von: Richard S. Sutton
Verlag: Springer US
Electronic ISBN: 978-1-4615-3618-5
Print ISBN: 978-1-4613-6608-9
DOI: https://doi.org/10.1007/978-1-4615-3618-5

Springer Professional

Über dieses Buch

Inhaltsverzeichnis

Frontmatter

Introduction: The Challenge of Reinforcement Learning

Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning

Practical Issues in Temporal Difference Learning

Technical Note

Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching

Transfer of Learning by Composing Solutions of Elemental Sequential Tasks

The Convergence of TD(λ) for General λ

A Reinforcement Connectionist Approach to Robot Path Finding in Non-Maze-Like Environments

Backmatter