Neurocomputing

Volume 74, Issue 8, 15 March 2011, Pages 1251-1259

Robust high performance reinforcement learning through weighted k-nearest neighbors

https://doi.org/10.1016/j.neucom.2010.07.027

Abstract

The aim of this paper is to present a series of robust, high performance (award winning) implementations of reinforcement learning algorithms based on temporal-difference learning and weighted k-nearest neighbors for linear function approximation. These algorithms, named kNN-TD(λ) methods, were rigorously tested at the Second and Third Annual Reinforcement Learning Competitions (RLC2008 and RLC2009), held in Helsinki and Montreal respectively, where the kNN-TD(λ) method (JAMH team) won in the PolyAthlon 2008 domain, obtained second place in 2009 and also second place in the Mountain-Car 2008 domain, showing that it is one of the state-of-the-art general purpose reinforcement learning implementations. These algorithms are able to learn quickly, to generalize properly over continuous state spaces and to be robust to a high degree of environmental noise. Furthermore, we describe a derivation of the kNN-TD(λ) algorithm for problems where the use of continuous actions has clear advantages over the use of fine-grained discrete actions: the Exa reinforcement learning algorithm.

Introduction

Reinforcement learning (RL) [11], [16] is a paradigm of machine learning (ML) in which rewards and punishments guide the learning process. In RL there is an Agent (learner) which acts autonomously and receives a scalar reward signal that is used to evaluate the consequences of its actions. The framework of RL is designed to guide the agent in maximizing the expected future cumulative (or average) reward received from the environment.

Eq. (1) is the classical temporal-difference (TD) learning rule:

$$Q(s,a)_{t+1} = Q(s,a)_{t} + \alpha\left[x_{t} - Q(s,a)_{t}\right], \qquad (1)$$

where

$$x_{t} = r_{t+1} + \gamma \max_{a} Q(s_{t+1},a)_{t} \quad \text{(Q-Learning [19])},$$
$$x_{t} = r_{t+1} + \gamma\, Q(s_{t+1},a_{t+1})_{t} \quad \text{(SARSA [16])}.$$
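
For concreteness, a minimal tabular sketch of this update rule is shown below; the helper names td_target and td_update are illustrative and not taken from the paper's code.

```python
import numpy as np

def td_target(r, q_next, gamma, a_next=None):
    """TD target x_t of Eq. (1): greedy next value for Q-Learning,
    value of the action actually taken next for SARSA."""
    if a_next is None:
        return r + gamma * np.max(q_next)      # Q-Learning [19]
    return r + gamma * q_next[a_next]          # SARSA [16]

def td_update(q_sa, x_t, alpha):
    """Eq. (1): Q(s,a) <- Q(s,a) + alpha * (x_t - Q(s,a))."""
    return q_sa + alpha * (x_t - q_sa)
```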

This rule (1) comprises three strong ideas which are central to RL: (i) the notion that knowledge is, in essence, made of expectations (knowledge is prediction), (ii) the temporal-difference notion [17] which is a general method of prediction very well suited to incremental learning from immediate experience (the steps in a sequence should be evaluated and adjusted according to their immediate or near immediate successors, rather than according to the final outcome [17]) and (iii) the reward hypothesis (a.k.a. the reinforcement learning hypothesis) that states that “all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal called reward” [18].

In this paper we present a family of reinforcement learning algorithms. These algorithms are developed using the classical formulation of TD-learning and a k-nearest neighbors scheme for the perception, action selection, expectations memory and learning processes.

The perceptual process of the proposed methods is defined within a kind of k-nearest neighbors approach: it produces a probabilistic representation of the input signal, constructing robust state descriptions from a collection (knn) of receptive field units and a probability distribution vector p(knn) over that collection. The action selection mechanism can be understood, in k-NN terminology, as the Agent selecting the most "frequently" recommended action in the neighborhood of the observed state s or, in Bayesian terms, the action for which there is most evidence of a high future reward. Finally, the learning mechanism can be understood as a kind of temporal-difference back-propagation which assigns the right TD error to each neighbor according to its influence on the decision making and action selection procedures, that is, its probability of being the nearest neighbor of the observed state.
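
The following sketch illustrates these three mechanisms in tabular/numpy form. The inverse-squared-distance weighting and the function names are assumptions made for illustration, not necessarily the exact weighting used by the authors.

```python
import numpy as np

def knn_state(x, centers, k):
    """Perception: the k nearest receptive-field centers to observation x
    and a probability vector p over them (closer neighbors weigh more)."""
    d = np.linalg.norm(centers - x, axis=1)
    knn = np.argsort(d)[:k]
    w = 1.0 / (1.0 + d[knn] ** 2)      # assumed weighting scheme
    return knn, w / w.sum()

def select_action(Q, knn, p):
    """Action selection: the action with the most probability-weighted
    evidence of a high future reward in the neighborhood of s."""
    return int(np.argmax(p @ Q[knn]))  # Q has shape (n_centers, n_actions)

def td_backprop(Q, knn, p, a, td_error, alpha):
    """Learning: assign the TD error to each neighbor in proportion to its
    probability of being the nearest neighbor of the observed state."""
    Q[knn, a] += alpha * td_error * p
```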

The k-nearest neighbors (k-NN) technique [6], [7], [8] is a method for classifying objects based on the closest training examples in the feature space. The training examples are mapped into a multidimensional feature space, which is partitioned into regions according to the class labels of the training samples. A point in the space is assigned to the class that is most frequent among its k nearest training samples. The determining parameters in k-NN techniques are the number k, which fixes how many units are considered as neighbors, and the distance function; generally the Euclidean distance is used.
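
As a reference point, a minimal k-NN classifier with Euclidean distance and majority vote could look like the following (illustrative only):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Assign x to the most frequent class label among its k nearest
    training samples under the Euclidean distance."""
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
```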

It has been proved [7] that there is a close relation between nearest neighbors methods and the Bayesian theory of optimal decision making. One of the most relevant results along this line is the Bayesian error bound for the nearest neighbor method, which establishes that the asymptotic probability of error of the NN-rule is less than twice the Bayes error rate [7].
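
In symbols, the Cover and Hart bound for a c-class problem reads

$$P^{*} \le P_{\mathrm{NN}} \le P^{*}\left(2 - \tfrac{c}{c-1}\,P^{*}\right) \le 2P^{*},$$

where $P^{*}$ is the Bayes error rate and $P_{\mathrm{NN}}$ is the asymptotic error rate of the 1-NN rule.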

There have been other uses of the k-NN rule, e.g. in dynamic programming (DP) [9] for function approximation, with successful results. We must also mention that there is a relation between the kind of algorithms that we propose and the so-called "locally weighted learning" (LWL) methods surveyed by Atkeson and Moore [4]. LWL approximations can also be viewed as a particular kind of artificial neural network (ANN) for approximating smooth functions [5].

In this paper, we show the use of the k-NN technique in the development of three RL algorithms: the kNN-TD, kNN-TD(λ) and Exa algorithms. We also present an online adaptive filter, the k-NNδs online adaptive filter, for dealing with highly noisy environments. These algorithms are based on some preliminary works [12], [13], [14]. Public (open-source) implementations of these algorithms are published on the web page http://www.dia.fi.upm.es/∼jamartin/download.htm so that these tools can be used, evaluated or improved by interested researchers.

The rest of the paper is organized as follows: Section 2 presents a basic and very simple TD-learning algorithm based on the simple nearest neighbor approach and describes some issues related to this algorithm. Section 3 presents the details of the kNN-TD and kNN-TD(λ) algorithms. In Section 4 the continuous actions Exa reinforcement learning algorithm is presented. Section 5 describes the k-NNδs online adaptive filter for handling noisy environments. The experimental results and performance remarks are presented in Section 6. Finally, in Section 7 the conclusions and further work are presented.

Section snippets

A basic TD-learning algorithm by using simple nearest neighbor

By using the simple nearest neighbor approach, a very basic but sufficiently general TD-learning algorithm can be developed.

The first task to address is the definition of the points which will act as the neighbors of the observations (s). Although there are many ways of covering the space, a simple procedure is to generate points uniformly distributed over the complete state space.

Finally, for each defined point xi in the state space there will be an associated action-value predictor Q(i,a).
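
A minimal sketch of this setup is given below, assuming for illustration the classical Mountain-Car state bounds and three discrete actions:

```python
import itertools
import numpy as np

def make_centers(lows, highs, points_per_dim):
    """Points uniformly distributed over the state space: the Cartesian
    product of evenly spaced values along each dimension."""
    axes = [np.linspace(lo, hi, points_per_dim) for lo, hi in zip(lows, highs)]
    return np.array(list(itertools.product(*axes)))

# Assumed Mountain-Car bounds (position, velocity) and 3 actions.
centers = make_centers(lows=[-1.2, -0.07], highs=[0.6, 0.07], points_per_dim=20)
Q = np.zeros((len(centers), 3))   # one action-value predictor Q(i, a) per point x_i
```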

A pseudo-code

The kNN-TD reinforcement learning algorithm

The algorithm description will follow the classical interaction cycle in RL, which can be described by four basic steps, as shown in Fig. 1 (a minimal sketch of this loop is given after the list):

1. Perceive the state s of the environment.

2. Emit an action a to the environment based on s.

3. Observe the new state s′ of the environment and the reward r received.

4. Learn from the experience based on s, a, s′ and r.
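
A minimal sketch of this perceive-act-learn loop, with hypothetical env and agent interfaces, could look like:

```python
def run_episode(env, agent, max_steps=1000):
    """One episode of the basic RL interaction cycle (interfaces assumed)."""
    s = env.reset()                       # 1. perceive the state s
    for _ in range(max_steps):
        a = agent.select_action(s)        # 2. emit an action a based on s
        s_next, r, done = env.step(a)     # 3. observe the new state s' and reward r
        agent.learn(s, a, s_next, r)      # 4. learn from (s, a, s', r)
        s = s_next
        if done:
            break
```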

Continuous actions: the Exa(λ) algorithm

A pseudo-code of the Exa(λ) RL algorithm is presented in Algorithm 4. This algorithm is designed to deal with problems where the use of continuous actions has clear advantages over the use of fine-grained discrete actions; some reasons for such an advantage include the tractability and feasibility of the action space, the learning speed, and the accuracy of the approximated suboptimal actions with respect to the optimal actions.

Algorithm 4

The Exa(λ) reinforcement learning control algorithm for continuous

Handling noisy environments: the k-NNδs online adaptive filter

The k-NNδs is an online adaptive filter for noisy observations in continuous state spaces that learns in an incremental form.

Since in some RL problems the observations are highly noisy, we have developed an online adaptive filter based on the k-NN technique. This filter learns the expected value of the state transition difference: it learns from a state-action-next-state triplet (s,a,s′) and, for each pair (s,a), it must learn a delta (δ) which will represent the difference
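
Only the general idea of the filter is quoted above; the sketch below shows one way such an incremental (s,a) → δ estimator could be organized. The class name, the learning rate β and the blending step are assumptions made for illustration, not the paper's algorithm.

```python
import numpy as np

class KNNDeltaFilter:
    """Illustrative online filter in the spirit of k-NN-delta_s: keep a running
    estimate of the expected transition difference delta = s' - s for each
    (nearest center, action) pair, and use it to smooth noisy observations."""

    def __init__(self, centers, n_actions, beta=0.1):
        self.centers = centers
        self.delta = np.zeros((len(centers), n_actions, centers.shape[1]))
        self.beta = beta                               # assumed incremental rate

    def _nearest(self, s):
        return int(np.argmin(np.linalg.norm(self.centers - s, axis=1)))

    def update(self, s, a, s_next):
        """Learn from the triplet (s, a, s'): move delta toward s' - s."""
        i = self._nearest(s)
        self.delta[i, a] += self.beta * ((s_next - s) - self.delta[i, a])

    def filter(self, s, a, s_next_noisy, mix=0.5):
        """Blend the noisy observation with the prediction s + E[delta]."""
        i = self._nearest(s)
        return mix * s_next_noisy + (1.0 - mix) * (s + self.delta[i, a])
```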

Experimental results

The kNN-TD family of algorithms has been intensively tested on various kinds of problems with very robust results. These problems range from the classical Mountain-Car, Acrobot and Cart-Pole problems to specific problems in robot control.

These algorithms (the kNN-TD(λ) methods) were rigorously tested at the Second and Third Annual Reinforcement Learning Competitions (RLC2008 and RLC2009), held in Helsinki and Montreal respectively [1], [20], where the kNN-TD(λ) method (JAMH team) won

Conclusions and further work

We have presented a series of robust, high performance (award winning) implementations of reinforcement learning algorithms based on temporal-difference learning and weighted k-nearest neighbors. These algorithms were rigorously tested at the Second and Third Annual Reinforcement Learning Competitions (RLC2008 and RLC2009), held in Helsinki and Montreal respectively, where the kNN-TD(λ) method (JAMH team) won in the PolyAthlon 2008 domain, obtained the second place in 2009 and also the second

Acknowledgement

This work has been partially funded by the Spanish Ministry of Science and Technology, project DPI2006-15346-C03-02.

References (20)

  • C.W. Anderson, Strategy learning in multilayer connectionist representations.
  • The 2008 reinforcement learning competition, June 2008. URL...
  • P. Abbeel, V. Ganapathi, A.Y. Ng, Learning vehicular dynamics, with application to modeling helicopters, in: NIPS,...
  • C. Atkeson et al., Locally weighted learning, AI Review (1997).
  • S. Bosman, Locally weighted approximations: yet another type of neural network, Master's Thesis, Intelligent Autonomous...
  • T.M. Cover et al., Nearest neighbor pattern classification, IEEE Transactions on Information Theory (1967).
  • R.O. Duda et al., Pattern Classification and Scene Analysis (1973).
  • S.A. Dudani, The distance-weighted k-nearest-neighbor rule, IEEE Transactions on Systems, Man and Cybernetics (1976).
  • G.J. Gordon, Stable function approximation in dynamic programming, in: ICML, 1995, pp....
  • P. Indyk, R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, in: STOC, 1998, pp....
There are more references available in the full text version of this article.

Jose Antonio Martin H. received the B.S. and M.S. degrees in Computer Science from La Universidad del Zulia (LUZ) in 2002 and the Ph.D. degree in Computer Science from Universidad Politécnica de Madrid (UPM), Madrid, in 2009, with the thesis "Studies on Adaptive Systems with applications in Autonomous Robots and Intelligent Agents." In addition, he is on the Ph.D. program of "Fundamentals of Basic Psychology" at the Universidad Nacional de Educación a Distancia (UNED), Madrid, where he has received the Advanced Studies Diploma on "a computational model of the equivalence class formation psychological phenomenon." Since 2005, he has been with the Department of Informatic Systems and Computing, Universidad Complutense de Madrid (UCM). His main research areas are cybernetics, machine learning and machine perception.

Javier de Lope (SM’94, M’98) received the M.Sc. in Computer Science from the Universidad Politécnica de Madrid in 1994 and the Ph.D. degree at the same university in 1998. Currently, he is Associate Professor in the Department of Applied Intelligent Systems at the Universidad Politécnica de Madrid. His current research interest is centered on the study, design and construction of modular robots and multi-robot systems, and in the development of control systems based on soft-computing techniques. He is currently leading a three-year R&D project for developing industrial robotics mechanisms which follow the guidelines of multi-robot systems and reconfigurable robotics. In the past he also worked on projects related to the computer-aided automatic driving by means of external cameras and range sensors and the design and control of humanoid and flying robots.

Darío Maravall (SM’78, M’80) received the M.Sc. in Telecommunication Engineering from the Universidad Politécnica de Madrid in 1978 and the Ph.D. degree at the same university in 1980. From 1980 to 1988 he was Associate Professor at the School of Telecommunication Engineering, Universidad Politécnica de Madrid. In 1988 he was promoted to Full Professor at the Faculty of Computer Science, Universidad Politécnica de Madrid. From 2000 to 2004 he was the Director of the Department of Artificial Intelligence of the Faculty of Computer Science at the Universidad Politécnica de Madrid. His current research interests include computer vision, autonomous robots and computational intelligence. He has published extensively on these subjects and has directed more than 20 funded projects, including a five-year R&D project for the automated inspection of wooden pallets using computer vision techniques and robotic mechanisms, with several operating plants in a number of European countries. As a result of this project he holds a patent issued by the European Patent Office at The Hague, The Netherlands.
