Continuous action reinforcement learning automata and their application to adaptive digital filter design

https://doi.org/10.1016/S0952-1976(01)00034-3

Abstract

In the design of adaptive IIR filters, the multi-modal nature of the error surfaces can limit the use of gradient-based and other iterative search methods. Stochastic learning automata have previously been shown to have global optimisation properties, making them suitable for the optimisation of filter coefficients. Continuous action reinforcement learning automata are presented as an extension of the standard automata, which operate over discrete parameter sets. Global convergence properties are discussed, and the technique is demonstrated through a number of computer simulations.

Introduction

Adaptive digital filters have found applications in a large number of areas such as signal processing, adaptive control, communication systems and noise cancellation. The objective of the adaptation is to adjust the coefficients of a digital filter to match an unknown system transfer function. This is a specific case of the more general problem of system identification, where model parameters of an unknown dynamic system are estimated from input/output data. The minimisation of a performance criterion, typically the mean squared error between filter output and desired response, is often attempted using iterative search techniques. However, when the optimisation surface is multi-modal, least-squares and gradient-based methods often fail to converge to the global minimum. The global optimisation properties of learning systems, such as genetic algorithms (Holland, 1975) and learning automata (Narendra and Thathachar, 1989), make them an attractive alternative in these cases. Here, an approach based on learning automata is developed, as this is more applicable to adaptive systems that require learning in real time.

Discrete stochastic learning automata (Baba, 1984; Narendra and Thathachar, 1989; Najim and Poznyak, 1994) can operate in random and unknown environments, and have been successfully applied in a number of areas, including practical engineering problems such as suspension control (Frost et al., 1996) and power system regulation (Wu, 1993). They operate by selecting actions via a stochastic process; these actions operate on an environment and are assessed according to a measure of system performance. Fig. 1 shows the typical learning system architecture: the automaton selects actions (X) probabilistically; these are applied to the environment, and the performance evaluation function provides a reinforcement signal β. This is used to update the automaton's internal probability distribution, whereby actions that achieve desirable performance are reinforced via an increased probability, while those that do not are penalised or left unchanged, depending on the particular learning rule employed. The result is that, over time, the average performance of the system improves until some limit is reached. In terms of optimisation problems, the action with the highest probability at convergence corresponds to the global minimum, and rigorous proofs of convergence are available (Narendra and Thathachar, 1989; Najim and Poznyak, 1994).

A wide variety of learning rules have been reported in the literature, and one of the most widely used algorithms is the linear reward/inaction (LRI) scheme, which has been shown to have guaranteed and satisfactory convergence properties (Narendra and Thathachar, 1989). In response to action $x_i$ being selected at time step $n$, the probabilities are updated as follows:

$$p_i(n+1) = p_i(n) + \theta\,\beta(n)\,\bigl(1 - p_i(n)\bigr),$$
$$p_j(n+1) = p_j(n) - \theta\,\beta(n)\,p_j(n) \quad \text{if } j \neq i,$$

where $\theta$ is a learning rate parameter, $0 < \theta < 1$, and $\beta \in [0,1]$ is the reinforcement signal; $\beta = 1$ indicates the maximum reward and $\beta = 0$ is a null reward. Eventually, the probability of successful actions increases towards unity, and if a single, most successful action predominates in this way, the automaton is deemed to have converged.
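As a concrete illustration of this update rule, the sketch below implements a small LRI automaton interacting with a stand-in stochastic environment. Only the update equations come from the text above; the environment function, the number of actions, and the values of θ and the reward statistics are assumptions made purely for demonstration.

```python
import random

def l_ri_step(p, i, beta, theta=0.1):
    """Linear reward/inaction update after action i received reinforcement beta."""
    return [pk + theta * beta * (1.0 - pk) if k == i else pk - theta * beta * pk
            for k, pk in enumerate(p)]

def select_action(p):
    """Sample an action index from the probability vector p."""
    r, acc = random.random(), 0.0
    for i, pi in enumerate(p):
        acc += pi
        if r <= acc:
            return i
    return len(p) - 1

def environment(i):
    # Hypothetical noisy environment: action 2 gives the best average reward.
    return max(0.0, min(1.0, random.gauss(0.2 + 0.3 * (i == 2), 0.1)))

p = [0.25] * 4                        # uniform initial action probabilities
for n in range(2000):
    i = select_action(p)
    p = l_ri_step(p, i, environment(i))
print(p)                              # p[2] should be close to unity
```

Because β = 0 leaves the probabilities unchanged, unsuccessful actions are never explicitly penalised; this is the "inaction" part of the scheme.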

With a large number of discrete actions, the probability of selecting any particular action becomes low and the convergence time can become excessive. To avoid this, learning automata can be connected in parallel as shown in Fig. 2. Each automaton operates on a smaller number of actions and the ‘team’ works together in a co-operative manner. This scheme can also be used where multiple actions are required.
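Continuing the sketch above (and reusing its l_ri_step and select_action helpers), the fragment below shows one way such a team might be arranged: each automaton selects one coordinate of a two-parameter action, and both are updated with the same shared reinforcement signal. The action grid and reward function are hypothetical.

```python
import random

actions = [-0.5, 0.0, 0.5, 1.0]        # hypothetical discretised parameter values
p_team = [[0.25] * 4, [0.25] * 4]      # one probability vector per automaton

def shared_reward(x, y):
    # Hypothetical environment: reward peaks at (0.5, -0.5); noisy evaluation.
    err = (x - 0.5) ** 2 + (y + 0.5) ** 2 + random.gauss(0.0, 0.01)
    return max(0.0, min(1.0, 1.0 - err))

for n in range(5000):
    idx = [select_action(p) for p in p_team]          # each automaton acts
    beta = shared_reward(actions[idx[0]], actions[idx[1]])
    p_team = [l_ri_step(p, i, beta, theta=0.02)       # common reinforcement
              for p, i in zip(p_team, idx)]
```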

Discrete stochastic learning automata can be used to determine global optima for digital filters that have multi-modal mean square error surfaces (Tang et al., 1989, 1991, 1993). However, the discrete nature of the automata requires the discretisation of a continuous parameter space, and the level of quantisation tends to reduce the convergence rate. To help overcome this, a sequential approach may be adopted (Frost et al., 1996), in which an initial coarse quantisation is later refined by re-quantisation around the most successful action. In this paper, an inherently continuous form of the learning automaton is used to speed the learning process and avoid this additional complexity.

Section snippets

CARLA algorithm

The continuous action reinforcement learning automaton (CARLA) was developed as an extension of the discrete stochastic learning automaton for applications involving the search of a continuous action space in a random environment (Howell, 1997). Several CARLA can be connected in parallel, in a similar manner to discrete automata, to search multidimensional action spaces. The automaton's discrete probability distribution is replaced with a continuous probability density function, which is again used …
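Since the snippet above is truncated, the following is only a minimal numerical sketch of a CARLA: the continuous density is held on a grid, actions are drawn by inverting the cumulative distribution, and reinforcement adds a symmetric Gaussian "bump" centred on the selected action before renormalisation. The kernel form and the width and height parameters gw and gh are assumptions drawn from the wider CARLA literature (Howell, 1997), not values taken from this paper.

```python
import numpy as np

class CARLA:
    def __init__(self, xmin, xmax, n_grid=500, gw=0.02, gh=0.3):
        self.x = np.linspace(xmin, xmax, n_grid)
        self.f = np.full(n_grid, 1.0 / (xmax - xmin))  # uniform initial density
        self.gw = gw * (xmax - xmin)                   # kernel width (assumed)
        self.gh = gh                                   # kernel height (assumed)

    def select_action(self):
        """Draw an action by inverting the cumulative distribution of f."""
        cdf = np.cumsum(self.f)
        cdf /= cdf[-1]
        return float(np.interp(np.random.rand(), cdf, self.x))

    def update(self, action, beta):
        """Reinforce the density around the chosen action, then renormalise."""
        bump = self.gh * np.exp(-0.5 * ((self.x - action) / self.gw) ** 2)
        self.f = self.f + beta * bump
        dx = self.x[1] - self.x[0]
        self.f /= self.f.sum() * dx                    # keep f a valid density
```

Running several such objects in parallel, one per filter coefficient, with a common reinforcement β mirrors the discrete team arrangement of Fig. 2.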

Implementation

The new technique is now applied to some specific examples taken from the work of Tang et al. (1989, 1991, 1993). Fig. 4 represents the structure of these examples, which are implemented as stochastic optimisation problems, the input signal and the corrupting noise being generated by pseudo-random signals. For simplicity, the corrupting noise has been set to zero in the examples presented here. The examples contain only a small number of unknown filter parameters, …
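As an illustration of the structure in Fig. 4, the sketch below drives a candidate second-order IIR filter and a hypothetical unknown plant with the same pseudo-random input (with the corrupting noise set to zero, as in the text) and returns the mean squared output error. The plant coefficients are invented for demonstration and are not those of Tang et al.'s examples.

```python
import numpy as np

def second_order_iir(b0, a1, a2, u):
    """y(k) = b0*u(k) + a1*y(k-1) + a2*y(k-2)."""
    y = np.zeros_like(u)
    for k in range(len(u)):
        y[k] = b0 * u[k]
        if k >= 1:
            y[k] += a1 * y[k - 1]
        if k >= 2:
            y[k] += a2 * y[k - 2]
    return y

rng = np.random.default_rng(0)
u = rng.uniform(-1.0, 1.0, 1000)            # pseudo-random input signal
d = second_order_iir(1.0, 1.2, -0.6, u)     # hypothetical unknown plant output

def cost(params):
    """Mean squared error between candidate filter output and plant output."""
    b0, a1, a2 = params
    y = second_order_iir(b0, a1, a2, u)
    return float(np.mean((d - y) ** 2))
```

A reinforcement signal β ∈ [0, 1] can then be formed by any monotone decreasing mapping of this cost, so that lower error yields higher reward.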

Variable parameter space

In the preceding section, the minima of the cost function were obtained over a fixed set of parameter ranges. The range over which the probability density function, and hence the search range of the CARLA, is defined was determined from knowledge of the environment, primarily by restricting the search region to within the stability triangle. The Gaussian distribution obtained in these examples could be considered too wide, and a way of reducing or expanding the search …
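For the second-order recursion used in the previous sketch, the stability triangle has a simple closed form; a minimal check under that sign convention might look as follows, with the out-of-triangle penalty being an assumption for illustration rather than the paper's mechanism.

```python
def in_stability_triangle(a1, a2):
    """True iff both poles of y(k) = b0*u(k) + a1*y(k-1) + a2*y(k-2)
    lie strictly inside the unit circle."""
    return abs(a2) < 1.0 and abs(a1) < 1.0 - a2

def penalised_cost(params, cost):
    """Reject parameter draws that fall outside the stability triangle."""
    b0, a1, a2 = params
    return cost(params) if in_stability_triangle(a1, a2) else float("inf")
```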

Summary and conclusions

This paper has considered the application of a new learning method to the problem of selecting optimal parameters in digital IIR filters, to give minimum mean square output error relative to a prescribed, but unknown, plant. The work has focused on obtaining global minima in situations where multi-modal error surfaces easily lead to local minima. The known global convergence properties of learning automata have motivated their use in this problem class, and the implementation of a …

References (13)

  • M.N. Howell et al., Continuous action reinforcement learning applied to vehicle suspension control. Mechatronics (1997)
  • N. Baba, New topics in learning automata theory and applications (1984)
  • H. Fan, W.K. Jenkins, A new adaptive IIR filter. IEEE Transactions on Circuits and Systems CAS-33 (10), ... (1986)
  • G.P. Frost et al., Reinforcement learning of active and semi-active vehicle suspension control laws. Proceedings of the Institution of Mechanical Engineers Part I (1996)
  • J. Holland, Adaptation in Natural and Artificial Systems (1975)
  • C.R. Johnson et al., Comments on and additions to "An adaptive recursive LMS filter". Proceedings of the IEEE (1977)
There are more references available in the full text version of this article.

Cited by (46)

  • A novel dynamics model of ball-screw feed drives based on theoretical derivations and deep learning

    2019, Mechanism and Machine Theory
    Citation Excerpt :

    Neural networks are adopted here to approximate the gap between dynamics model and physical plant for their capability to learn any input-output mapping with the desired accuracy when sufficient number of hidden neurons are available. A novel parameter identification method named continuous action reinforcement learning automata is adopted to identify parameters [33–35]. The whole framework is shown in Fig. 1.

  • Automatic data clustering using continuous action-set learning automata and its application in segmentation of images

    2017, Applied Soft Computing Journal
    Citation Excerpt :

    Recently, researchers have used some techniques to cope with this problem [13,15,16,19,20]. Learning automata (LA) [21,22] is a heuristic approach that can be used in a wide range of applications such as pattern recognition [23], signal processing [25], adaptive control [24], and image segmentation [27]. Unlike nature inspired meta-heuristic algorithms, the proposed algorithm benefits from the CALA algorithm features [49] and has several advantages: (1) techniques based on nature inspired meta-heuristic algorithms (for example ACDE, DCPSO, GCUK and classical DE) for finding the correct number of clusters and their partitions use encoding methods such as chromosome encoding or similar methods.

  • Learning automata for image segmentation

    2016, Pattern Recognition Letters
    Citation Excerpt :

    Different learning automaton algorithms are distinguished by the ways their PDFs are updated. They find applications in signal processing [14], feedback control systems [13,33], power systems [30], and image processing [2,8,22,25]. The randomness in the generation of parameter estimates provides learning automaton algorithms with obvious advantages over the gradient-based and EM algorithms.


Mark Howell is a researcher in the department of Aeronautical and Automotive Engineering at Loughborough University. He received his Ph.D. from Sheffield University in 1994. His research interests are in machine learning and intelligent systems for control and optimisation, specifically in the development and application of reinforcement learning systems and evolutionary based algorithms for practical control solutions to engineering problems.

Tim Gordon is Ford Professor of Automotive Engineering at Loughborough University. He chairs the Dynamics and Control Research Group within the Department of Aeronautical and Automotive Engineering, and heads the teaching and research activities in Automotive Engineering. These activities involve more than 20 people within the Aero/Auto department—academics, research assistants and research students. Research interests include advanced suspension control and system identification, driveline modelling and control for electric and hybrid vehicles, crash impact sensing for intelligent occupant protection systems, integration of chassis control systems, and vehicle stability augmentation via active aerodynamic devices. Recent work has included an EPSRC sponsored project in the practical implementation of reinforcement learning on advanced vehicle suspensions. Many of the automotive research projects are industry related, with sponsorship from Ford, Rover/BMW, and Ricardo.
