
About this book

Learning constitutes one of the most important phases of psychological processes and is essential in many ways for the changes in behavior by which organisms adjust. In a broad sense, the influence of prior behavior and its consequences upon subsequent behavior is usually accepted as a definition of learning. Until recently, learning was regarded as the prerogative of living beings, but in the past few decades there have been attempts, with considerable success, to construct learning machines or systems. This book deals with a powerful class of learning algorithms that have been developed over the past two decades in the context of learning systems modelled by finite-state probabilistic automata. These algorithms are very simple iterative schemes. Mathematically, they define two distinct classes of Markov processes with the unit simplex (of suitable dimension) as state space. The basic problem of learning is viewed as one of finding conditions on the algorithm such that the associated Markov process has a prespecified asymptotic behavior. A first course in analysis and in stochastic processes is adequate preparation for following the development in the various chapters.

Table of contents

Frontmatter

Theory

Frontmatter

Chapter 1. Introduction

Abstract
Learning has been of interest to psychologists and mathematicians for decades and, more recently, to engineers and computer scientists. The interest of a psychologist or a mathematician in learning is to explain or describe the way in which animals learn a variety of skills by observing the changes in their behavior. Such an approach may be termed the descriptive approach. A wide range of mathematical models have been developed for this purpose; the work of Bush and Mosteller [B10], Luce [L14], Hilgard [H4], Norman [N10], and Iosifescu and Theodorescu [I2], to mention a few, belongs to this category. In contrast, in systems theory and computer science the aim is to develop a computer program or build a machine, perhaps in the context of pattern recognition or artificial intelligence, that will learn to perform certain prespecified tasks, such as playing games or classifying a class of X-ray pictures to diagnose disease; see, for example, Fu [F4] [F5] [F6], Tsypkin [T4] [T5], Sklansky [S9], Mendel [M6] [M7], Saridis [S4], Nilsson [N9], Csibi [C10], Slagle [S10], and Fukunaga [F8]. Such an approach is often called the prescriptive approach to learning.
S. Lakshmivarahan

Chapter 2. Ergodic Learning Algorithms

Abstract
This chapter presents an analysis of the general non-linear reward-penalty ergodic algorithms, denoted N^E_{R-P}. The basic property that characterizes this class of algorithms is that all states under them are non-absorbing. The now classic linear reward-penalty algorithm L^E_{R-P} is a special case of this algorithm. It is well known [C1] that the L^E_{R-P} algorithm is only expedient. Using the theory of Markov processes that evolve by small steps [N14], a variety of characterizations of the process {p(k), k ≥ 0}, such as the evolution of its mean and variance and in fact its actual sample-path behavior, are given. As a by-product, it is proved that there exists a proper choice of parameters and functions such that the N^E_{R-P} algorithm is ε-optimal.
S. Lakshmivarahan
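As an illustration of the kind of iterative scheme analyzed in this chapter, the following is a minimal sketch of a symmetric two-action linear reward-penalty update operating in a stationary random environment. The function name, the step length θ = 0.05, and the success probabilities are illustrative choices, not taken from the book; only the update rule itself is the standard L_{R-P} scheme.

```python
import random

def linear_reward_penalty_step(p, theta, success_prob):
    """One step of a symmetric two-action linear reward-penalty (L_R-P) update.

    p            : probability of choosing the first action (a point of the simplex S_2)
    theta        : step-length parameter in (0, 1)
    success_prob : success_prob[i] is the (unknown) probability that action i is rewarded
    """
    action = 0 if random.random() < p else 1
    rewarded = random.random() < success_prob[action]
    q = p if action == 0 else 1.0 - p           # probability of the chosen action
    if rewarded:
        q += theta * (1.0 - q)                  # reward: move toward the chosen action
    else:
        q *= 1.0 - theta                        # penalty: move away from the chosen action
    return q if action == 0 else 1.0 - q

# Illustrative run: for 0 < theta < 1 the state p(k) never reaches 0 or 1,
# so every state is non-absorbing and the process is ergodic.
p = 0.5
for k in range(10_000):
    p = linear_reward_penalty_step(p, theta=0.05, success_prob=(0.8, 0.4))
print(p)
```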

Chapter 3. Absolutely Expedient Algorithms

Abstract
This chapter deals with the analysis and design of non-linear reward-penalty learning algorithms of the absolutely expedient type, denoted N^A_{R-P}. A fundamental property that distinguishes this class of algorithms from those of Chapter 2 is that the set V_M = {e_j | j = 1, 2, …, M}, where e_j is the j-th unit vector, consisting of the corners of the simplex S_M, forms the (only) absorbing states of the Markov process p(k). Each e_j ∈ V_M is topologically closed, and it will be shown that e_j is also stochastically closed under N^A_{R-P} algorithms. Thus, in this case each element of V_M (there are M of them), in the sense of Problem 1.1, forms an ergodic kernel, as opposed to the single ergodic kernel (which is S_M) of the N^E_{R-P} algorithms discussed in Chapter 2. The presence of such multiple ergodic kernels makes the asymptotic behavior of the Markov process p(k) under this class of algorithms, as is to be expected, very much dependent on the initial state, in contrast to the independence of the asymptotic behavior of p(k) from the initial state in the case of the N^E_{R-P} algorithms.
S. Lakshmivarahan
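To illustrate the contrast with Chapter 2, here is a sketch of the two-action linear reward-inaction scheme, a well-known member of the absolutely expedient family, in which the corners of the simplex become absorbing states. Parameter values are again illustrative assumptions.

```python
import random

def linear_reward_inaction_step(p, theta, success_prob):
    """Two-action linear reward-inaction (L_R-I) update: the state moves only
    on reward, so the corners p = 0 and p = 1 of the simplex S_2 are absorbing."""
    action = 0 if random.random() < p else 1
    rewarded = random.random() < success_prob[action]
    if rewarded:
        if action == 0:
            p += theta * (1.0 - p)              # push toward the corner e_1
        else:
            p -= theta * p                      # push toward the corner e_2
    return p                                    # on failure nothing changes

p = 0.5
for k in range(10_000):
    p = linear_reward_inaction_step(p, theta=0.05, success_prob=(0.8, 0.4))
print(p)  # with high probability p is absorbed near the corner of the better action
```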

Chapter 4. Time Varying Learning Algorithms

Abstract
In this chapter we analyze the time-varying analogues of the ergodic learning algorithms presented in Chapter 2. These algorithms are obtained from algorithm (2.1) by letting the step-length parameter θ vary with time. The time-varying algorithms generate non-stationary Markov processes over the simplex S_M. Our aim is to present different methods of asymptotic analysis for these time-varying learning algorithms. In the interest of brevity we consider only the case M = 2, leaving the obvious extensions as exercises.
S. Lakshmivarahan
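A minimal sketch of the time-varying idea: the reward-penalty step sketched under Chapter 2 is repeated with a step length θ(k) that decreases with k, so the resulting Markov process is non-stationary. The harmonic schedule below is an assumed example, not necessarily the schedule analyzed in the chapter.

```python
import random

# Time-varying variant of the ergodic scheme: the step length theta_k decreases
# with k (a harmonic schedule, assumed here only for illustration), so the
# process p(k) over the simplex S_2 is a non-stationary Markov process.
success_prob = (0.8, 0.4)       # illustrative environment
p, theta0 = 0.5, 0.2
for k in range(10_000):
    theta = theta0 / (1 + k)
    action = 0 if random.random() < p else 1
    rewarded = random.random() < success_prob[action]
    q = p if action == 0 else 1.0 - p
    q = q + theta * (1.0 - q) if rewarded else q * (1.0 - theta)
    p = q if action == 0 else 1.0 - q
print(p)
```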

Applications

Frontmatter

Chapter 5. Two Person Zero-Sum Sequential Stochastic Games with Imperfect and Incomplete Information — Game Matrix with Saddle Point in Pure Strategies

Abstract
In this and the next chapter we present an application of the learning algorithms developed in the previous chapters to two-person zero-sum games. Let A and B be the two players; both are allowed to use mixed strategies. At any instant each player picks a pure strategy as a sample realization from his mixed strategy. As a result of their joint action they receive a random outcome which is either a success or a failure. Since the game is zero-sum, A's success is B's failure and vice versa. The following assumptions are fundamental to our analysis: neither player has any knowledge of the set of pure strategies available to the other player, of the pure strategy actually chosen by the other player at any stage of the game, or of the distribution of the random outcome as a function of the pure strategies chosen by them. Based only on the pure strategy chosen by him and the random outcome he receives, each player individually updates his mixed strategy using a learning algorithm. This cycle continues and thus the game is played sequentially. In short, we consider a zero-sum game between two players in which the players are totally decentralized: there is no communication or transfer of information between them either before or during the course of play, and in fact they may not even know that they are involved in a game situation at all. In this set-up our aim is to find conditions on the learning algorithms such that both players in the long run receive an expected payoff as close to the well-established game-theoretic solution (the von Neumann value) as desired.
S. Lakshmivarahan
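The decentralized play described above can be sketched as follows. The 2x2 success-probability matrix (which has a saddle point in pure strategies at row 0, column 1) and all parameter values are hypothetical, and the use of a symmetric linear reward-penalty update for both players is likewise only illustrative.

```python
import random

# Hypothetical 2x2 game: d[i][j] is the probability that player A succeeds when
# A plays row i and B plays column j; B succeeds otherwise (zero-sum outcome).
# This matrix has a saddle point in pure strategies at (row 0, column 1), value 0.6.
d = [[0.8, 0.6],
     [0.4, 0.3]]

def lrp_step(p, theta, chosen, success):
    """Two-action linear reward-penalty update, applied by each player separately."""
    q = p if chosen == 0 else 1.0 - p
    q = q + theta * (1.0 - q) if success else q * (1.0 - theta)
    return q if chosen == 0 else 1.0 - q

pA = pB = 0.5                                   # initial mixed strategies
for k in range(50_000):
    i = 0 if random.random() < pA else 1        # A samples a pure strategy
    j = 0 if random.random() < pB else 1        # B samples a pure strategy
    a_wins = random.random() < d[i][j]          # random zero-sum outcome
    pA = lrp_step(pA, 0.01, i, a_wins)          # each player uses only its own
    pB = lrp_step(pB, 0.01, j, not a_wins)      # action and its own outcome
print(pA, pB)  # pA tends to favor row 0 and pB column 1 (approximately; L_R-P is only expedient)
```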

Chapter 6. Two Person Zero-Sum Sequential Stochastic Games with Imperfect and Incomplete Information — General Case

Abstract
In this chapter we present a unified approach to two-person zero-sum games with incomplete and imperfect information in which the game matrix may not always have a saddle point in pure strategies. This is a natural extension of the problem of Chapter 5. Under the assumption that both players A and B use the L^E_{R-P} learning algorithm with the same reward and penalty parameters, the penalty parameter being very small compared to the reward parameter, it is shown that the expected mixed strategy of either player can be made, asymptotically, as close to the optimal strategy dictated by game theory as desired, irrespective of whether or not the game matrix has a saddle point in pure strategies.
S. Lakshmivarahan
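The parameter relationship assumed in this chapter (penalty parameter much smaller than the reward parameter) can be written as a small variation of the update sketched under Chapter 5; the particular values suggested in the comment are purely illustrative.

```python
def reward_epsilon_penalty_step(p, a, b, chosen, success):
    """Two-action linear update with reward parameter a and a much smaller
    penalty parameter b (b << a), as assumed for both players in this chapter."""
    q = p if chosen == 0 else 1.0 - p
    q = q + a * (1.0 - q) if success else q * (1.0 - b)
    return q if chosen == 0 else 1.0 - q

# e.g. a = 0.01 and b = 0.0001, so that b / a = 0.01 is small (illustrative values only)
```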

Chapter 7. Two Person Decentralized Team Problem With Incomplete Information

Abstract
In this chapter we propose a learning approach to the two-person decentralized team problem with incomplete information. Let A and B be the two persons, each with two actions at their disposal. At any instant each person picks an action (perhaps randomly); let i be the action chosen by A and j the action chosen by B. As a result of their joint actions, both A and B receive the same outcome. The outcome is, in general, a random variable whose distribution depends on the action pair (i, j). We assume that the outcome is two-valued, +1 (unit gain) or −1 (unit loss), and that neither person has knowledge of the set of actions available to the other person, of the actual action chosen by the other person at any instant of time, or of the distribution of the random outcome as a function of the pair (i, j) of actions. In other words, we consider an interaction between persons in which there is no transfer of information of any kind at any stage. Based only on the action chosen and the random outcome he receives, each person updates the probability distribution over the set of his own actions using a learning algorithm. In this setup our problem is to find conditions on the learning algorithm such that asymptotically each person maximizes his expected outcome or payoff.
S. Lakshmivarahan
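A sketch of the team setting: both persons receive the same binary outcome, and each updates only the probabilities over its own actions. The common-payoff table and the use of a reward-inaction update are assumptions made for illustration; the chapter's actual algorithm and conditions are developed in the text.

```python
import random

# Hypothetical common-payoff table: c[i][j] is the probability that the joint
# action (i, j) yields +1 (unit gain) rather than -1 (unit loss) for both persons.
c = [[0.9, 0.2],
     [0.3, 0.6]]

def lri_step(p, theta, chosen, gained):
    """Two-action reward-inaction update, applied independently by each person."""
    q = p if chosen == 0 else 1.0 - p
    if gained:
        q += theta * (1.0 - q)
    return q if chosen == 0 else 1.0 - q

pA = pB = 0.5
for k in range(50_000):
    i = 0 if random.random() < pA else 1
    j = 0 if random.random() < pB else 1
    gained = random.random() < c[i][j]          # both persons observe the same outcome
    pA = lri_step(pA, 0.01, i, gained)
    pB = lri_step(pB, 0.01, j, gained)
print(pA, pB)  # tends toward the jointly best action pair (0, 0) in this example
```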

Chapter 8. Control of a Markov Chain with Unknown Dynamics and Cost Structure

Abstract
This chapter deals with the application of the "absolutely expedient" learning algorithms (developed in Chapter 3) to the problem of controlling a finite-state Markov chain whose transition probabilities, as a function of a finite number of control actions, are unknown. At any instant of time, depending on the state of the Markov chain and the control action chosen, a reward is incurred. It is assumed that this reward is a two-valued (binary) random variable whose distribution as a function of the state and the control action is unknown, but that the sequence of states actually visited by the Markov chain is available. In other words, we consider a Markov chain whose dynamics and reward structure are unknown but whose state is observed exactly.
S. Lakshmivarahan
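A rough sketch of this control setting: one action-probability vector is kept for each state of the chain and is updated from the observed binary reward. The two-state chain, the reward probabilities, and the simple reward-inaction update are all illustrative assumptions; the book's treatment uses the absolutely expedient algorithms of Chapter 3.

```python
import random

# Illustrative controlled Markov chain with 2 states and 2 actions.
# P[x][u] is the (unknown to the controller) next-state distribution and
# r[x][u] the (unknown) probability of receiving a reward for action u in state x.
P = {0: {0: [0.7, 0.3], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [0.9, 0.1]}}
r = {0: {0: 0.8, 1: 0.3},
     1: {0: 0.4, 1: 0.7}}

def sample(probs):
    return 0 if random.random() < probs[0] else 1

# One action-probability vector per state; the controller observes only the
# current state and the binary reward, never P or r themselves.
p = {0: [0.5, 0.5], 1: [0.5, 0.5]}
theta, x = 0.05, 0
for k in range(100_000):
    u = sample(p[x])
    rewarded = random.random() < r[x][u]
    if rewarded:                                 # reward-inaction update in state x only
        p[x] = [q + theta * (1 - q) if a == u else q * (1 - theta)
                for a, q in enumerate(p[x])]
    x = sample(P[x][u])                          # the chain moves to its next state
print(p)
```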

Epilogue

Epilogue

Abstract
Having completed eight chapters, it is now time to ask: "Are these the only results, on both the theory and the applications, that are known in the literature?" Before answering this question we wish to remind the reader of the perpetual interaction (or feedback) between theory and application, in the sense that an older theory gives rise to a new application, which in turn gives rise to newer theories, and so on. In other words, it is extremely difficult, in general, to pin down where a theory ends and an application begins. With this in mind, we wish to emphasize that almost all of the basic results on the convergence theory of the various classes of learning algorithms known at the time of this writing are given in Part I. On the application side, however, our coverage is by no means complete. The choice of topics for inclusion in Part II was essentially dictated by our familiarity.
S. Lakshmivarahan

Backmatter
