1997 | Book

Stochastic Approximation Algorithms and Applications

Authors: Harold J. Kushner, G. George Yin

Publisher: Springer New York

Book series: Stochastic Modelling and Applied Probability

About this book

In recent years algorithms of the stochastic approximation type have found applications in new and diverse areas, and new techniques have been developed for proofs of convergence and rate of convergence. The actual and potential applications in signal processing have exploded. New challenges have arisen in applications to adaptive control. This book presents a thorough coverage of the ODE method used to analyze these algorithms.

Table of Contents

Frontmatter
1. Introduction: Applications and Issues
Abstract
This is the first of three chapters describing many concrete applications of stochastic approximation. The emphasis is on the problem description. Proofs of convergence and the derivation of the rate of convergence will be given in subsequent chapters for many of the examples. Since the initial work of Robbins and Monro in 1951, there has been a steady increase in the investigations of applications in many diverse areas, and this has accelerated in recent years, with new applications arising in queueing networks and manufacturing systems and in learning problems and neural nets, among others. We present only selected samples of these applications to illustrate the great breadth. The basic stochastic approximation algorithm is nothing but a stochastic difference equation with a small step size, and the basic questions for analysis concern its qualitative behavior over a long time interval, such as convergence and rate of convergence. The wide range of applications leads to a wide variety of such equations and associated stochastic processes.
Harold J. Kushner, G. George Yin
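To make "a stochastic difference equation with a small step size" concrete, here is a minimal Robbins–Monro sketch in Python; the regression function g, the noise model, and all constants are assumptions invented for illustration, not taken from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_observation(theta):
    # Y_n = g(theta_n) + noise, with g unknown to the algorithm.
    # Here g(theta) = -(theta - 2.0), so the root being sought is 2.0.
    return -(theta - 2.0) + rng.normal(scale=0.5)

theta = 0.0
for n in range(1, 10_001):
    eps_n = 1.0 / n                           # decreasing step size
    theta += eps_n * noisy_observation(theta)

print(theta)  # settles near the root 2.0
```

The long-run qualitative behavior of exactly such recursions, their convergence and its rate, is what the analysis chapters address.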
2. Applications to Learning, State Dependent Noise, and Queueing
Abstract
This chapter deals with more specific classes of examples, which are of increasing importance in current applications in many areas of technology. They are described in somewhat more detail than the examples of Chapter 1 are, and the illustration(s) given for each class are typical of those in a rapidly increasing literature. Section 1 deals with a problem in learning theory: the learning of an optimal hunting strategy by an animal, based on the history of successes and failures in repeated attempts to feed itself efficiently. Section 2 concerns the “learning” or “training” of a neural network. In the training phase, a random series of inputs is presented to the network, and there is a desirable response to each input. The problem is to adjust the weights in the network to minimize the average distance between the actual and desired responses. This is done by a training procedure, where the weights are adjusted after each (input, output) pair is observed. Loosely speaking, the increments in the weights are proportional to stochastic estimates of the derivative of the error with respect to the weights. Section 3 deals with an optimization problem for a controlled Markov chain model, where the transition probabilities are not known and one wishes to learn the optimal strategy in the course of the system’s operation.
Harold J. Kushner, G. George Yin
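As a sketch of the training procedure described above, consider a toy linear "network" trained one (input, output) pair at a time; the data model, dimensions, and step size are invented for illustration. Each weight increment is proportional to a stochastic estimate of the error gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
w = np.zeros(3)                       # network weights to be trained
w_true = np.array([1.0, -2.0, 0.5])   # generates the desired responses

eps = 0.05                            # constant step size
for n in range(5000):
    x = rng.normal(size=3)            # random input presented to the network
    d = w_true @ x                    # desired response to this input
    y = w @ x                         # actual response of the linear "network"
    grad = (y - d) * x                # gradient of 0.5*(y - d)**2 w.r.t. w
    w -= eps * grad                   # adjust weights after each pair

print(w)  # close to w_true
```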
3. Applications in Signal Processing and Adaptive Control
Abstract
Adaptive control and communication theory provide numerous examples of the practical and effective use of stochastic approximation-type algorithms, and there is a large amount of literature concerning them. Only a few references will be listed. Despite the identity in form, whether the algorithms are actually called stochastic approximations depends on the traditions of the field, but the most effective techniques of proofs are often based on those of stochastic approximation. Our intention is to give some of the flavor of the applications. Proofs of convergence for some cases will appear in Chapter 9. Perhaps the most basic problem is the recursive estimation of the coefficients (parameters) of a linear system, when the input sequence is known, but only noise corrupted observations of its output can be measured. The output might simply be a weighted sum of the inputs, a weighted sum of the inputs and recent outputs, or a weighted sum of the inputs and recent outputs plus a weighted sum of the recent elements of a white noise sequence. The forms of the algorithms are essentially those of the pattern classification or “minimum mean squares” estimation problem of Example 3 in Section 1.1, although the noise terms might be correlated. In practice, the values of the parameters to be estimated often change with time, and the basic algorithm must be adjusted to allow “tracking.” A constant step size ε is generally used, which leads to the nontrivial problem of the choice of ε.
Harold J. Kushner, G. George Yin
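A minimal sketch of the recursive parameter-estimation problem, assuming the simplest case in which the output is a weighted sum of recent inputs (an LMS-type update with constant step size ε; the system coefficients and noise level are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
h_true = np.array([0.8, -0.4, 0.2])      # unknown system coefficients
theta = np.zeros(3)                      # running estimate
eps = 0.02                               # constant step size; permits "tracking"

x = np.zeros(3)                          # window of the most recent inputs
for n in range(20_000):
    u = rng.normal()                     # known input
    x = np.concatenate(([u], x[:-1]))    # shift the newest input into the window
    y = h_true @ x + 0.1 * rng.normal()  # noise-corrupted output observation
    e = y - theta @ x                    # prediction error
    theta += eps * e * x                 # LMS-type correction

print(theta)  # close to h_true, up to a small misadjustment set by eps
```

A larger ε tracks parameter drift faster but leaves a larger residual error, which is the nontrivial trade-off mentioned above.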
4. Mathematical Background
Abstract
In this chapter we collect several mathematical ideas that will be frequently used in the sequel. The martingale process, discussed in Section 1, is one of the basic processes in stochastic analysis. It appears frequently as a “noise” term in the decomposition of the stochastic approximation algorithm, which will be used to facilitate the analysis. This was already apparent in many examples in Chapters 1 and 2, where the noise had the martingale difference property. Our method of analysis of the asymptotic properties of the stochastic approximation process involves showing that the asymptotic paths are those of the ordinary differential equation (ODE) determined by the “mean” dynamics of the algorithm. Some basic facts concerning ODEs and the Liapunov function method of proving their stability are discussed in Section 2. Section 3 concerns ODEs in which the dynamics are projected onto a compact constraint set. The solutions of such ODEs will appear as “limits” of the paths of constrained stochastic approximation algorithms. In Section 4, the stochastic analog of the Liapunov function method for proving stability is discussed. When the iterates in the stochastic approximation algorithms are unconstrained, such stability results will prove useful for showing that the paths are actually bounded.
Harold J. Kushner, G. George Yin
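A small numerical illustration of the ODE point of view (a toy, with all dynamics and constants assumed for the example): on the interpolated time scale t_n = nε, the stochastic approximation path shadows the solution of the mean ODE θ̇ = ḡ(θ).

```python
import numpy as np

rng = np.random.default_rng(3)

def g_bar(theta):
    # "mean" dynamics: theta_dot = -theta + 1, with stable point 1.0
    return -theta + 1.0

eps, n_steps = 0.01, 2000
theta_sa, theta_ode = 0.0, 0.0
for n in range(n_steps):
    # stochastic approximation step: mean dynamics plus noise
    theta_sa += eps * (g_bar(theta_sa) + rng.normal(scale=0.3))
    # Euler step of the mean ODE on the same time scale
    theta_ode += eps * g_bar(theta_ode)

print(theta_sa, theta_ode)  # both near the stable point 1.0
```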
5. Convergence with Probability One: Martingale Difference Noise
Abstract
Much of the classical work in stochastic approximation dealt with the situation where the “noise” in each observation \( Y_n \) is a martingale difference, that is, where there is a function \( g_n(\cdot) \) of θ such that \( E[Y_n \mid Y_i, i < n, \theta_0] = g_n(\theta_n) \) [17, 40, 45, 47, 56, 79, 86, 132, 154, 159, 169, 181]. Then we can write \( Y_n = g_n(\theta_n) + \delta M_n \), where \( \delta M_n \) is a martingale difference. This “martingale difference noise” model is still of considerable importance. It arises, for example, where \( Y_n \) has the form \( Y_n = F_n(\theta_n, \psi_n) \), where the \( \psi_n \) are mutually independent. The convergence theory is relatively easy in this case, because the noise terms can be dealt with by well-known and relatively simple probability inequalities for martingale sequences. This chapter is devoted to this martingale difference noise case. Nevertheless, the ODE, compactness, and stability techniques to be introduced are of basic importance for stochastic approximation, and will be used in subsequent chapters.
Harold J. Kushner, G. George Yin
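A sketch of the decomposition \( Y_n = g_n(\theta_n) + \delta M_n \) for the case \( Y_n = F_n(\theta_n, \psi_n) \) with i.i.d. \( \psi_n \); the particular F below (and hence g) is an assumption chosen so that the conditional mean is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(4)

def F(theta, psi):
    # observation Y_n = F(theta_n, psi_n), with psi_n i.i.d. N(0, 1)
    return -theta + psi

def g(theta):
    # g(theta) = E[F(theta, psi)]; closed form for this toy F
    return -theta

theta = 3.0
M = 0.0                        # M_n = sum_i eps_i * dM_i
for n in range(1, 5001):
    eps_n = 1.0 / n
    psi = rng.normal()
    Y = F(theta, psi)
    dM = Y - g(theta)          # martingale difference: E[dM | past] = 0
    M += eps_n * dM            # this weighted noise sum converges a.s.
    theta += eps_n * Y

print(theta, M)  # theta near the root 0; M settles to a (random) limit
```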
6. Convergence with Probability One: Correlated Noise
Abstract
The most powerful method in Chapter 5 was the “general compactness method” of Section 5.3. The basic assumption for that method is that the asymptotic “rate of change” of the martingale \( M^0(\cdot) \) is zero with probability one. It was shown in Subsection 5.3.2 that the rate of change condition is quite weak and yields convergence under what are probably close to the weakest conditions on both the noise and step size sequences, when the underlying noise is of the martingale difference form. In this chapter it is shown that the same approach yields excellent results for correlated noise. General assumptions of the type used in Section 5.3 are stated in Subsection 1.1, and the main convergence theorem is proved in a straightforward way in Subsection 1.2.
Harold J. Kushner, G. George Yin
7. Weak Convergence: Introduction
Abstract
Up to now, we have concentrated on the convergence of \( \{\theta_n\} \) or of \( \{\theta^n(\cdot)\} \) to an appropriate limit set with probability one. In this chapter, we work with a weaker type of convergence. In practical applications, this weaker type of convergence most often yields exactly the same information about the asymptotic behavior as the probability one methods. Yet the methods of proof are simpler (indeed, often substantially simpler), and the conditions are weaker and more easily verifiable. The weak convergence methods have considerable advantages when dealing with complicated problems, such as those involving correlated noise, state dependent noise processes, decentralized or asynchronous algorithms, and discontinuities in the algorithm. If probability one convergence is still desired, starting with a weak convergence argument can allow one to “localize” the probability one proof, thereby simplifying both the argument and the conditions that are needed. For example, the weak convergence proof might tell us that the iterates spend the great bulk of the time very near some point. Then a “local” method such as that for the “linearized” algorithm in Theorem 6.1.2 can be used. The basic ideas have many applications to problems in process approximation and for getting limit theorems for sequences of random processes.
Harold J. Kushner, G. George Yin
8. Weak Convergence Methods for General Algorithms
Abstract
The main results for the weak convergence of stochastic approximation algorithms are given in this chapter, using the ideas of Sections 7.3 and 7.4. The relative simplicity of the proofs and the weakness of the required conditions, in comparison with the probability one method, will be seen. Section 1 lists the main conditions used for both the martingale difference noise case and the correlated “exogenous” noise case, when the step size does not change with time. For a single process with fixed step size \( \varepsilon_n = \varepsilon \), there can only be convergence in distribution as n → ∞. With an arbitrarily high probability, for small enough ε the limit process is concentrated in an arbitrarily small neighborhood of some limit set of the limit mean ODE. This is a special case of the limit results in the theorem.
Harold J. Kushner, G. George Yin
9. Applications: Proofs of Convergence
Abstract
In this chapter, we apply the main techniques of Chapter 8 to examples from Chapters 2 and 3, and illustrate the flexibility and usefulness of the weak convergence approach. Sections 1 to 3 are concerned with function minimization, where the function is of the ergodic or average cost per unit time (over the infinite time interval) type. In such cases, one can only get estimates of derivatives over finite time intervals, and it needs to be shown that the averaging implicit in stochastic approximation yields the convergence of reasonable algorithms. In these sections, the results of Section 8.4 are applied to continuous time and discrete event dynamical systems that are of interest over a long time period. For example, one might wish to optimize or improve the average cost per unit time performance by sequentially adjusting the parameter θ. The variety of such applications and the literature connected with them are rapidly increasing. The examples will demonstrate the power of the basic ideas as well as the relative ease of using them in difficult problems. Section 1 is a general introduction to some of the key issues. The step sizes will generally be constant, but virtually the same conditions are needed for the case \( \varepsilon_n \to 0 \).
Harold J. Kushner, G. George Yin
10. Rate of Convergence
Abstract
The traditional definition of rate of convergence refers to the asymptotic properties of normalized errors about the limit point \( \bar\theta \). If \( \varepsilon_n = \varepsilon \) for the Robbins–Monro algorithm, it is concerned with the asymptotic properties of \( U_n^\varepsilon = (\theta_n^\varepsilon - \bar\theta)/\sqrt{\varepsilon} \) for large n and small ε.
Harold J. Kushner, G. George Yin
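A Monte Carlo sketch of these normalized errors for a linear constant-step scheme (model and constants invented for illustration); for this toy recursion the limit of \( U_n^\varepsilon \) is Gaussian with variance 1/2, the stationary variance of the limit process dU = -U dt + dw.

```python
import numpy as np

rng = np.random.default_rng(5)
eps, n_steps, n_runs = 0.01, 5000, 2000
theta_bar = 0.0                 # limit point of the mean ODE theta_dot = -theta

theta = np.ones(n_runs)         # n_runs independent replications
for n in range(n_steps):
    theta += eps * (-theta + rng.normal(size=n_runs))

U = (theta - theta_bar) / np.sqrt(eps)   # normalized errors U_n^eps
print(U.mean(), U.var())                 # mean near 0, variance near 0.5
```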
11. Averaging of the Iterates
Abstract
The difficulty of selecting a good step size sequence \( \{\varepsilon_n\} \) has been a serious handicap in applications. In a fundamental paper, Polyak and Juditsky [142] showed that (loosely speaking) if \( \varepsilon_n \) goes to zero more slowly than O(1/n), the averaged sequence \( \sum_{i=1}^{n} \theta_i / n \) converges to its limit at an optimum rate.
Harold J. Kushner, G. George Yin
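A sketch of iterate averaging under that condition, with \( \varepsilon_n = n^{-2/3} \), one instance of the standard family \( n^{-\gamma} \) with \( 1/2 < \gamma < 1 \); the observation model and constants are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
theta, avg = 1.0, 0.0
for n in range(1, 100_001):
    eps_n = n ** (-2.0 / 3.0)    # goes to zero more slowly than O(1/n)
    Y = -theta + rng.normal()    # noisy observation; the root is 0
    theta += eps_n * Y
    avg += (theta - avg) / n     # running average of the iterates

print(theta, avg)  # the average is typically much closer to the root
```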
12. Distributed/Decentralized and Asynchronous Algorithms
Abstract
This chapter is concerned with decentralized and asynchronous forms of the stochastic approximation algorithms, a relatively new area of research. Compared with the rapid progress and extensive literature in stochastic approximation methods, the study of parallel stochastic approximations is still in its infancy. Perhaps the first work on the subject was [14], which dealt with a very particular class of algorithms, where similar computations were done by several processors and convex combinations were taken. The general ideas of weak convergence theory were applied to a fairly broad class of realistic algorithms in [111, 112]. The general ideas presented there and in [105, 171] form the basis of this chapter. Analogously to the problems in Chapter 8, those methods can handle correlated and state dependent noise, delays in communication, and asynchronous and distributed network forms. Various examples are given in Section 1. In the basic model, there are several processors; each one is responsible for the updating of only a part of the parameter vector. There might be overlaps in that several processors contribute to the updating of the same component of the parameter. Such models were treated in [105, 171, 172]. For a similar model, the problem of finding zeros of a nonlinear function with noisy observations via parallel processing methods and with random truncation bounds was treated in [201]. An attempt to get real-time implementable procedures via pipelining (see Section 1.2) of communication and computation for algorithms with “delayed” observations was in [203].
Harold J. Kushner, G. George Yin
Backmatter
Metadata
Title
Stochastic Approximation Algorithms and Applications
Authors
Harold J. Kushner
G. George Yin
Copyright Year
1997
Publisher
Springer New York
Electronic ISBN
978-1-4899-2696-8
Print ISBN
978-1-4899-2698-2
DOI
https://doi.org/10.1007/978-1-4899-2696-8