
2011 | Book

Probability for Statistics and Machine Learning

Fundamentals and Advanced Topics


About this book

This book provides a versatile and lucid treatment of classic as well as modern probability theory, integrating it with core topics in statistical theory and some key tools in machine learning. It is written in an extremely accessible style, with elaborate motivating discussions and numerous worked-out examples and exercises. The book has 20 chapters on a wide range of topics, 423 worked-out examples, and 808 exercises. It is unique in its unification of probability and statistics, its coverage, its superb exercise sets and detailed bibliography, and its substantive treatment of many topics of current importance.

This book can be used as a text for a year-long graduate course in statistics, computer science, or mathematics, for self-study, and as an invaluable research reference on probability and its applications. Particularly worth mentioning are the treatments of distribution theory, asymptotics, simulation and Markov Chain Monte Carlo, Markov chains and martingales, Gaussian processes, VC theory, probability metrics, large deviations, bootstrap, the EM algorithm, confidence intervals, maximum likelihood and Bayes estimates, exponential families, kernels, and Hilbert spaces, and a self-contained, complete review of univariate probability.

Table of Contents

Frontmatter
Chapter 1. Review of Univariate Probability
Abstract
Probability is a universally accepted tool for expressing degrees of confidence or doubt about some proposition in the presence of incomplete information or uncertainty. By convention, probabilities are calibrated on a scale of 0 to 1; assigning something a probability of zero amounts to considering it impossible, whereas assigning it a probability of one amounts to considering it a certainty. Most propositions fall somewhere in between. The probability statements that we make can be based on our past experience or on our personal judgments. Whether they are based on past experience or on subjective personal judgments, they obey a common set of rules, which we can use to treat probabilities in a mathematical framework, to make decisions and predictions, to understand complex systems, or as intellectual experiments and for entertainment. Probability theory is one of the most applicable branches of mathematics. It is used as the primary tool for analyzing statistical methodologies; it is used routinely in nearly every branch of science, such as biology, astronomy and physics, medicine, economics, chemistry, sociology, ecology, finance, and many others. A background in the theory, models, and applications of probability is almost a part of basic education. That is how important it is.
Anirban DasGupta
Chapter 2. Multivariate Discrete Distributions
Abstract
We have provided a detailed overview of distributions of one discrete or one continuous random variable in the previous chapter. But often in applications, we are naturally interested in two or more random variables simultaneously. We may be interested in them simultaneously because they provide information about each other, or because they arise simultaneously as part of the data in some scientific experiment. For instance, on a doctor’s visit, the physician may check someone’s blood pressure, pulse rate, blood cholesterol level, and blood sugar level, because together they give information about the general health of the patient. In such cases, it becomes essential to know how to operate with many random variables simultaneously. This is done by using joint distributions. Joint distributions naturally lead to considerations of marginal and conditional distributions. We study joint, marginal, and conditional distributions for discrete random variables in this chapter. The concepts of these various distributions for continuous random variables are no different, but the techniques are mathematically more sophisticated. The continuous case is treated in the next chapter.
Anirban DasGupta
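As a small illustration of the joint, marginal, and conditional machinery described in the Chapter 2 abstract, here is a minimal Python sketch. It is not from the book; the joint pmf values below are hypothetical, chosen only so the arithmetic is easy to follow.

```python
# Minimal sketch (not from the book): joint, marginal, and conditional
# distributions for two discrete random variables X and Y.
# The joint pmf values are hypothetical, for illustration only.

joint_pmf = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.40,
}

# Marginal pmf of X: sum the joint pmf over all values of Y.
marginal_x = {}
for (x, y), p in joint_pmf.items():
    marginal_x[x] = marginal_x.get(x, 0.0) + p

# Conditional pmf of Y given X = 1: joint probability divided by the marginal.
cond_y_given_x1 = {
    y: p / marginal_x[1] for (x, y), p in joint_pmf.items() if x == 1
}

print(marginal_x)        # approximately {0: 0.3, 1: 0.7}
print(cond_y_given_x1)   # approximately {0: 0.43, 1: 0.57}
```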
Chapter 3. Multidimensional Densities
Abstract
As with several discrete random variables, in applications we are frequently interested in studying several continuous random variables simultaneously. And as in the case of one continuous random variable, we do not speak of pmfs of several continuous variables, but of a pdf, defined jointly for all the continuous random variables. The joint density function completely characterizes the joint distribution of the full set of continuous random variables. We refer to the entire set of random variables as a random vector. Both the calculation and the application aspects of multidimensional density functions are generally sophisticated. As such, the ability to use and operate with multidimensional densities is among the most important skills one needs in probability and also in statistics. The general concepts and calculations are discussed in this chapter. Some special multidimensional densities are introduced separately in later chapters.
Anirban DasGupta
Chapter 4. Advanced Distribution Theory
Abstract
Studying distributions of functions of several random variables is of primary interest in probability and statistics. For example, the original variables X_1, X_2, …, X_n could be the inputs into some process or system, and we may be interested in the output, which is some suitable function of these input variables. Sums, products, and quotients are special functions that arise quite naturally in applications. These are discussed with a special emphasis in this chapter, although the general theory is also presented. Specifically, we present the classic theory of polar transformations and the Helmert transformation in arbitrary dimensions, and the development of the Dirichlet, t- and the F-distribution. The t- and the F-distribution arise in numerous problems in statistics, and the Dirichlet distribution has acquired an extremely special role in modeling and also in Bayesian statistics. In addition, these techniques and results are among the most sophisticated parts of distribution theory.
Anirban DasGupta
Chapter 5. Multivariate Normal and Related Distributions
Abstract
The multivariate normal distribution is the natural extension of the bivariate normal to the case of several jointly distributed random variables. Dating back to the works of Galton, Karl Pearson, Edgeworth, and later Ronald Fisher, the multivariate normal distribution has occupied the central place in modeling jointly distributed continuous random variables. There are several reasons for its special status. Its mathematical properties show a remarkable amount of intrinsic structure; the properties are extremely well studied; statistical methodologies in common use often have their best or optimal performance when the variables are distributed as multivariate normal; and there is the multidimensional central limit theorem and its various consequences, which imply that many kinds of functions of independent random variables are approximately normally distributed, in some suitable sense. We present some of the multivariate normal theory and facts, with examples, in this chapter.
Anirban DasGupta
Chapter 6. Finite Sample Theory of Order Statistics and Extremes
Abstract
The ordered values of a sample of observations are called the order statistics of the sample, and the smallest and the largest are called the extremes. Order statistics and extremes are among the most important functions of a set of random variables that we study in probability and statistics. There is natural interest in studying the highs and lows of a sequence, and the other order statistics help in understanding the concentration of probability in a distribution, or equivalently, the diversity in the population represented by the distribution. Order statistics are also useful in statistical inference, where estimates of parameters are often based on some suitable functions of the order statistics. In particular, the median is of very special importance. There is a well-developed theory of the order statistics of a fixed number n of observations from a fixed distribution, as well as an asymptotic theory where n goes to infinity. We discuss the case of fixed n in this chapter. The distribution theory for order statistics when the observations come from a discrete distribution is complex, both notationally and algebraically, because several observations can be exactly equal. These ties among the sample values make the distribution theory cumbersome. We therefore concentrate on the continuous case.
Anirban DasGupta
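For orientation, a standard result of the fixed-n theory that Chapter 6 covers (a well-known textbook formula, stated here for reference rather than quoted from the book): if X_1, …, X_n are i.i.d. observations from a continuous distribution with cdf F and density f, then the density of the k-th order statistic X_(k) is

\[
f_{X_{(k)}}(x) \;=\; \frac{n!}{(k-1)!\,(n-k)!}\; F(x)^{k-1}\,\bigl(1 - F(x)\bigr)^{n-k}\, f(x).
\]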
Chapter 7. Essential Asymptotics and Applications
Abstract
Asymptotic theory is the study of how distributions of functions of a set of random variables behave when the number of variables becomes large. One practical context for this is statistical sampling, when the number of observations taken is large. Distributional calculations in probability are typically such that exact calculations are difficult or impossible. For example, one of the simplest possible functions of n variables is their sum, and yet in most cases we cannot find the distribution of the sum for fixed n in an exact closed form. But the central limit theorem allows us to conclude that in some cases the sum will behave as a normally distributed random variable when n is large. Similarly, the role of general asymptotic theory is to provide approximate answers where exact solutions are unavailable, in many types of problems, often very complicated ones. The nature of the theory is such that the approximations have remarkable unity of character, and indeed nearly unreasonable unity of character. Asymptotic theory is the single most unifying theme of probability and statistics. In statistics in particular, nearly every method or rule or tradition has its root in some result in asymptotic theory. No other branch of probability and statistics has such an incredibly rich body of literature, tools, and applications, in amazingly diverse areas and problems. Skills in asymptotics are nearly indispensable for a serious statistician or probabilist.
Anirban DasGupta
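A minimal simulation sketch (not from the book) of the central limit phenomenon mentioned in the Chapter 7 abstract: standardized sums of i.i.d. Uniform(0, 1) variables are compared with a standard normal by checking a tail probability. The sample sizes and repetition counts are arbitrary choices made for illustration.

```python
# Minimal sketch (not from the book): the central limit theorem in action.
# Standardized sums of n i.i.d. Uniform(0,1) variables behave approximately
# like a standard normal when n is large.
import math
import random

random.seed(0)
n, reps = 50, 20000
mu, sigma = 0.5, math.sqrt(1.0 / 12.0)   # mean and sd of Uniform(0,1)

count = 0
for _ in range(reps):
    s = sum(random.random() for _ in range(n))
    z = (s - n * mu) / (sigma * math.sqrt(n))   # standardized sum
    if z > 1.96:
        count += 1

# For a standard normal, P(Z > 1.96) is about 0.025.
print(count / reps)
```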
Chapter 8. Characteristic Functions and Applications
Abstract
Characteristic functions were first systematically studied by Paul Lévy, although they were used by others before him. The characteristic function provides an extremely powerful tool in probability in general, and in asymptotic theory in particular. Its power derives from a set of highly convenient properties. Like the mgf, it determines a distribution. But unlike mgfs, existence is not an issue, and it is a bounded function. It is easily transportable for common functions of random variables, such as convolutions. And it can be used to prove convergence of distributions, as well as to recognize the name of a limiting distribution. It is also an extremely handy tool in proving characterizing properties of distributions. For instance, the Cramér–Lévy theorem (see Chapter 1), which characterizes a normal distribution, has so far been proved only by using characteristic function methods. There are two disadvantages in working with characteristic functions. First, the characteristic function is, in general, complex-valued, and so familiarity with basic complex analysis is required. Second, characteristic function proofs usually do not lead to any intuition as to why a particular result should be true. All things considered, knowledge of basic characteristic function theory is essential for statisticians, and certainly for students of probability.
Anirban DasGupta
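For orientation, the characteristic function of a random variable X is defined as below; the normal case, a standard textbook fact stated here for reference rather than quoted from the book, illustrates why it is so convenient (it always exists and is bounded by one in absolute value):

\[
\varphi_X(t) \;=\; E\bigl[e^{itX}\bigr], \quad t \in \mathbb{R};
\qquad
X \sim N(\mu, \sigma^2) \;\Longrightarrow\; \varphi_X(t) = e^{\,it\mu \,-\, \sigma^2 t^2/2}.
\]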
Chapter 9. Asymptotics of Extremes and Order Statistics
Abstract
We discussed the importance of order statistics and sample percentiles in detail in Chapter 6. The exact distribution theory of one or several order statistics was presented there. Although closed-form in principle, the expressions are complicated, except in some special cases, such as the uniform and the exponential. However, once again it turns out that just like sample means, order statistics and sample percentiles also have a very structured asymptotic distribution theory. We present a selection of the fundamental results on the asymptotic theory for order statistics and sample percentiles, including the sample extremes. Principal references for this chapter are David (1980), Galambos (1987), Serfling (1980), Reiss (1989), de Haan (2006), and DasGupta (2008); other references are given in the sections. First, we recall some notation for convenience.
Anirban DasGupta
Chapter 10. Markov Chains and Applications
Abstract
In many applications, successive observations of a process, say, X_1, X_2, …, have an inherent time component associated with them. For example, the X_i could be the state of the weather at a particular location on the ith day, counting from some fixed day. In a simplistic model, the state of the weather could be “dry” or “wet”, quantified as, say, 0 and 1. It is hard to believe that in such an example the sequence X_1, X_2, … could be mutually independent. The question then arises of how to model the dependence among the X_i. Probabilists have numerous dependency models. A particular model that has earned a very special status is called a Markov chain. In a Markov chain model, we assume that the future, given the entire past and the present state of a process, depends only on the present state. In the weather example, suppose we want to assign a probability that tomorrow, say March 10, will be dry, and suppose that we have available to us the precipitation history for each day from January 1 to March 9. The Markov chain model would entail that our probability that March 10 will be dry depends only on the state of the weather on March 9, even though the entire past precipitation history is available to us. As simple as it sounds, Markov chains are enormously useful in applications, perhaps more than any other specific dependency model. They are also independently relevant to statistical computing in very important ways. The topic has an incredibly rich and well-developed theory, with links to many other topics in probability theory. Familiarity with basic Markov chain terminology and theory is often considered essential for anyone interested in studying statistics and probability. We present an introduction to basic Markov chain theory in this chapter.
Anirban DasGupta
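A minimal sketch (not from the book) of the dry/wet weather chain described in the Chapter 10 abstract: a hypothetical 2x2 transition matrix, a one-step prediction from today's state, and the long-run proportion of dry days obtained by iterating the chain. The transition probabilities are made up for illustration.

```python
# Minimal sketch (not from the book): a two-state Markov chain for weather,
# with states 0 = "dry" and 1 = "wet". The transition probabilities below
# are hypothetical, chosen only for illustration.

# P[i][j] = probability of moving from state i today to state j tomorrow.
P = [
    [0.8, 0.2],   # dry today -> dry/wet tomorrow
    [0.4, 0.6],   # wet today -> dry/wet tomorrow
]

# One-step prediction: if March 9 was wet (state 1), the Markov assumption
# says P(March 10 is dry) depends only on that state, and equals P[1][0].
print(P[1][0])   # 0.4

# Long-run behavior: iterate the distribution pi_{t+1} = pi_t P.
pi = [1.0, 0.0]                    # start from a dry day
for _ in range(200):
    pi = [pi[0] * P[0][0] + pi[1] * P[1][0],
          pi[0] * P[0][1] + pi[1] * P[1][1]]
print(pi)   # approaches the stationary distribution (2/3, 1/3)
```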
Chapter 11. Random Walks
Abstract
We have already encountered the simple random walk a number of times in the previous chapters. Random walks occupy an extremely important place in probability because of their numerous applications, and because of their theoretical connections, in suitable limiting paradigms, to other important random processes in time. Random walks are used to model the value of stocks in economics, the movement of a particle suspended in a liquid medium, animal movements in ecology, diffusion of bacteria, movement of ions across cells, and numerous other processes that manifest random movement in time in response to some external stimuli. Random walks are indirectly of interest in various areas of statistics, such as sequential statistical analysis and testing of hypotheses. They also help a student of probability simply to understand randomness itself better.
Anirban DasGupta
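A minimal sketch (not from the book) of the simple symmetric random walk discussed in Chapter 11: steps of +1 or -1 with equal probability, tracking the walker's position over time. The number of steps is an arbitrary choice for illustration.

```python
# Minimal sketch (not from the book): a simple symmetric random walk.
# At each step the walker moves +1 or -1 with probability 1/2 each.
import random

random.seed(1)

def random_walk(n_steps):
    """Return the positions S_0, S_1, ..., S_n of the walk, with S_0 = 0."""
    position = 0
    path = [0]
    for _ in range(n_steps):
        position += random.choice((-1, 1))
        path.append(position)
    return path

path = random_walk(1000)
print(path[-1])                  # final position S_1000
print(max(path), min(path))      # the high and the low of the path
```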
Chapter 12. Brownian Motion and Gaussian Processes
Abstract
We started this text with discussions of a single random variable. We then proceeded to two, and more generally, a finite number of random variables. In the last chapter, we treated the random walk, which involved a countably infinite number of random variables, namely the positions S_n of the random walk at times n = 0, 1, 2, 3, …. The time parameter n for the random walks we discussed in the last chapter belongs to the set of nonnegative integers, which is a countable set. We now look at a special continuous-time stochastic process, which corresponds to an uncountable family of random variables, indexed by a time parameter t belonging to a suitable uncountable time set T. The process we mainly treat in this chapter is Brownian motion, although some other Gaussian processes are also treated briefly.
Anirban DasGupta
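A minimal sketch (not from the book) approximating a standard Brownian motion path on [0, 1], as treated in Chapter 12: cumulative sums of independent Gaussian increments whose variance equals the time step. The grid size is an arbitrary choice for illustration.

```python
# Minimal sketch (not from the book): approximate a standard Brownian motion
# on [0, 1] by summing independent N(0, dt) increments on a fine time grid.
import math
import random

random.seed(2)
n = 1000
dt = 1.0 / n

W = [0.0]                                            # W(0) = 0
for _ in range(n):
    increment = random.gauss(0.0, math.sqrt(dt))     # N(0, dt) increment
    W.append(W[-1] + increment)

print(W[-1])     # W(1), approximately N(0, 1)
print(max(W))    # running maximum of the approximate path
```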
Chapter 13. Poisson Processes and Applications
Abstract
A single theme that binds together a number of important probabilistic concepts and distributions, and is at the same time a major tool for the applied probabilist and the applied statistician, is the Poisson process. The Poisson process is a probabilistic model of situations where events occur completely at random at intermittent times, and we wish to study the number of times the particular event has occurred up to a specific time instant, or perhaps the waiting time till the next event, and so on. Some simple examples are receiving phone calls at a telephone call center, receiving an e-mail from someone, arrival of a customer at a pharmacy or some other store, catching a cold, occurrence of earthquakes, mechanical breakdown in a computer or some other machine, and so on. There is no end to how many examples we can think of, where an event happens, then nothing happens for a while, and then it happens again, and it keeps going like this, apparently at random. It is therefore not surprising that the Poisson process is such a valuable tool in the probabilist’s toolbox. It is also a fascinating feature of Poisson processes that they are connected in various interesting ways to a number of special distributions, including the Poisson, exponential, Gamma, Beta, uniform, binomial, and the multinomial. These embracing connections and wide applications make the Poisson process a very special topic in probability.
Anirban DasGupta
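A minimal sketch (not from the book) of a homogeneous Poisson process, the model described in the Chapter 13 abstract: independent exponential inter-arrival times with rate lam, from which one can read off both the event times and the count N(T) up to a time horizon. The rate and horizon are hypothetical values chosen for illustration.

```python
# Minimal sketch (not from the book): simulate a homogeneous Poisson process
# of rate lam on [0, T] via independent Exponential(lam) inter-arrival times.
import random

random.seed(3)
lam, T = 2.0, 10.0    # hypothetical rate and time horizon

event_times = []
t = random.expovariate(lam)
while t <= T:
    event_times.append(t)
    t += random.expovariate(lam)

# N(T) = number of events by time T; it has a Poisson(lam * T) distribution.
print(len(event_times))    # typically near lam * T = 20
print(event_times[:3])     # the first few arrival times
```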
Chapter 14. Discrete Time Martingales and Concentration Inequalities
Abstract
For an independent sequence of random variables X_1, X_2, …, the conditional expectation of the present term of the sequence given the past terms is the same as its unconditional expectation. Martingales let the conditional expectation depend on the past terms, but in a special way. Thus, similar to Markov chains, martingales act as natural models for incorporating dependence into a sequence of observed data. But the value of the theory of martingales is much more than simply its modeling value. Martingales arise as natural byproducts of the mathematical analysis in an amazing variety of problems in probability and statistics. Therefore, results from martingale theory can be immediately applied to all these situations in order to make deep and useful conclusions about numerous problems in probability and statistics. A particular modern set of applications of martingale methods is in the area of concentration inequalities, which place explicit bounds on probabilities of large deviations of functions of a set of variables from their mean values. This chapter gives a glimpse into some important concentration inequalities, and explains how martingale theory enters there. Martingales form a nearly indispensable tool for probabilists and statisticians alike.
Anirban DasGupta
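As an example of the kind of concentration inequality the Chapter 14 abstract refers to, the Azuma–Hoeffding inequality (a standard result stated here for orientation, not quoted from the book): if \(M_0, M_1, \ldots\) is a martingale whose differences satisfy \(\vert M_i - M_{i-1}\vert \leq c_i\), then for every t > 0,

\[
P\bigl(\lvert M_n - M_0\rvert \geq t\bigr) \;\leq\; 2\exp\!\left(-\,\frac{t^2}{2\sum_{i=1}^{n} c_i^{2}}\right).
\]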
Chapter 15. Probability Metrics
Abstract
The central limit theorem is a very good example of approximating a potentially complicated exact distribution by a simpler and easily computable approximate distribution. In mathematics, whenever we do an approximation, we like to quantify the error of the approximation. Common sense tells us that an error should be measured by some notion of distance between the exact and the approximate. Therefore, when we approximate one probability distribution (measure) by another, we need a notion of distances between probability measures. Fortunately, we have an abundant supply of distances between probability measures. Some of them are for probability measures on the real line, and others for probability measures on a general Euclidean space. Still others work in more general spaces. These distances on probability measures have other independent uses besides quantifying the error of an approximation. We provide a basic treatment of some common distances on probability measures in this chapter. Some of the distances have the so-called metric property, and they are called probability metrics, whereas some others satisfy only the weaker notion of being a distance. Our choice of which metrics and distances to include was necessarily subjective.
Anirban DasGupta
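Two of the most commonly used distances of the kind described in the Chapter 15 abstract, stated here for orientation (standard definitions, not quoted from the book): the total variation distance between probability measures P and Q on a common sigma-field \(\mathcal{A}\), and the Kolmogorov distance between cdfs F and G on the real line:

\[
d_{TV}(P, Q) \;=\; \sup_{A \in \mathcal{A}} \lvert P(A) - Q(A)\rvert,
\qquad
d_{K}(F, G) \;=\; \sup_{x \in \mathbb{R}} \lvert F(x) - G(x)\rvert.
\]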
Chapter 16. Empirical Processes and VC Theory
Abstract
Like martingales, empirical processes also unify an incredibly large variety of problems in probability and statistics. Results in empirical processes theory are applicable to numerous classic and modern problems in probability and statistics; a few examples of applications are the study of central limit theorems in more general spaces than Euclidean spaces, the bootstrap, goodness of fit, density estimation, and machine learning. Familiarity with the basic theory of empirical processes is extremely useful across fields in probability and statistics.
Anirban DasGupta
Chapter 17. Large Deviations
Abstract
The mean μ of a random variable X is arguably the most common one-number summary of the distribution of X. Although averaging is a primitive concept with some natural appeal, the mean μ is a useful summary only when the random variable X is concentrated around the mean μ, that is, when probabilities of large deviations from the mean are small. The most basic large deviation inequality is Chebyshev’s inequality, which says that if X has a finite variance σ², then \(P(\vert X - \mu \vert > k\sigma) \leq \frac{1}{k^2}\). But usually this inequality is not strong enough in specific applications, in the sense that the assurance we seek is much stronger than what Chebyshev’s inequality gives us.
Anirban DasGupta
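A small numerical sketch (not from the book) of the point made in the Chapter 17 abstract: for a standard normal, Chebyshev's bound 1/k² is far larger than the exact two-sided tail probability, which is what motivates the sharper, exponentially small large-deviation bounds studied in the chapter.

```python
# Minimal sketch (not from the book): Chebyshev's bound versus the exact
# two-sided tail of a standard normal, P(|Z| > k) = erfc(k / sqrt(2)).
import math

for k in (2, 3, 4):
    chebyshev = 1.0 / k**2
    exact = math.erfc(k / math.sqrt(2.0))
    print(k, chebyshev, exact)

# k = 2: bound 0.25    vs exact ~0.0455
# k = 3: bound ~0.111  vs exact ~0.0027
# k = 4: bound 0.0625  vs exact ~0.0000633
```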
Chapter 18. The Exponential Family and Statistical Applications
Abstract
The exponential family is a practically convenient and widely used unified family of distributions on finite-dimensional Euclidean spaces, parametrized by a finite-dimensional parameter vector. Specialized to the case of the real line, the exponential family contains as special cases most of the standard discrete and continuous distributions that we use for practical modeling, such as the normal, Poisson, binomial, exponential, Gamma, multivariate normal, and so on. The reason for the special status of the exponential family is that a number of important and useful calculations in statistics can be done all at one stroke within the framework of the exponential family. This generality contributes to both convenience and larger-scale understanding. The exponential family is the usual testing ground for the large spectrum of results in parametric statistical theory that require notions of regularity or Cramér–Rao regularity. In addition, the unified calculations in the exponential family have an element of mathematical neatness. Distributions in the exponential family have been used in classical statistics for decades. However, the family has recently gained additional importance due to its use in, and appeal to, the machine learning community. A fundamental treatment of the general exponential family is provided in this chapter. Classic expositions are available in Barndorff-Nielsen (1978), Brown (1986), and Lehmann and Casella (1998). An excellent recent treatment is available in Bickel and Doksum (2006).
Anirban DasGupta
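For orientation, the canonical (natural) form of a k-parameter exponential family density, a standard representation stated here with generic notation that is not necessarily that of the book:

\[
f(x \mid \eta) \;=\; h(x)\,\exp\!\Bigl(\eta^{\top} T(x) \,-\, A(\eta)\Bigr),
\qquad \eta \in \mathbb{R}^k,
\]

where T(x) is the vector of natural sufficient statistics and A(η) is the log-partition (normalizing) function.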
Chapter 19. Simulation and Markov Chain Monte Carlo
Abstract
Simulation is a computer-based exploratory exercise that aids in understanding how the behavior of a random or even a deterministic process changes in response to changes in input or the environment. It is essentially the only option left when exact mathematical calculations are impossible, or require an amount of effort that the user is not willing to invest. Even when the mathematical calculations are quite doable, a preliminary simulation can be very helpful in guiding the researcher to theorems that were not a priori obvious or conjectured, and also in identifying the more productive corners of a particular problem. Although simulation in itself is a machine-based exercise, credible simulation must be based on appropriate theory. A simulation algorithm must be theoretically justified before we use it. This chapter gives a fairly broad introduction to the classic theory and techniques of probabilistic simulation, and also to some of the modern advances in simulation, particularly Markov chain Monte Carlo (MCMC) methods based on ergodic Markov chain theory.
Anirban DasGupta
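A minimal MCMC sketch (not from the book) in the spirit of Chapter 19: a random-walk Metropolis algorithm targeting a standard normal density. The proposal scale, chain length, and burn-in are hypothetical choices made only for illustration.

```python
# Minimal sketch (not from the book): random-walk Metropolis sampling from a
# standard normal target. The proposal standard deviation is hypothetical.
import math
import random

random.seed(4)

def log_target(x):
    # Log of the (unnormalized) standard normal density.
    return -0.5 * x * x

x = 0.0
samples = []
for _ in range(50000):
    proposal = x + random.gauss(0.0, 1.0)        # symmetric random-walk proposal
    log_accept = log_target(proposal) - log_target(x)
    # Metropolis acceptance step: accept with probability min(1, ratio).
    if log_accept >= 0 or random.random() < math.exp(log_accept):
        x = proposal
    samples.append(x)

burned = samples[5000:]                          # discard a burn-in period
mean = sum(burned) / len(burned)
var = sum((s - mean) ** 2 for s in burned) / len(burned)
print(mean, var)    # should be close to 0 and 1
```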
Chapter 20. Useful Tools for Statistics and Machine Learning
Abstract
As much as we would like to have analytical solutions to important problems, it is a fact that many of them are simply too difficult to admit closed-form solutions. Common examples of this phenomenon are finding exact distributions of estimators and statistics, computing the value of an exact optimum procedure, such as a maximum likelihood estimate, and numerous combinatorial algorithms of importance in computer science and applied probability. Unprecedented advances in computing power and availability have inspired creative new methods and algorithms for solving old problems; often, these new methods are better than what we had in our toolbox before. This chapter provides a glimpse into a few selected computing tools and algorithms that have had a significant impact on the practice of probability and statistics, specifically, the bootstrap, the EM algorithm, and the use of kernels for smoothing and modern statistical classification. The treatment is meant to be introductory, with references to more advanced parts of the literature.
Anirban DasGupta
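A minimal sketch (not from the book) of the nonparametric bootstrap mentioned in the Chapter 20 abstract: resample the data with replacement many times and use the spread of the resampled statistics as a standard-error estimate. The data values are made up for illustration.

```python
# Minimal sketch (not from the book): nonparametric bootstrap estimate of the
# standard error of the sample median. The data values are hypothetical.
import random
import statistics

random.seed(5)
data = [2.3, 1.9, 3.1, 2.8, 4.0, 2.2, 3.5, 1.7, 2.9, 3.3]

B = 2000
boot_medians = []
for _ in range(B):
    resample = [random.choice(data) for _ in data]   # sample with replacement
    boot_medians.append(statistics.median(resample))

# Bootstrap standard error: the standard deviation of the resampled medians.
print(statistics.stdev(boot_medians))
```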
Backmatter
Metadata
Title
Probability for Statistics and Machine Learning
Author
Anirban DasGupta
Copyright Year
2011
Publisher
Springer New York
Electronic ISBN
978-1-4419-9634-3
Print ISBN
978-1-4419-9633-6
DOI
https://doi.org/10.1007/978-1-4419-9634-3