
Computational inference has taken its place alongside asymptotic inference and exact techniques in the standard collection of statistical methods. Computational inference is based on an approach to statistical methods that uses modern computational power to simulate distributional properties of estimators and test statistics. This book describes computationally intensive statistical methods in a unified presentation, emphasizing techniques, such as the PDF decomposition, that arise in a wide range of methods.

The book assumes an intermediate background in mathematics, computing, and applied and theoretical statistics. The first part of the book, consisting of a single long chapter, reviews this background material while introducing computationally intensive exploratory data analysis and computational inference.

The six chapters in the second part of the book are on statistical computing. This part describes arithmetic in digital computers and how the nature of digital computations affects algorithms used in statistical methods. Building on the first chapters on numerical computations and algorithm design, the following chapters cover the main areas of statistical numerical analysis, that is, approximation of functions, numerical quadrature, numerical linear algebra, solution of nonlinear equations, optimization, and random number generation.

The third and fourth parts of the book cover methods of computational statistics, including Monte Carlo methods, randomization and cross validation, the bootstrap, probability density estimation, and statistical learning.

The book includes a large number of exercises with some solutions provided in an appendix.

### 1. Mathematical and Statistical Preliminaries

Abstract
The purpose of an exploration of data may be rather limited, and it may be ad hoc, or the purpose may be more general, perhaps to gain understanding of some natural phenomenon. The questions addressed in the data exploration may be somewhat open-ended. The process of understanding often begins with general questions about the structure of the data. At any stage of the analysis, our understanding is facilitated by means of a model. A model is a description that embodies our current understanding of a phenomenon. In an operational sense, we can formulate a model either as a description of a data-generating process, or as a prescription for processing data. The model is often expressed as a set of equations that relate data elements to each other. It may include probability distributions for the data elements. If any of the data elements are considered to be realizations of random variables, the model is a stochastic model.

A model should not limit our analysis; rather, the model should be able to evolve. The process of understanding involves successive refinements of the model. The refinements proceed from vague models to more specific ones. An exploratory data analysis may begin by mining the data to identify interesting properties. These properties generally raise questions that are to be explored further.

A class of models may have a common form within which the members of the class are distinguished by values of parameters. For example, the class of normal probability distributions has a single form of a probability density function that has two parameters. Within this family of probability distributions, these two parameters completely characterize the distributional properties. If this form of model is chosen to represent the properties of a dataset, we may seek confidence intervals for values of the two parameters or perform statistical tests of hypothesized values of these two parameters.
In models that are not as mathematically tractable as the normal probability model—and many realistic models are not—we may need to use computationally intensive methods involving simulations, resamplings, and multiple views to make inferences about the parameters of a model. These methods are part of the field of computational statistics.
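The two-parameter normal model mentioned above can be made concrete with a short sketch. The pure-Python fragment below is an illustration, not an example from the text: the data are hypothetical, and the z quantile 1.96 is used in place of the exact t quantile for brevity.

```python
import statistics

# Hypothetical sample assumed to be drawn from a normal distribution
data = [4.9, 5.3, 4.7, 5.1, 5.6, 4.8, 5.2, 5.0, 5.4, 4.6]

n = len(data)
xbar = statistics.mean(data)    # estimate of the location parameter (mean)
s = statistics.stdev(data)      # estimate of the scale parameter (std. dev.)

# Normal-theory 95% confidence interval for the mean; with n = 10 an
# exact interval would use the t quantile rather than z = 1.96
half_width = 1.96 * s / n ** 0.5
ci = (xbar - half_width, xbar + half_width)
print(f"mean = {xbar:.3f}, sd = {s:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

A statistical test of a hypothesized mean would use the same two estimates, comparing the standardized difference to the same quantile.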
James E. Gentle

### 2. Computer Storage and Arithmetic

Abstract
Data represent information at various levels. The form of the data, whether numbers, characters, or picture elements, provides different perspectives. Data of whatever form are represented by groups of 0s and 1s, called bits, a contraction of “binary digits”. (The word was coined by John Tukey.) For representing simple text (that is, strings of characters with no special representation), the bits are usually taken in groups of eight, called bytes, or in groups of sixteen, and associated with a specific character according to a fixed coding rule. Because of the common association of a byte with a character, those two words are often used synonymously.

For representing characters in bytes, “ASCII” (pronounced “askey”, from American Standard Code for Information Interchange) was the first standard code widely used. At first only English letters, Arabic numerals, and a few marks of punctuation had codes. Gradually more and more symbols were given codified representations. Also, because the common character sets differ from one language to another (both natural languages and computer languages), there are several modifications of the basic ASCII code set. When there is a need for more different characters than can be represented in a byte (2^8 = 256), codes that associate characters with larger groups of bits are necessary. For compatibility with the commonly used ASCII codes using groups of 8 bits, these codes usually are for groups of 16 bits. These “16-bit characters” are useful for representing characters in some East Asian languages, for example. The Unicode Consortium has developed a 16-bit standard, called Unicode, that is widely used for representing characters from a variety of languages. For any ASCII character, the Unicode representation uses eight leading 0s followed by the same eight bits as the ASCII representation.
An important consideration in the choice of a method to represent data is the way data are communicated within a computer and between the computer and peripheral components such as data storage units. Data are usually treated as a fixed-length sequence of bits. The basic grouping of bits in a computer is sometimes called a “word” or a “storage unit”. The lengths of words or storage units commonly used in computers are 32 or 64 bits.
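The relationship between the 8-bit ASCII pattern and the 16-bit Unicode pattern described above is easy to verify. A minimal Python sketch for one character:

```python
# The 16-bit Unicode (UCS-2) form of an ASCII character is eight leading
# zeros followed by the same eight bits as the ASCII representation.
c = "A"
code = ord(c)                  # code point of "A": 65 in both codes
bits8 = format(code, "08b")    # 8-bit ASCII pattern
bits16 = format(code, "016b")  # 16-bit Unicode pattern
print(bits8)                   # 01000001
print(bits16)                  # 0000000001000001
assert bits16 == "0" * 8 + bits8
```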
James E. Gentle

### 3. Algorithms and Programming

Abstract
We will use the term “algorithm” rather loosely, but always in the general sense of a method or a set of instructions for doing something. Formally, an “algorithm” must terminate. Sometimes we may describe a procedure that, if the steps in our description were followed literally, might not terminate. Whether we expressly say so or not, there should always be a check on the number of steps, so that the algorithm terminates after some large number of steps no matter what. Algorithms are sometimes distinguished as “numerical”, “seminumerical”, and “nonnumerical”, depending on the extent to which operations on real numbers are simulated.
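The termination requirement can be enforced mechanically with a cap on the number of steps. The sketch below (a generic fixed-point iteration, my own illustration rather than an example from the text) always stops, either at convergence or when the cap is reached:

```python
import math

def fixed_point(g, x0, tol=1e-10, max_steps=1000):
    """Iterate x <- g(x) until successive values agree to within tol,
    but always stop after max_steps so the procedure is a true algorithm."""
    x = x0
    for step in range(max_steps):
        x_new = g(x)
        if abs(x_new - x) < tol:
            return x_new, step + 1
        x = x_new
    raise RuntimeError(f"no convergence within {max_steps} steps")

# g(x) = cos(x) has a unique fixed point near 0.739
root, steps = fixed_point(math.cos, 1.0)
```

Without the `max_steps` guard, a divergent choice of `g` would turn the description into a procedure that never terminates.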
James E. Gentle

### 4. Approximation of Functions and Numerical Quadrature

Abstract
The standard treatment of orthogonal polynomials is Szegő (1958), in which several other systems are described and more properties of orthogonal polynomials are discussed. A general reference on multivariate orthogonal polynomials is Dunkl and Xu (2001). One type of orthogonal system that I mentioned, but did not discuss, is wavelets. For this I refer the reader to Walter and Ghorai (1992) or to Vidakovic (2004).
De Boor (2002) provides a comprehensive development of splines and an extensive discussion of their properties. The emphasis is on B-splines, and he gives several Fortran routines for using B-splines and other splines. A good introduction to multivariate splines is given by Chui (1988).
Evans and Schwartz (2000) provide a good summary of methods for numerical quadrature, including both the standard deterministic methods of numerical analysis and Monte Carlo methods. The most significant difficulties in numerical quadrature occur in multiple integration. The papers in the book edited by Flournoy and Tsutakawa (1991) provide good surveys of specific methods, especially ones with important applications in statistics.
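The contrast between deterministic quadrature and Monte Carlo integration can be seen on a one-dimensional example. The sketch below (my own illustration, not from the text) approximates the integral of e^(-x) over [0, 1], whose exact value is 1 - e^(-1); the seed and sample sizes are arbitrary choices.

```python
import math
import random

def f(x):
    return math.exp(-x)

# Deterministic: composite trapezoidal rule on [0, 1]
n = 1000
h = 1.0 / n
trap = h * (0.5 * f(0.0) + sum(f(i * h) for i in range(1, n)) + 0.5 * f(1.0))

# Monte Carlo: the integral equals E[f(U)] for U ~ Uniform(0, 1)
rng = random.Random(12345)
m = 100_000
mc = sum(f(rng.random()) for _ in range(m)) / m

exact = 1.0 - math.exp(-1.0)
```

In one dimension the trapezoidal rule is far more accurate per function evaluation; the advantage of the Monte Carlo estimator is that its error rate does not deteriorate as the dimension of the integral grows, which is why the difficulties of multiple integration favor it.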
James E. Gentle

### 5. Numerical Linear Algebra

Abstract
Computational issues in linear algebra are covered more completely in Čížková and Čížek (2004), in Part III of Gentle (2007), and in Golub and Van Loan (1996). Computational methods for sparse matrices are discussed in some detail in Saad (2003).
Mathematical software for linear algebra has traditionally been some of the best software, from the earlier days when libraries in various programming languages were widely distributed to the interpretive systems that allowed direct manipulation of vectors and matrices. Currently, there are several relatively mature interactive systems, including Matlab and Octave from an applied mathematics heritage, and S-Plus and R that emphasize statistical applications. There continues to be a need for specialized software for very large linear systems or for rather specialized applications. There are many libraries for these needs in Fortran or C/C++ that are freely available. A list of such software is maintained at
The R system is widely used by statisticians. This system provides a wide range of operations for vectors and matrices. It treats vectors as a special kind of list and matrices as a special kind of rectangular array. One consequence of this arises in writing functions that accept matrices: a vector will not be accepted as if it were a matrix with one dimension equal to 1. A vector has a length attribute but not a dim attribute.
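The vector/matrix distinction has analogues in most languages. The Python sketch below is an analogy, not R code: it represents a matrix as a list of row lists, and a function written for that representation fails on a flat "vector", just as an R function expecting a dim attribute fails when handed a vector.

```python
# Analogy to R's distinction: a vector has a length but no dim attribute,
# so it is not interchangeable with an n x 1 matrix.
v = [1.0, 2.0, 3.0]            # "vector": a flat sequence of length 3
m = [[1.0], [2.0], [3.0]]      # "matrix": 3 rows, each with 1 column

def nrow(x):
    """Rows of a matrix represented as a list of row lists."""
    return len(x)

def ncol(x):
    """Columns of a matrix represented as a list of row lists."""
    return len(x[0])

assert nrow(m) == 3 and ncol(m) == 1
try:
    ncol(v)                    # v[0] is a float, not a row: fails
except TypeError:
    matrix_only = True
```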
James E. Gentle

### 6. Solution of Nonlinear Equations and Optimization

Abstract
One of the most widely encountered specialized optimization problems is the linear programming problem, along with related problems in network optimization. Griva, Nash, and Sofer (2008) describe methods for such problems.
Stochastic optimization is discussed in some detail by Spall (2004). De Jong (2006) describes the basic ideas of evolutionary computation and how the methods can be used in a variety of optimization problems.
Many of the relevant details of numerical optimization for statistical applications are discussed by Rustagi (1994) and by Gentle (2009). The EM method for optimization is covered in some detail by Ng, Krishnan, and McLachlan (2004). The EM method itself was first described and analyzed systematically by Dempster, Laird, and Rubin (1977).
Most of the comprehensive scientific software packages such as the IMSL Libraries, Matlab, and R have functions or separate modules for solution of systems of nonlinear equations and for optimization.
The R function uniroot (which is zbrent in the IMSL Libraries) is based on an algorithm of Richard Brent that uses a combination of linear interpolation, inverse quadratic interpolation, and bisection to find a root of a univariate function in an interval whose endpoints evaluate to values with different signs. The R function polyroot (which is zpolrc or zpolcc in the IMSL Libraries) is based on the Jenkins-Traub algorithm to find the roots of a univariate polynomial.
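Brent's algorithm safeguards its interpolation steps with bisection, and bisection alone already illustrates the bracketing idea that uniroot relies on. A minimal pure-Python sketch (my own illustration, not the Brent or IMSL code):

```python
import math

def bisect(f, a, b, tol=1e-12, max_steps=200):
    """Find a root of f in [a, b], where f(a) and f(b) have opposite signs,
    by repeatedly halving the bracketing interval."""
    fa, fb = f(a), f(b)
    if fa * fb > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    for _ in range(max_steps):
        c = 0.5 * (a + b)
        fc = f(c)
        if fc == 0.0 or (b - a) < tol:
            return c
        if fa * fc < 0:            # root lies in [a, c]
            b, fb = c, fc
        else:                      # root lies in [c, b]
            a, fa = c, fc
    return 0.5 * (a + b)

# Root of cos(x) - x = 0 in [0, 1], near 0.7391
root = bisect(lambda x: math.cos(x) - x, 0.0, 1.0)
```

Brent's method keeps this guaranteed convergence while using interpolation steps to converge much faster when the function is well behaved.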
James E. Gentle

### 7. Generation of Random Numbers

Abstract
Monte Carlo simulation is a core technology in computational statistics. Monte Carlo methods require numbers that appear to be realizations of random variables. Obtaining these numbers is the process called “generation of random numbers”. Our objective is usually not to generate a truly random sample. Deep understanding of the generation process and strict reproducibility of any application involving the “random” numbers are more important. We often emphasize this perspective by the word “pseudorandom”, although almost anytime we use a phrase similar to “generation of random numbers”, we refer to “pseudorandom” numbers. The quality of a process for random number generation is measured by the extent to which the sample generated appears, from every imaginable perspective, to be a random sample (that is, i.i.d.) from a given probability distribution. Some methods of random number generation are better than others.
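Strict reproducibility is obtained in practice by seeding the generator explicitly. A minimal Python illustration (the seed values are arbitrary):

```python
import random

def simulate(seed, n=5):
    """Draw n pseudorandom numbers from a private, explicitly seeded
    generator, so the sequence is exactly reproducible."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# The same seed reproduces exactly the same "random" sample ...
assert simulate(42) == simulate(42)
# ... and a different seed gives a different, equally reproducible one
assert simulate(42) != simulate(43)
```

The determinism that makes the numbers only pseudorandom is precisely what makes a simulation study auditable and repeatable.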
James E. Gentle

### 8. Graphical Methods in Computational Statistics

Abstract
One of the first steps in attempting to understand data is to visualize it. Visualization of data and information provides a wealth of tools that can be used in detecting features, in discovering relationships, and finally in retaining the knowledge gained. Graphical displays have always been an important part of statistical data analysis, but with the continuing developments in high-speed computers and high-resolution devices, the usefulness of graphics has greatly increased. Higher resolution makes for a more visually pleasing display, and occasionally it allows features to be seen that could not be distinguished otherwise. The most important effects of the computer on graphical methods in statistics, however, arise from the ease and speed with which graphical displays can be produced, rather than from the resolution. Rapid production of graphical displays has introduced motion and articulated projections and sections into statistical graphics. Such graphical methods are important tools of computational statistics. The multiple views are tools of discovery, not just ways of displaying a set of data that has already been analyzed.

Although the power of graphical displays has greatly increased, some of the most useful graphs are the simple ones, as illustrated in Section 1.1, and they should not be ignored just because we can do more impressive things. Proper design of a graphical display depends on the context of the application and the purpose of the graphics, whether it is for the analyst to get a better understanding of the data or to present a picture that conveys a message. Our emphasis in the following discussion is on methods useful in exploratory graphics.

One thing that is lagging in the statistical literature is the use of color in graphical displays. The simple mechanics used in producing popular magazines are yet to be incorporated in the production of learned journals. Journals available in electronic form do not have these production problems, and it is likely that the paper versions of many journals will be discontinued before the production of color is mastered.
James E. Gentle

### 9. Tools for Identification of Structure in Data

Abstract
In recent years, with our increased ability to collect and store data, have come enormous datasets. These datasets may consist of billions of observations and millions of variables. Some of the classical methods of statistical inference, in which a parametric model is studied, are neither feasible nor relevant for analysis of these datasets. The objective is to identify interesting structures in the data, such as clusters of observations, or relationships among the variables. Sometimes, the structures allow a reduction in the dimensionality of the data. Many of the classical methods of multivariate analysis, such as principal components analysis, factor analysis, canonical correlations analysis, and multidimensional scaling, are useful in identifying interesting structures. These methods generally attempt to combine variables in such a way as to preserve information yet reduce the dimension of the dataset. Dimension reduction generally carries a loss of some information. Whether the lost information is important is the major concern in dimension reduction. Another set of methods for reducing the complexity of a dataset attempts to group observations together, combining observations, as it were.
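In the bivariate case, principal components analysis reduces to the eigendecomposition of a 2 x 2 covariance matrix, which has a closed form. The sketch below (hypothetical data; pure Python) computes the component variances and the proportion of total variance captured by the first component, the quantity that tells us how much information a one-dimensional reduction would preserve.

```python
import math

# Hypothetical bivariate data in which x2 is roughly 2 * x1
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n

# Sample covariance matrix [[sxx, sxy], [sxy, syy]]
sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)

# Eigenvalues of the symmetric 2 x 2 matrix: the principal-component variances
tr, det = sxx + syy, sxx * syy - sxy ** 2
disc = math.sqrt(tr * tr / 4 - det)
lam1, lam2 = tr / 2 + disc, tr / 2 - disc

# Proportion of total variance captured by the first principal component
prop = lam1 / (lam1 + lam2)
```

For these strongly correlated data the first component carries essentially all of the variance, so reducing from two dimensions to one loses almost no information.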
James E. Gentle

### 10. Estimation of Functions

Abstract
An interesting problem in statistics, and one that is generally difficult, is the estimation of a continuous function, such as a probability density function or a nonlinear regression model. The statistical properties of an estimator of a function are more complicated than statistical properties of an estimator of a single parameter or of a countable set of parameters. In Chapter 4 we discussed ways of numerically approximating functions. In this brief chapter we will discuss ways of statistically estimating functions. Many of these methods are based on approximation methods such as orthogonal systems, splines, and kernels discussed in Chapter 4. The PDF decomposition plays an important role in the estimation of functions.
James E. Gentle

### 11. Monte Carlo Methods for Statistical Inference

Abstract
Monte Carlo methods are experiments. Monte Carlo experimentation is the use of simulated random numbers to estimate some functional of a probability distribution. A problem that does not have a stochastic component can sometimes be posed as a problem with a component that can be identified with an expectation of some function of a random variable. This is often done by means of a PDF decomposition. The problem is then solved by estimating the expected value by use of a simulated sample from the distribution of the random variable. Monte Carlo methods use random numbers, so to implement a Monte Carlo method it is necessary to have a source of random numbers. On the computer, we generally settle for pseudorandom numbers, that is, numbers that appear to be random but are actually deterministic. Generation of pseudorandom numbers is the topic of Chapter 7. Often, our objective is not to simulate random sampling directly, but rather to estimate a specific quantity related to the distribution of a given sample. In this case, we may want to ensure that a chosen sample closely reflects the distribution of the population we are simulating. Because of random variation, a truly random sample or a pseudorandom sample that simulates a random sample would not necessarily have this property. Sometimes, therefore, we generate a quasirandom sample, which is a sample constrained to reflect closely the distribution of the population we are simulating, rather than to exhibit the variability that would result from random sampling. Because in either case we proceed to treat the samples as if they were random, we will refer to both pseudorandom numbers and quasirandom numbers as “random numbers”, except when we wish to emphasize the “pseudo” or “quasi” nature. In this chapter, we discuss various ways random numbers are used in statistical inference. Monte Carlo methods are also used in many of the techniques described in other chapters.
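The basic pattern is to estimate an expectation by a sample mean of simulated values. The sketch below (my own illustration) estimates E[X^2] = 1 for X ~ N(0, 1); the seed and sample size are arbitrary choices.

```python
import random

# Monte Carlo estimate of E[X^2] for X ~ N(0, 1); the true value is 1
rng = random.Random(2025)
n = 200_000
est = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) / n
```

The standard error of such an estimator decreases as the square root of the sample size, so each additional decimal digit of accuracy costs a hundredfold increase in simulation effort.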
James E. Gentle

### 12. Data Randomization, Partitioning, and Augmentation

Abstract
Although subsampling, resampling, or otherwise rearranging a given dataset cannot increase its information content, these procedures can sometimes be useful in extracting information. Randomly rearranging the observed dataset, for example, can give an indication of how unusual the dataset is with respect to a given null hypothesis. This idea leads to randomization tests. There are many useful procedures for data analysis that involve partitioning the original sample. Using subsets of the full sample, we may be able to get an estimate of the bias or the variance of the standard estimator or test statistic without relying too heavily on the assumptions that led to that choice of estimator or test statistic. It is often useful to partition the dataset into two parts and use the data in the “training set” or “estimation set” to arrive at a preliminary estimate or fit and then use the data in the “validation set” or “test set” to evaluate the fit. This kind of approach is particularly appropriate when we are unsure of our model of the data-generating process. In actual applications, of course, we are always at least somewhat unsure of our model. If the full dataset is used to fit the model, we are very limited in the extent to which we can validate the model. No matter what we do with a given dataset, we are still left with uncertainty about the relevance of that dataset to future modeling problems. Prediction requires some assumption about the model of the data-generating processes, both the one that yielded the given data and the unseen one for which inferences are to be made. The variance of predictors is called prediction error or generalization error. Obviously, since this involves unseen data and unknown scenarios, there is no way of measuring or estimating this kind of error with any confidence. 
The use of partitions of the given dataset, however, is one way of getting some feel for the generalization error for scenarios that are somewhat similar to the one that gave rise to the observed dataset. Subsets of the data can be formed systematically or they can be formed as random samples from the given dataset. Sometimes the given dataset is viewed as a set of mass points of a finite distribution whose distribution function is the same as the empirical distribution function of the given dataset. In this case, the data partitioning may be done in such a way that observations may occur multiple times in the various partitions. In most cases, when we speak of “sets” or “subsets”, we mean “multisets” (that is, collections in which items may occur more than once and are distinguished in each occurrence).
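A randomization test is one concrete use of rearranging the observed dataset. The sketch below (hypothetical data; the add-one p-value convention is one common choice) shuffles the pooled sample to approximate the null distribution of a difference in means:

```python
import random

def permutation_test(x, y, n_perm=10_000, seed=7):
    """Two-sample randomization test for a difference in means.
    Under the null hypothesis the group labels are exchangeable, so we
    repeatedly shuffle the pooled data and recompute the statistic."""
    rng = random.Random(seed)
    observed = sum(x) / len(x) - sum(y) / len(y)
    pooled = x + y
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        d = (sum(pooled[:len(x)]) / len(x)
             - sum(pooled[len(x):]) / len(y))
        if abs(d) >= abs(observed):
            count += 1
    return (count + 1) / (n_perm + 1)   # add-one p-value estimate

# Hypothetical samples with clearly different means
p = permutation_test([5.1, 6.0, 5.8, 6.3, 5.5], [3.9, 4.2, 3.6, 4.4, 4.0])
```

The p-value here is the proportion of random rearrangements that are at least as extreme as the observed arrangement, which is exactly the "how unusual is this dataset under the null hypothesis" question posed above.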
James E. Gentle

### 13. Bootstrap Methods

Abstract
Resampling methods involve the use of many samples, each taken from a single sample that was taken from the population of interest. Inference based on resampling makes use of the conditional sampling distribution of a new sample (the “resample”) drawn from a given sample. Statistical functions on the given sample, a finite set, can easily be evaluated. Resampling methods therefore can be useful even when very little is known about the underlying distribution. A basic idea in bootstrap resampling is that, because the observed sample contains all the available information about the underlying population, the observed sample can be considered to be the population; hence, the distribution of any relevant test statistic can be simulated by using random samples from the “population” consisting of the original sample.
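The sketch below illustrates the basic idea: resample with replacement from the observed sample, recompute the statistic on each resample, and use the spread of the replicates as an estimate of the statistic's standard error (hypothetical data; the number of replicates and the seed are arbitrary choices).

```python
import random

def bootstrap_se(data, stat, n_boot=2000, seed=17):
    """Bootstrap estimate of the standard error of a statistic:
    treat the sample as the population and resample with replacement."""
    rng = random.Random(seed)
    n = len(data)
    reps = []
    for _ in range(n_boot):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        reps.append(stat(resample))
    mean_rep = sum(reps) / n_boot
    var = sum((r - mean_rep) ** 2 for r in reps) / (n_boot - 1)
    return var ** 0.5

data = [2.3, 3.1, 2.8, 4.0, 3.5, 2.9, 3.7, 3.2]
se_mean = bootstrap_se(data, lambda s: sum(s) / len(s))
```

For the sample mean this bootstrap estimate can be checked against the textbook formula s/sqrt(n); the value of the method is that the same code works unchanged for statistics with no such closed-form standard error.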
James E. Gentle

### 14. Estimation of Probability Density Functions Using Parametric Models

Abstract
Many of the common distributions arise from first principles. For example, an exponential distribution results from the definition of a Poisson process; a normal distribution arises from axioms about transformations or, from a different perspective, from central limit theorems. Other distributions, such as the gamma or the t, are derived from those basic distributions.
Definitions of general, flexible families of distributions are motivated by modeling applications. The parameters of these families act as tuning parameters that control skewness and kurtosis or other general properties of a distribution.
For the Johnson family of distributions, Chou et al. (1994) identify the distributional classes and the appropriate transformations for a number of well-known distributions. Slifker and Shapiro (1980) describe a method for selection of the particular Johnson family based on ratios of quantiles of the density to be fitted. Chou et al. (1994) give a method for fitting Johnson curves using quantiles. Devroye (1986) describes a method for simulating variates from a Johnson family.
Albert, Delampady, and Polasek (1991) defined another family of distributions that is very similar to the lambda distributions with proper choice of the parameters. The family of distributions of Albert, Delampady, and Polasek is particularly useful in Bayesian analysis with location-scale models.
Solka, Poston, and Wegman (1995) describe visualization methods to accompany EM methods for estimation of the mixture parameter.
Everitt and Hand (1981) provide a good general discussion of the use of finite mixtures for representing distributions. Solka et al. (1998) describe how fitting mixtures to observed data can provide insight about the underlying structure of the data. Roeder and Wasserman (1997) describe the use of mixtures in Bayesian density estimation.
James E. Gentle

### 15. Nonparametric Estimation of Probability Density Functions

Abstract
Newman and Barkema (1999) discuss data structures for working with hexagonal grids and a related type of grid formed by a Kagomé lattice, a tessellation composed of hexagons and twice as many triangles, in which the hexagons meet at their vertices and the spaces between three hexagons form the triangles.
General discussions of tessellations are given by Conway and Sloane (1999) (particularly Chapter 2 of that book) and by Okabe et al. (2000). Conway and Sloane (1982) give algorithms for binning data into lattices of various types for dimensions from 2 to 8.
The orthogonal series estimators generally are based on a smoothing of a histogram; thus, the smoothing parameter is the bin size. In addition to the orthogonal series we discussed, other orthogonal systems can be used in density estimation. Walter and Ghorai (1992) describe the use of wavelets in density estimation, and discuss some of the problems in their use. Vidakovic (2004) also discusses density estimation and other applications of wavelets.
Kooperberg and Stone (1991) describe use of splines to smooth the histogram. Kim et al. (1999) propose use of the cumulative histogram and fitting it using Bézier curves.
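The histogram itself is the simplest of these estimators, with the bin width as the smoothing parameter. A minimal pure-Python sketch (my own illustration, applied to simulated Uniform(0, 1) data, for which the true density is 1 on the interval):

```python
import random

def histogram_density(data, a, b, n_bins):
    """Histogram estimate of a density on [a, b): bin counts scaled so
    the estimate integrates to 1. The bin width (b - a) / n_bins is the
    smoothing parameter."""
    h = (b - a) / n_bins
    counts = [0] * n_bins
    for x in data:
        if a <= x < b:
            counts[min(int((x - a) / h), n_bins - 1)] += 1
    n = len(data)
    return [c / (n * h) for c in counts], h

rng = random.Random(3)
sample = [rng.uniform(0.0, 1.0) for _ in range(10_000)]
dens, h = histogram_density(sample, 0.0, 1.0, 10)

# The estimate integrates to one, and for Uniform(0, 1) each bar is near 1
total = sum(d * h for d in dens)
```

The smoothing methods cited above (splines, Bézier curves, orthogonal series) can all be viewed as ways of replacing the piecewise-constant bars of `dens` with a smooth curve.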
James E. Gentle

### 16. Statistical Learning and Data Mining

Abstract
A major objective in data analysis is to identify interesting features or structure in the data. In this chapter, we consider the use of some of the tools and measures discussed in Chapters 9 and 10 to identify interesting structure. The graphical methods discussed in Chapter 8 are also very useful in discovering structure, but we do not consider those methods further in the present chapter. There are basically two ways of thinking about “structure”. One has to do with counts of observations. In this approach, patterns in the density are the features of interest. We may be interested in whether the density is multimodal, whether it is skewed, whether there are holes in the density, and so on. The other approach seeks to identify relationships among the variables. The two approaches are related in the sense that if there are relationships among the variables, the density of the observations is higher in regions in which the relationships hold. Relationships among variables are generally not exact, and the relationships are identified by the higher density of observations that exhibit the approximate relationships. An important kind of pattern in data is a relationship to time. Often, even though data are collected at different times, the time itself is not represented by a variable on the dataset. A simple example is one in which the data are collected sequentially at roughly equal intervals. In this case, the index of the observations may serve as a surrogate variable. Consider the small univariate dataset in Table 16.1, for example. A static view of a histogram of these univariate data, as in Figure 16.1, shows a univariate bimodal dataset. Figure 16.2, however, in which the data are plotted against the index (by rows in Table 16.1), shows a completely different structure. The data appear to be sinusoidal with an increasing frequency. 
The sinusoidal responses at roughly equal sampling intervals result in a bimodal static distribution, which is the structure seen in the histogram. Interesting structure may also be groups or clusters of data based on some measure of similarity, as discussed in Section 9.2 beginning on page 383. When there are separate groups in the data, but the observations do not contain an element or an index variable representing group membership, identifying nearby elements or clusters in the data requires some measure of similarity (or, equivalently, of dissimilarity).
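The phenomenon is easy to reconstruct in simulation (the data below are simulated, not the data of Table 16.1): responses from a sinusoid with increasing frequency, observed at equally spaced indices, pile up near the two extremes, so the static distribution of the values is bimodal even though a plot against the index is clearly sinusoidal.

```python
import math

# Simulated responses: a sinusoid whose frequency increases with the index
n = 400
y = [math.sin(0.002 * i * i) for i in range(n)]

# Counts near the extremes versus near the center: the static
# distribution of the values is bimodal (roughly arcsine-shaped)
lo = sum(1 for v in y if v < -0.6)
hi = sum(1 for v in y if v > 0.6)
mid = sum(1 for v in y if -0.2 < v < 0.2)
```

A sinusoid spends most of its time near its extremes, which is why the histogram of the values has modes near -1 and +1 while the time structure is invisible in the static view.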
James E. Gentle

### 17. Statistical Models of Dependencies

Abstract
In the models and data-generating processes we have considered in previous chapters of Part IV, all of the variables or features were treated essentially in the same way. In this chapter, we consider models in which a subset of variables, often just one, is of special interest. This variable is the “response”, and we seek to understand its dependence on the other variables. “Dependence” here refers to a stochastic relationship, not a causal one. By knowing the relationship of the other variables to the response variable, we may better understand the data-generating process, or we may be able to predict the response, given the values of the associated variables. The models we consider in this chapter describe the stochastic behavior of one variable, Y , possibly a vector, as a function of other variables. Models of this type that express dependencies are called regression models if Y is a numeric variable or classification models if Y is a categorical variable. If Y is a numeric variable that takes on only a countable number of values, the model can be considered either a regression model (sometimes a “generalized model”) or a classification model.
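The simplest model of this kind, a single numeric response depending linearly on one covariate, can be fit in a few lines. The sketch below (hypothetical data) computes the least-squares estimates of the regression coefficients:

```python
# Simple linear regression: model the stochastic dependence of a numeric
# response y on a single covariate x by least squares (hypothetical data)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)

beta1 = sxy / sxx              # slope: change in E[y] per unit change in x
beta0 = my - beta1 * mx        # intercept
fitted = [beta0 + beta1 * xi for xi in x]
```

A classification model has the same structure, except that the response is categorical and the fitted value is a predicted class (or class probability) rather than a predicted numeric value.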
James E. Gentle