
Every advance in computer architecture and software tempts statisticians to tackle numerically harder problems. To do so intelligently requires a good working knowledge of numerical analysis. This book equips students to craft their own software and to understand the advantages and disadvantages of different numerical methods. Issues of numerical stability, accurate approximation, computational complexity, and mathematical modeling share the limelight in a broad yet rigorous overview of those parts of numerical analysis most relevant to statisticians. In this second edition, the material on optimization has been completely rewritten. There is now an entire chapter on the MM algorithm in addition to more comprehensive treatments of constrained optimization, penalty and barrier methods, and model selection via the lasso. There is also new material on the Cholesky decomposition, Gram-Schmidt orthogonalization, the QR decomposition, the singular value decomposition, and reproducing kernel Hilbert spaces. The discussions of the bootstrap, permutation testing, independent Monte Carlo, and hidden Markov chains are updated, and a new chapter on advanced MCMC topics introduces students to Markov random fields, reversible jump MCMC, and convergence analysis in Gibbs sampling. Numerical Analysis for Statisticians can serve as a graduate text for a course surveying computational statistics. With a careful selection of topics and appropriate supplementation, it can be used at the undergraduate level. It contains enough material for a graduate course on optimization theory. Because many chapters are nearly self-contained, professional statisticians will also find the book useful as a reference.

### 1. Recurrence Relations

Abstract
Recurrence relations are ubiquitous in computational statistics and probability. Devising good recurrence relations is both an art and a science. One general theme is the alpha and omega principle; namely, most recurrences are derived by considering either the first or last event in a chain of events. The following examples illustrate this principle and some other commonly employed techniques.
Kenneth Lange
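
As a minimal illustration of the last-event idea (our example, not necessarily one of the chapter's), the binomial coefficient recurrence $\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$ conditions on whether the last of $n$ trials is a success:

```python
def binomial(n, k):
    """Compute C(n, k) by building Pascal's triangle row by row
    from the recurrence C(n, k) = C(n-1, k-1) + C(n-1, k)."""
    if k < 0 or k > n:
        return 0
    row = [1]                      # row 0 of Pascal's triangle
    for _ in range(n):
        row = [1] + [row[i] + row[i + 1] for i in range(len(row) - 1)] + [1]
    return row[k]
```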

### 2. Power Series Expansions

Abstract
Power series expansions are old friends of all workers in the mathematical sciences [4, 5, 10]. This chapter emphasizes special techniques for handling and generating the power series encountered in computational statistics. Most expansions can be phrased in terms of recurrence relations. Logarithmic differentiation is one powerful device for developing recurrences. Our applications of logarithmic differentiation to problems such as the conversion between moments and cumulants illustrate some of the interesting possibilities.
Kenneth Lange
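
For instance, logarithmic differentiation of the moment generating function yields the standard recurrence $m_n = \sum_{j=1}^{n} \binom{n-1}{j-1} \kappa_j\, m_{n-j}$ with $m_0 = 1$, converting cumulants $\kappa_j$ into raw moments $m_n$. A sketch (function name ours):

```python
from math import comb

def moments_from_cumulants(kappa):
    """Raw moments m_1..m_n from cumulants k_1..k_n via the recurrence
    m_n = sum_j C(n-1, j-1) * k_j * m_{n-j}, with m_0 = 1."""
    m = [1.0]                      # m_0 = 1
    for i in range(1, len(kappa) + 1):
        m.append(sum(comb(i - 1, j - 1) * kappa[j - 1] * m[i - j]
                     for j in range(1, i + 1)))
    return m[1:]

# Standard normal: cumulants (0, 1, 0, 0) give moments (0, 1, 0, 3).
ms = moments_from_cumulants([0.0, 1.0, 0.0, 0.0])
```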

### 3. Continued Fraction Expansions

Abstract
A continued fraction [2, 3, 4, 5] is a sequence of fractions
$$f_n = b_0 + \cfrac{a_1}{b_1 + \cfrac{a_2}{b_2 + \cfrac{a_3}{b_3 + \cdots + \cfrac{a_n}{b_n}}}}$$
(3.1)
formed from two sequences $a_1, a_2, \ldots$ and $b_0, b_1, \ldots$ of numbers. For typographical convenience, definition (3.1) is usually recast as
$$f_n = b_0 + \frac{a_1}{b_1 +} \frac{a_2}{b_2 +} \frac{a_3}{b_3 +} \ldots \frac{a_n}{b_n}.$$
In many practical examples, the approximant $f_n$ converges to a limit, which is typically written as
$$\lim\limits_{n \rightarrow \infty} f_n = b_0 + \frac{a_1}{b_1 +} \frac{a_2}{b_2 +} \frac{a_3}{b_3 +} \ldots.$$
Because the elements a n and b n of the two defining sequences can depend on a variable x, continued fractions offer an alternative to power series in expanding functions such as distribution functions. In fact, continued fractions can converge where power series diverge, and where both types of expansions converge, continued fractions often converge faster.
Kenneth Lange
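
A continued fraction with a fixed number of terms is easily evaluated by a backward recurrence, starting from the innermost fraction $a_n/b_n$ and working outward. The sketch below (function name ours) illustrates the idea on the simplest example, where all elements equal 1 and the approximants converge to the golden ratio:

```python
def continued_fraction(b0, a, b):
    """Evaluate f_n = b0 + a1/(b1 + a2/(b2 + ... + an/bn)) by the
    backward recurrence over the element lists a and b."""
    value = 0.0
    for a_k, b_k in zip(reversed(a), reversed(b)):
        value = a_k / (b_k + value)
    return b0 + value

# All elements equal to 1: the approximants converge to the
# golden ratio (1 + sqrt(5))/2.
phi = continued_fraction(1.0, [1.0] * 40, [1.0] * 40)
```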

### 4. Asymptotic Expansions

Abstract
Asymptotic analysis is a branch of mathematics dealing with the order of magnitude and limiting behavior of functions, particularly at boundary points of their domains of definition [1, 2, 4, 5, 7]. Consider, for instance, the function
$$f(x) = \frac{x^2 + 1}{x + 1}.$$
It is obvious that f(x) resembles the function x as $$x \rightarrow \infty$$. However, one can be more precise. The expansion
$$\begin{aligned}f(x) &= \frac{x^2 + 1}{x\left(1 + \frac{1}{x}\right)}\\ &= \left(x + \frac{1}{x}\right)\sum\limits_{k = 0}^{\infty}\left(\frac{-1}{x}\right)^{k}\\ &= x - 1 - 2\sum\limits_{k = 1}^{\infty}\left(\frac{-1}{x}\right)^{k}\end{aligned}$$
indicates that f(x) more closely resembles x − 1 for large x. Furthermore, f(x) − x + 1 behaves like 2/x for large x. We can refine the precision of the approximation by taking more terms in the infinite series. How far we continue in this and other problems is usually dictated by the application at hand.
Kenneth Lange
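
The claimed behavior is easy to check numerically. In fact, $f(x) - (x - 1) = 2/(x+1)$ exactly, and $2/x$ approximates this with an error of order $2/x^2$; a quick sketch (ours):

```python
def f(x):
    """The example function (x^2 + 1)/(x + 1)."""
    return (x ** 2 + 1) / (x + 1)

# At x = 1000, the remainder f(x) - (x - 1) equals 2/(x + 1),
# which the leading asymptotic term 2/x matches to order 1/x^2.
x = 1000.0
remainder = f(x) - (x - 1)
```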

### 5. Solution of Nonlinear Equations

Abstract
Solving linear and nonlinear equations is a major preoccupation of applied mathematics and statistics. For nonlinear equations, closed-form solutions are the exception rather than the rule. Here we will concentrate on three simple techniques—bisection, functional iteration, and Newton’s method— for solving equations in one variable. Insight into how these methods operate can be gained by a combination of theory and examples. Since functional iteration and Newton’s method generalize to higher-dimensional problems, it is particularly important to develop intuition about their strengths and weaknesses. Equipped with this intuition, we can tackle harder problems with more confidence and understanding.
Kenneth Lange
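
Two of the three techniques can be sketched in a few lines apiece; both locate the root $\sqrt{2}$ of $f(x) = x^2 - 2$ (our illustration, with hypothetical function names):

```python
def bisect(f, lo, hi, tol=1e-12):
    """Bisection: repeatedly halve a bracketing interval with
    f(lo) * f(hi) < 0 until it is shorter than tol."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def newton(f, fprime, x, steps=50):
    """Newton's method: iterate x <- x - f(x)/f'(x)."""
    for _ in range(steps):
        x = x - f(x) / fprime(x)
    return x

root_b = bisect(lambda x: x * x - 2, 0.0, 2.0)
root_n = newton(lambda x: x * x - 2, lambda x: 2 * x, 1.0)
```

Bisection gains one bit of accuracy per step; Newton's method roughly doubles the number of correct digits per step near the root.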

### 6. Vector and Matrix Norms

Abstract
In multidimensional calculus, vector and matrix norms quantify notions of topology and convergence [2, 4, 5, 6, 8, 12]. Because norms are also devices for deriving explicit bounds, theoretical developments in numerical analysis rely heavily on norms. They are particularly useful in establishing convergence and in estimating rates of convergence of iterative methods for solving linear and nonlinear equations. Norms also arise in almost every other branch of theoretical numerical analysis. Functional analysis, which deals with infinite-dimensional vector spaces, uses norms on functions.
Kenneth Lange
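
The workhorse vector norms and the induced matrix 1-norm and $\infty$-norm reduce to simple column-sum and row-sum computations; a minimal sketch (ours) for matrices stored as lists of rows:

```python
def norm_1(x):
    """Vector 1-norm: sum of absolute values."""
    return sum(abs(t) for t in x)

def norm_2(x):
    """Euclidean norm."""
    return sum(t * t for t in x) ** 0.5

def norm_inf(x):
    """Vector infinity-norm: maximum absolute value."""
    return max(abs(t) for t in x)

def matrix_norm_1(A):
    """Induced matrix 1-norm: maximum absolute column sum."""
    return max(sum(abs(row[j]) for row in A) for j in range(len(A[0])))

def matrix_norm_inf(A):
    """Induced matrix infinity-norm: maximum absolute row sum."""
    return max(sum(abs(t) for t in row) for row in A)
```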

### 7. Linear Regression and Matrix Inversion

Abstract
Linear regression is the most commonly applied procedure in statistics. This fact alone underscores the importance of solving linear least squares problems quickly and reliably. In addition, iteratively reweighted least squares lies at the heart of a host of other optimization algorithms in statistics. The current chapter features four different methods for solving linear least squares problems: sweeping, Cholesky decomposition, the modified Gram-Schmidt procedure, and orthogonalization by Householder reflections. Later we take up solution by the singular value decomposition.
Kenneth Lange
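
The Cholesky route can be sketched compactly: factor $X^T X = L L^T$ and solve the normal equations by one forward and one backward substitution. This sketch (function names ours) is simple but less stable than the orthogonalization methods when $X$ is ill conditioned:

```python
def cholesky(A):
    """Lower-triangular L with A = L L^T, for A symmetric positive definite."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s ** 0.5 if i == j else s / L[j][j]
    return L

def least_squares(X, y):
    """Solve the normal equations X^T X b = X^T y via Cholesky."""
    n, p = len(X), len(X[0])
    XtX = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
           for i in range(p)]
    Xty = [sum(X[k][i] * y[k] for k in range(n)) for i in range(p)]
    L = cholesky(XtX)
    z = [0.0] * p                       # forward substitution: L z = X^T y
    for i in range(p):
        z[i] = (Xty[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    b = [0.0] * p                       # back substitution: L^T b = z
    for i in reversed(range(p)):
        b[i] = (z[i] - sum(L[k][i] * b[k] for k in range(i + 1, p))) / L[i][i]
    return b

# Fit the exactly linear data y = 1 + 2 x at x = 0, 1, 2.
beta = least_squares([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]], [1.0, 3.0, 5.0])
```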

### 8. Eigenvalues and Eigenvectors

Abstract
Finding the eigenvalues and eigenvectors of a symmetric matrix is one of the basic tasks of computational statistics. For instance, in principal components analysis [13], a random m-vector X with covariance matrix Ω is postulated.
Kenneth Lange
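
The simplest computational entry point is the power method: repeatedly multiply a vector by the matrix and normalize, then read off the dominant eigenvalue as a Rayleigh quotient. A minimal sketch (ours, not the chapter's algorithm of choice for full spectra):

```python
def power_method(A, iters=100):
    """Dominant eigenpair of a symmetric matrix by power iteration."""
    n = len(A)
    x = [1.0] + [0.0] * (n - 1)          # arbitrary starting vector
    for _ in range(iters):
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(t * t for t in y) ** 0.5
        x = [t / norm for t in y]
    y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    lam = sum(xi * yi for xi, yi in zip(x, y))   # Rayleigh quotient
    return lam, x

# Eigenvalues of [[2,1],[1,2]] are 3 and 1, with dominant vector (1,1).
lam, vec = power_method([[2.0, 1.0], [1.0, 2.0]])
```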

### 9. Singular Value Decomposition

Abstract
In many modern applications involving large data sets, statisticians are confronted with a large $m \times n$ matrix $X = (x_{ij})$ that encodes $n$ features on each of $m$ objects. For instance, in gene microarray studies $x_{ij}$ represents the expression level of the ith gene under the jth experimental condition [13]. In information retrieval, $x_{ij}$ represents the frequency of the jth word or term in the ith document [2]. The singular value decomposition (svd) captures the structure of such matrices. In many applications there are alternatives to the svd, but these are seldom as informative or as numerically accurate.
Kenneth Lange
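
A two-line illustration of the structure-capturing property, using numpy (our sketch): truncating the svd $X = U \operatorname{diag}(s) V^T$ after its first term gives the best rank-1 approximation of $X$ in the Frobenius norm (Eckart-Young).

```python
import numpy as np

X = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X1 = s[0] * np.outer(U[:, 0], Vt[0])    # best rank-1 approximation
```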

### 10. Splines

Abstract
Splines are used for interpolating functions. Before the advent of computer graphics, a draftsman would draw a smooth curve through a set of points plotted on graph paper by forcing a flexible strip to pass over the points. The strip, made of wood, metal, or plastic, typically was held in place by weights as the draftsman drew along its edge. Subject to passing through the interpolating points, the strip or spline would minimize its stress by straightening out as much as possible. Beyond the terminal points on the left and right, the spline would be straight.
Kenneth Lange

### 11. Optimization Theory

Abstract
This chapter summarizes a handful of basic principles that permit the exact solution of many optimization problems. Misled by the beautiful examples of elementary calculus, students are disappointed when they cannot solve optimization problems analytically. More experienced scholars know that exact solutions are the exception rather than the rule. However, they cherish these exceptions because they form the basis of most iteration schemes in optimization.
Kenneth Lange
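
As a reminder of what exact solution looks like, consider maximizing $xy$ subject to $x + y = 1$ (our illustration, not necessarily one of the chapter's examples). The Lagrange multiplier condition settles it in two lines:

```latex
\mathcal{L}(x, y, \lambda) = xy + \lambda (1 - x - y), \qquad
\frac{\partial \mathcal{L}}{\partial x} = y - \lambda = 0, \qquad
\frac{\partial \mathcal{L}}{\partial y} = x - \lambda = 0.
```

Hence $x = y$, and the constraint forces $x = y = 1/2$ with maximum value $1/4$.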

### 12. The MM Algorithm

Abstract
Most practical optimization problems defy exact solution. In the current chapter we discuss an optimization method that relies heavily on convexity arguments and is particularly useful in high-dimensional problems such as image reconstruction [27]. This iterative method is called the MM algorithm. One of the virtues of the MM acronym is that it does double duty. In minimization problems, the first M stands for majorize and the second M for minimize. In maximization problems, the first M stands for minorize and the second M for maximize. When it is successful, the MM algorithm substitutes a simple optimization problem for a difficult optimization problem. Simplicity can be attained by (a) avoiding large matrix inversions, (b) linearizing an optimization problem, (c) separating the variables of an optimization problem, (d) dealing with equality and inequality constraints gracefully, and (e) turning a nondifferentiable problem into a smooth problem. In simplifying the original problem, we pay the price of iteration or iteration with a slower rate of convergence.
Kenneth Lange
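
A tiny concrete instance of majorize-minimize (our sketch, with a numerical guard `eps` as an assumption): to minimize the nondifferentiable objective $\sum_i |x_i - m|$ defining the sample median, majorize each $|x_i - m|$ by a quadratic at the current iterate. Minimizing the surrogate is then a weighted mean, illustrating virtue (e), turning a nonsmooth problem into a smooth one:

```python
def mm_median(x, iters=500, eps=1e-9):
    """MM iteration for the sample median: majorize |x_i - m| by
    the quadratic (x_i - m)^2 / (2 w) + w/2 with w = |x_i - m^(k)|,
    then minimize the surrogate, a weighted mean.  eps guards
    against division by zero at the data points."""
    m = sum(x) / len(x)
    for _ in range(iters):
        w = [1.0 / max(abs(xi - m), eps) for xi in x]
        m = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
    return m
```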

### 13. The EM Algorithm

Abstract
Maximum likelihood is the dominant form of estimation in applied statistics. Because closed-form solutions to likelihood equations are the exception rather than the rule, numerical methods for finding maximum likelihood estimates are of paramount importance. In this chapter we study maximum likelihood estimation by the EM algorithm [2, 8, 9], a special case of the MM algorithm. At the heart of every EM algorithm is some notion of missing data. Data can be missing in the ordinary sense of a failure to record certain observations on certain cases. Data can also be missing in a theoretical sense. We can think of the E (expectation) step of the algorithm as filling in the missing data. This action replaces the loglikelihood of the observed data by a minorizing function, which is then maximized in the M step. Because the surrogate function is usually much simpler than the likelihood, we can often solve the M step analytically. The price we pay for this simplification is iteration. Reconstruction of the missing data is bound to be slightly wrong if the parameters do not already equal their maximum likelihood estimates.
Kenneth Lange
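
A compact sketch (ours) using the classic genetic-linkage counts (125, 18, 20, 34) analyzed by Dempster, Laird, and Rubin, with cell probabilities $(\frac{1}{2} + \frac{\theta}{4}, \frac{1-\theta}{4}, \frac{1-\theta}{4}, \frac{\theta}{4})$: the E step splits the first cell into its two missing subcounts, and the M step reduces to a binomial proportion.

```python
def em_linkage(y, theta=0.5, iters=100):
    """EM for the linkage multinomial with cell probabilities
    (1/2 + t/4, (1-t)/4, (1-t)/4, t/4)."""
    y1, y2, y3, y4 = y
    for _ in range(iters):
        # E step: expected part of cell 1 attributable to theta/4
        x = y1 * (theta / 4) / (0.5 + theta / 4)
        # M step: maximize the complete-data loglikelihood
        theta = (x + y4) / (x + y2 + y3 + y4)
    return theta

theta_hat = em_linkage((125, 18, 20, 34))
```

The iterates converge to the maximum likelihood estimate $\hat\theta \approx 0.6268$, the positive root of $197\theta^2 - 15\theta - 68 = 0$.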

### 14. Newton’s Method and Scoring

Abstract
The MM and EM algorithms are hardly the only methods of optimization. Newton’s method is better known and more widely applied. We encountered Newton’s method in Section 5.4 of Chapter 5. Here we focus on the multidimensional version. Despite its defects, Newton’s method is the gold standard for speed of convergence and forms the basis of many modern optimization algorithms. Its variants seek to retain its fast convergence while taming its defects. The variants all revolve around the core idea of locally approximating the objective function by a strictly convex quadratic function. At each iteration the quadratic approximation is optimized. Safeguards are introduced to keep the iterates from veering toward irrelevant stationary points.
Kenneth Lange
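
The multidimensional iteration solves the linear system $J(x)\,d = -F(x)$ and updates $x \leftarrow x + d$. A bare-bones two-dimensional sketch (ours, with no safeguards) finds the intersection of the circle $x^2 + y^2 = 4$ with the line $y = x$:

```python
def newton2(F, J, x, iters=50):
    """Two-dimensional Newton's method with a hand-coded 2x2 solve
    of J(x) d = -F(x) by Cramer's rule."""
    for _ in range(iters):
        f1, f2 = F(x)
        (a, b), (c, d) = J(x)
        det = a * d - b * c
        dx = (-d * f1 + b * f2) / det
        dy = (c * f1 - a * f2) / det
        x = (x[0] + dx, x[1] + dy)
    return x

F = lambda p: (p[0] ** 2 + p[1] ** 2 - 4, p[1] - p[0])
J = lambda p: ((2 * p[0], 2 * p[1]), (-1.0, 1.0))
sol = newton2(F, J, (1.0, 2.0))          # converges to (sqrt(2), sqrt(2))
```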

### 15. Local and Global Convergence

Abstract
Proving convergence of the various optimization algorithms is a delicate exercise. In general, it is helpful to consider local and global convergence patterns separately. The local convergence rate of an algorithm provides a useful benchmark for comparing it to other algorithms. On this basis, Newton’s method wins hands down. However, the tradeoffs are subtle. Besides the sheer number of iterations until convergence, the computational complexity and numerical stability of an algorithm are critically important. The MM algorithm is often the epitome of numerical stability and computational simplicity. Scoring lies somewhere between these two extremes. It tends to converge more quickly than the MM algorithm and to behave more stably than Newton’s method. Quasi-Newton methods also occupy this intermediate zone. Because the issues are complex, all of these algorithms survive and prosper in certain computational niches.
Kenneth Lange
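
The contrast in local convergence rates is easy to see empirically. The sketch below (ours) counts iterations needed to pin down $\sqrt{2}$: bisection gains a fixed bit per step, while Newton's quadratic rate finishes in a handful of steps.

```python
def newton_count(x=1.0, tol=1e-10):
    """Iterations of x <- (x + 2/x)/2 until |x - sqrt(2)| < tol."""
    target = 2 ** 0.5
    n = 0
    while abs(x - target) > tol:
        x = 0.5 * (x + 2.0 / x)
        n += 1
    return n

def bisect_count(lo=0.0, hi=2.0, tol=1e-10):
    """Bisection steps until the bracketing interval is shorter than tol."""
    n = 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid * mid < 2.0:
            lo = mid
        else:
            hi = mid
        n += 1
    return n
```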

### 16. Advanced Optimization Topics

Abstract
Our final chapter on optimization provides a concrete introduction to several advanced topics. The first vignette describes classical penalty and barrier methods for constrained optimization [22, 37, 45]. Penalty methods operate on the exterior and barrier methods on the interior of the feasible region. Fortunately, it is fairly easy to prove global convergence for both methods under reasonable hypotheses.
Kenneth Lange
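
The exterior flavor of the penalty method shows up even in one dimension (our illustration): to minimize $x^2$ subject to $x \ge 1$, minimize the penalized objective $x^2 + \mu \max(0, 1 - x)^2$ for an increasing sequence of $\mu$. Each unconstrained minimizer is available in closed form here, and the iterates approach the constrained solution $x^* = 1$ from outside the feasible region.

```python
def penalty_minimizer(mu):
    """Unconstrained minimizer of x^2 + mu * max(0, 1 - x)^2.
    On x < 1 the stationarity condition 2x - 2*mu*(1 - x) = 0
    gives x = mu / (1 + mu), which is infeasible for every finite mu."""
    return mu / (1.0 + mu)

xs = [penalty_minimizer(mu) for mu in (1.0, 10.0, 100.0, 1000.0)]
```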

### 17. Concrete Hilbert Spaces

Abstract
In this chapter we consider an infinite-dimensional generalization of Euclidean space introduced by the mathematician David Hilbert. This generalization preserves two fundamental geometric notions of Euclidean space— namely, distance and perpendicularity. Both of these geometric properties depend on the existence of an inner product. In the infinite-dimensional case, however, we take the inner product of functions rather than of vectors. Our emphasis here will be on concrete examples of Hilbert spaces relevant to statistics. To keep our discussion within bounds, some theoretical facts are stated without proof. Relevant proofs can be found in almost any book on real or functional analysis [6, 12]. Applications of our examples to numerical integration, wavelets, and other topics appear in later chapters.
Kenneth Lange
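
The function inner product $\langle f, g \rangle = \int_a^b f(x) g(x)\, dx$ can be explored numerically by quadrature; the sketch below (ours) verifies the orthogonality of $\sin x$ and $\sin 2x$ on $[0, \pi]$ with the midpoint rule:

```python
from math import sin, pi

def inner(f, g, a, b, n=10000):
    """Approximate the L^2 inner product <f, g> = integral of f*g
    over [a, b] by the midpoint rule with n subintervals."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) * g(a + (i + 0.5) * h)
                   for i in range(n))

ip12 = inner(sin, lambda x: sin(2 * x), 0.0, pi)   # orthogonal pair
ip11 = inner(sin, sin, 0.0, pi)                    # squared norm pi/2
```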

### 18. Quadrature Methods

Abstract
The numerical calculation of one-dimensional integrals, or quadrature, is one of the oldest branches of numerical analysis. Long before calculus was invented, Archimedes found accurate approximations to π by inscribed and circumscribed polygons on a circle of unit radius. In modern applied mathematics and statistics, quadrature is so pervasive that even hand-held calculators are routinely programmed to perform it. Nonetheless, gaining a theoretical understanding of quadrature is worth the effort. In many scientific problems, large numbers of quadratures must be carried out quickly and accurately.
Kenneth Lange
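
Archimedes' scheme can be phrased as a two-term recurrence on the semiperimeters of circumscribed ($a$) and inscribed ($b$) regular polygons, doubling the number of sides at each step. One classical formulation (our sketch):

```python
from math import sqrt

def archimedes(doublings=20):
    """Approximate pi by Archimedes' doubling recurrence, starting
    from hexagons: a = 2*sqrt(3) (circumscribed semiperimeter),
    b = 3 (inscribed), then a <- 2ab/(a + b), b <- sqrt(a_new * b)."""
    a, b = 2.0 * sqrt(3.0), 3.0
    for _ in range(doublings):
        a = 2.0 * a * b / (a + b)
        b = sqrt(a * b)
    return a, b

a, b = archimedes()
```

The bracketing gap shrinks by roughly a factor of 4 per doubling, so 20 doublings already pin down $\pi$ to about twelve digits.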

### 19. The Fourier Transform

Abstract
The Fourier transform is one of the most productive tools of the mathematical sciences. It crops up again and again in unexpected applications to fields as diverse as differential equations, numerical analysis, probability theory, number theory, quantum mechanics, optics, medical imaging, and signal processing [3, 5, 7, 8, 9]. One explanation for its wide utility is that it turns complex mathematical operations like differentiation and convolution into simple operations like multiplication. Readers most likely are familiar with the paradigm of transforming a mathematical equation, solving it in transform space, and then inverting the solution. Besides its operational advantages, the Fourier transform often has the illuminating physical interpretation of decomposing a temporal process into component processes with different frequencies.
Kenneth Lange
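
The convolution-to-multiplication property is easy to demonstrate with numpy's discrete transform (our sketch): the transform of a circular convolution equals the pointwise product of the transforms.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.5, 0.25, 0.0, 0.25])

# Convolve by multiplying in transform space and inverting.
conv = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))

# Direct circular convolution for comparison.
direct = np.array([sum(x[j] * y[(k - j) % 4] for j in range(4))
                   for k in range(4)])
```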

### 20. The Finite Fourier Transform

Abstract
In previous chapters we have met Fourier series and the Fourier transform. These are both incarnations of Fourier analysis on a commutative group, namely the unit circle and the real line under addition. In this chapter we study Fourier analysis in the even simpler setting of the additive group of integers modulo a fixed positive integer n [6, 12]. Here, for obvious reasons, the Fourier transform is called the finite Fourier transform. Although the finite Fourier transform has many interesting applications in abstract algebra, combinatorics, number theory, and complex variables [8], we view it mainly as a tool for approximating Fourier series. Computation of finite Fourier transforms is done efficiently by an algorithm known as the fast Fourier transform [1, 3, 5, 9, 13, 15]. Although it was discovered by Gauss, the fast Fourier transform has come into prominence only with the advent of modern computing. As an indication of its critical role in many scientific and engineering applications, it is often implemented in hardware rather than software.
Kenneth Lange
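
The finite Fourier transform itself is a single double sum; the fast Fourier transform reorganizes it from $O(n^2)$ to $O(n \log n)$ operations. A naive version (our sketch, using the $e^{-2\pi i jk/n}$ sign convention) makes the definition concrete:

```python
import cmath

def dft(x):
    """Naive O(n^2) finite Fourier transform:
    X_k = sum_t x_t * exp(-2*pi*i*t*k/n)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n)
                for t in range(n))
            for k in range(n)]

X = dft([1.0, 1.0, 1.0, 1.0])   # constant sequence: all mass at k = 0
```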

### 21. Wavelets

Abstract
Wavelets are just beginning to enter statistical theory and practice [2, 5, 7, 10, 12]. The pace of discovery is still swift, and except for orthogonal wavelets, the theory has yet to mature. However, the advantages of wavelets are already obvious in application areas such as image compression. Wavelets are more localized in space than the competing sines and cosines of Fourier series. They also use fewer coefficients in representing images, and they pick up edge effects better. The secret behind these successes is the capacity of wavelets to account for image variation on many different scales.
Kenneth Lange
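
The multiscale idea is visible already in the Haar wavelet, the simplest orthogonal wavelet: one level of the transform splits a signal into pairwise averages (coarse scale) and pairwise differences (detail). A minimal sketch (ours):

```python
from math import sqrt

S = 1.0 / sqrt(2.0)

def haar_step(x):
    """One level of the orthonormal Haar transform: scaled pairwise
    sums (smooth part) and differences (detail part)."""
    n = len(x) // 2
    return ([S * (x[2 * i] + x[2 * i + 1]) for i in range(n)],
            [S * (x[2 * i] - x[2 * i + 1]) for i in range(n)])

def haar_inverse(approx, detail):
    """Invert one level; the transform is orthogonal, so the
    inverse reuses the same weights."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([S * (a + d), S * (a - d)])
    return out
```

Orthogonality means the transform preserves the signal's energy (sum of squares), which is what makes thresholding small detail coefficients a safe compression step.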

### 22. Generating Random Deviates

Abstract
Statisticians rely on a combination of mathematical theory and statistical simulation to develop new methods. Because simulations are often conducted on a massive scale, it is crucial that they be efficiently executed. In the current chapter, we investigate techniques for producing random samples from univariate and multivariate distributions. These techniques stand behind every successful simulation and play a critical role in Monte Carlo integration. Exceptionally fast code for simulations almost always depends on using a lower-level computer language such as C or Fortran. This limitation forces the statistician to write custom software. Mastering techniques for generating random variables (or deviates in this context) is accordingly a useful survival skill.
Kenneth Lange
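
The most basic of these techniques is the inverse transform method: if $U$ is uniform on $(0,1)$, then $F^{-1}(U)$ has distribution function $F$. For the exponential, $F(x) = 1 - e^{-\lambda x}$ inverts in closed form (our sketch):

```python
import random
from math import log

def exponential_deviate(rate, u=None, rng=random):
    """Inverse transform method for the Exponential(rate) distribution:
    F(x) = 1 - exp(-rate*x) inverts to F^{-1}(u) = -log(1 - u)/rate."""
    if u is None:
        u = rng.random()
    return -log(1.0 - u) / rate
```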

### 23. Independent Monte Carlo

Abstract
Monte Carlo integration is a rough and ready technique for calculating high-dimensional integrals and dealing with nonsmooth integrands [4, 5, 6, 8, 9, 10, 11, 12, 13]. Although quadrature methods can be extended to multiple dimensions, these deterministic techniques are almost invariably defeated by the curse of dimensionality. For example, if a quadrature method relies on n quadrature points in one dimension, then its product extension to d dimensions relies on $n^d$ quadrature points. Even in one dimension, quadrature methods perform best for smooth functions. Both Romberg acceleration and Gaussian quadrature certainly exploit smoothness.
Kenneth Lange
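
The basic estimator is just a sample mean with an attached standard error, and its $O(n^{-1/2})$ accuracy is independent of dimension. A one-dimensional sketch (ours) estimates $\int_0^1 u^2\, du = 1/3$:

```python
import random
from math import sqrt

def mc_integrate(f, n, seed=0):
    """Estimate the integral of f over (0,1) by the sample mean of f
    at n uniform draws, with the usual standard error estimate."""
    rng = random.Random(seed)
    vals = [f(rng.random()) for _ in range(n)]
    mean = sum(vals) / n
    var = sum((v - mean) ** 2 for v in vals) / (n - 1)
    return mean, sqrt(var / n)

est, se = mc_integrate(lambda u: u * u, 100000)
```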

### 24. Permutation Tests and the Bootstrap

Abstract
In this chapter we discuss two techniques, permutation testing and the bootstrap, of immense practical value. Both techniques involve random resampling of observed data and liberate statisticians from dubious model assumptions and large sample requirements. Both techniques initially met with considerable intellectual resistance. The notion that one can conduct hypothesis tests or learn something useful about the properties of estimators and confidence intervals by resampling data was alien to most statisticians of the past. The computational demands of permutation testing and bootstrapping alone made them unthinkable. These philosophical and practical objections began to crumble with the advent of modern computing.
Kenneth Lange
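
The permutation idea fits in a dozen lines (our sketch): under the null hypothesis of no group difference, the labels are exchangeable, so the p-value is the fraction of random relabelings whose test statistic is at least as extreme as the observed one.

```python
import random

def permutation_test(x, y, reps=2000, seed=1):
    """Two-sample permutation test for a difference in means."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    nx, ny = len(x), len(y)
    observed = abs(sum(x) / nx - sum(y) / ny)
    hits = 0
    for _ in range(reps):
        rng.shuffle(pooled)
        px, py = pooled[:nx], pooled[nx:]
        if abs(sum(px) / nx - sum(py) / ny) >= observed:
            hits += 1
    return hits / reps
```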

### 25. Finite-State Markov Chains

Abstract
Applied probability and statistics thrive on models. Markov chains are one of the richest sources of good models for capturing dynamical behavior with a large stochastic component [2, 3, 7, 9, 13, 18, 19, 21]. Certainly, every research statistician should be comfortable formulating and manipulating Markov chains. In this chapter we give a quick overview of some of the relevant theory of Markov chains in the simple context of finite-state chains. We cover both discrete-time and continuous-time chains in what we hope is a lively blend of applied probability, graph theory, linear algebra, and differential equations. Since this may be a first account for many readers, we stress intuitive explanations and computational techniques rather than mathematical rigor.
Kenneth Lange
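
A first computational task with a finite-state chain is finding its stationary distribution, which for a regular transition matrix $P$ can be done by iterating $\pi \leftarrow \pi P$ from any starting distribution (our sketch):

```python
def stationary(P, iters=500):
    """Stationary distribution of a regular finite-state chain by
    repeated left multiplication pi <- pi P."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# Two-state chain: state 0 stays with prob 0.9, state 1 with prob 0.8.
P = [[0.9, 0.1], [0.2, 0.8]]
dist = stationary(P)                    # converges to (2/3, 1/3)
```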

### 26. Markov Chain Monte Carlo

Abstract
The Markov chain Monte Carlo (MCMC) revolution sweeping statistics is drastically changing how statisticians perform integration and summation. In particular, the Metropolis algorithm and Gibbs sampling make it straightforward to construct a Markov chain that samples from a complicated conditional distribution. Once a sample is available, then any conditional expectation can be approximated by forming its corresponding sample average. The implications of this insight are profound for both classical and Bayesian statistics. As a bonus, trivial changes to the Metropolis algorithm yield simulated annealing, a general-purpose algorithm for solving difficult combinatorial optimization problems.
Kenneth Lange
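
A random-walk Metropolis sampler needs only the target density up to a normalizing constant: propose a perturbation of the current state and accept it with probability $\min(1, f(y)/f(x))$. A one-dimensional sketch (ours; the step scale and seed are arbitrary choices):

```python
import random
from math import exp

def metropolis(logf, x0, scale, n, seed=2):
    """Random-walk Metropolis targeting the density proportional to
    exp(logf): propose x + scale*(2U - 1), accept with probability
    min(1, exp(logf(y) - logf(x)))."""
    rng = random.Random(seed)
    x, out = x0, []
    for _ in range(n):
        y = x + scale * (2.0 * rng.random() - 1.0)
        if rng.random() < min(1.0, exp(logf(y) - logf(x))):
            x = y
        out.append(x)
    return out

# Standard normal target: log density -x^2/2 up to a constant.
draws = metropolis(lambda t: -0.5 * t * t, 0.0, 2.5, 20000)
```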

### 27. Advanced Topics in MCMC

Abstract
The pace of research on MCMC methods is so quick that any survey of advanced topics is immediately obsolete. The highly eclectic and decidedly biased coverage in our final chapter begins with a discussion of Markov random fields. Our limited aims here are to prove the Hammersley-Clifford theorem and introduce the Swendsen-Wang algorithm, a clever form of slice sampling. In the Ising model, the Swendsen-Wang algorithm is much more efficient than standard Gibbs sampling.
Kenneth Lange
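
For concreteness, standard Gibbs sampling for the Ising model (the baseline against which Swendsen-Wang is compared) resamples one spin at a time from its full conditional. A minimal sketch (ours, for a free-boundary grid with interaction parameter `beta`):

```python
import random
from math import exp

def gibbs_sweep(spins, beta, rng):
    """One Gibbs sweep over an Ising configuration on a grid with free
    boundaries: each spin in {-1, +1} is redrawn from its conditional
    p(s = +1 | neighbors) = 1 / (1 + exp(-2*beta*nb_sum))."""
    m, n = len(spins), len(spins[0])
    for i in range(m):
        for j in range(n):
            nb = sum(spins[a][b]
                     for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                     if 0 <= a < m and 0 <= b < n)
            p_up = 1.0 / (1.0 + exp(-2.0 * beta * nb))
            spins[i][j] = 1 if rng.random() < p_up else -1
    return spins
```

At low temperature (large `beta`) the single-site updates flip spins only rarely inside aligned blocks, which is exactly the slow mixing that Swendsen-Wang's cluster moves overcome.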