
About this Book

Remarkable advances in computation and data storage and the ready availability of huge data sets have been the keys to the growth of the new disciplines of data mining and machine learning, while the enormous success of the Human Genome Project has opened up the field of bioinformatics.

These exciting developments, which led to the introduction of many innovative statistical tools for high-dimensional data analysis, are described here in detail. The author takes a broad perspective; for the first time in a book on multivariate analysis, nonlinear methods are discussed in detail as well as linear methods. Techniques covered range from traditional multivariate methods, such as multiple regression, principal components, canonical variates, linear discriminant analysis, factor analysis, clustering, multidimensional scaling, and correspondence analysis, to the newer methods of density estimation, projection pursuit, neural networks, multivariate reduced-rank regression, nonlinear manifold learning, bagging, boosting, random forests, independent component analysis, support vector machines, and classification and regression trees. Another unique feature of this book is the discussion of database management systems.

This book is appropriate for advanced undergraduate students, graduate students, and researchers in statistics, computer science, artificial intelligence, psychology, cognitive sciences, business, medicine, bioinformatics, and engineering. Familiarity with multivariable calculus, linear algebra, and probability and statistics is required. The book presents a carefully integrated mixture of theory and applications, and of classical and modern multivariate statistical techniques, including Bayesian methods. There are over 60 interesting data sets used as examples in the book, over 200 exercises, and many color illustrations and photographs.

Table of Contents

Frontmatter

1. Introduction and Preview

Abstract
This book invites the reader to learn about multivariate analysis, its modern ideas, innovative statistical techniques, and novel computational tools, as well as exciting new applications.
Alan Julian Izenman

2. Data and Databases

Abstract
Multivariate data consist of multiple measurements, observations, or responses obtained on a collection of selected variables. The types of variables usually encountered often depend upon those who collect the data (the domain experts), possibly together with some statistical colleagues; for it is these people who actively decide which variables are of interest in studying a particular phenomenon. In other circumstances, data are collected automatically and routinely without a research direction in mind, using software that records every observation or transaction made regardless of whether it may be important or not.
Alan Julian Izenman

3. Random Vectors and Matrices

Abstract
This chapter builds the foundation for the statistical analysis of multivariate data. We first give the notation we use in this book, followed by a quick review of the rules for manipulating vectors and matrices. Then, we learn about random vectors and matrices, which are the fundamental building blocks for multivariate analysis. We then describe the properties of a variety of estimators of an unknown mean vector and unknown covariance matrix of a multivariate Gaussian distribution.
Alan Julian Izenman
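
As a small illustration of the estimators mentioned in this abstract, the following sketch computes the sample mean vector together with the maximum-likelihood and unbiased estimators of the covariance matrix from simulated Gaussian data; the data, dimensions, and variable names are purely illustrative and not taken from the book.

    import numpy as np

    # Simulate n observations of an r-dimensional Gaussian vector (illustrative data)
    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(mean=[0.0, 2.0, -1.0],
                                cov=[[2.0, 0.5, 0.0],
                                     [0.5, 1.0, 0.3],
                                     [0.0, 0.3, 1.5]],
                                size=100)

    n, r = X.shape
    xbar = X.mean(axis=0)                    # sample mean vector
    Xc = X - xbar                            # centered data matrix
    S_mle = Xc.T @ Xc / n                    # maximum-likelihood covariance estimator
    S_unbiased = Xc.T @ Xc / (n - 1)         # unbiased covariance estimator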

4. Nonparametric Density Estimation

Abstract
Nonparametric techniques consist of sophisticated alternatives to traditional parametric models for studying multivariate data. What makes these alternative techniques so appealing to the data analyst is that they make no specific distributional assumptions and, thus, can be employed as an initial exploratory look at the data. In this chapter, we discuss methods for nonparametric estimation of a probability density function.
Alan Julian Izenman
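
One classical estimator of the kind discussed in this chapter is the kernel density estimator. The sketch below is a minimal illustration, with simulated data and a bandwidth chosen only for demonstration, that evaluates a one-dimensional Gaussian-kernel estimate on a grid.

    import numpy as np

    def gaussian_kde(x_grid, data, h):
        # f_hat(x) = (1 / (n h)) * sum_i K((x - x_i) / h), with a Gaussian kernel K
        u = (x_grid[:, None] - data[None, :]) / h
        K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
        return K.sum(axis=1) / (len(data) * h)

    data = np.random.default_rng(1).normal(size=200)   # illustrative sample
    x_grid = np.linspace(-4.0, 4.0, 101)
    f_hat = gaussian_kde(x_grid, data, h=0.4)          # bandwidth h fixed here for illustration

In practice the bandwidth h controls the smoothness of the estimate and is usually chosen by a data-driven rule rather than fixed by hand.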

5. Model Assessment and Selection in Multiple Regression

Abstract
Regression, as a scientific method, first appeared around 1885, although the method of least squares was discovered 80 years earlier. Least squares owes its origins to astronomy and, specifically, to Legendre’s 1805 pioneering work on the determination of the orbits of planets, in which he introduced and named the method of least squares. Adrien-Marie Legendre estimated the coefficients of a set of linear equations by minimizing the error sum of squares. Gauss stated in 1809 that he had been using the method since 1795, but could not prove his claim with documented evidence. Within a few years, Gauss and Pierre-Simon Laplace added a probability component (a Gaussian curve to describe the error distribution) that was crucial to the success of the method. Gauss went on to devise an elimination algorithm to compute least-squares estimates. Once introduced, least squares caught on immediately in astronomy and geodetics, but it took 80 years for these ideas to be transported to other disciplines.
Alan Julian Izenman
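
To make Legendre's criterion concrete: least squares chooses the coefficients that minimize the error sum of squares, which for a full-rank design amounts to solving the normal equations, a linear system of the kind Gauss's elimination algorithm was devised for. The sketch below, with simulated data and illustrative names, is not from the book.

    import numpy as np

    rng = np.random.default_rng(2)
    X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])   # design matrix with intercept
    beta_true = np.array([1.0, 2.0, -0.5])
    y = X @ beta_true + rng.normal(scale=0.3, size=50)             # simulated responses

    # Normal equations (X'X) beta = X'y, solved by an elimination-type factorization
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    rss = np.sum((y - X @ beta_hat) ** 2)                          # the minimized error sum of squares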

6. Multivariate Regression

Abstract
Multivariate linear regression is a natural extension of multiple linear regression in that both techniques try to interpret possible linear relationships between certain input and output variables. Multiple regression is concerned with studying to what extent the behavior of a single output variable Y is influenced by a set of r input variables \(X = (X_1, \ldots, X_r)^\tau\).
Alan Julian Izenman
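
In standard notation (not necessarily the notation used in the chapter), the contrast between the two settings can be written as
\[
Y = \beta_0 + \sum_{j=1}^{r} \beta_j X_j + \epsilon
\qquad \text{versus} \qquad
\mathbf{Y} = \boldsymbol{\beta}_0 + \mathbf{B}\,\mathbf{X} + \boldsymbol{\epsilon},
\]
where, in the multivariate case, \(\mathbf{Y}\) and \(\boldsymbol{\epsilon}\) are vectors of s output variables, \(\mathbf{X} = (X_1, \ldots, X_r)^\tau\) as above, and \(\mathbf{B}\) is an \(s \times r\) matrix of regression coefficients.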

7. Linear Dimensionality Reduction

Abstract
When faced with situations involving high-dimensional data, it is natural to consider the possibility of projecting those data onto a lower-dimensional subspace without losing important information regarding some characteristic of the original variables. One way of accomplishing this reduction of dimensionality is through variable selection, also called feature selection (see Section 5.7). Another way is by creating a reduced set of linear or nonlinear transformations of the input variables. The creation of such composite variables (or features) by projection methods is often referred to as feature extraction. Usually, we wish to find those low-dimensional projections of the input data that enjoy some sort of optimality properties.
Alan Julian Izenman
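
Principal component analysis is the archetype of the projection methods referred to here. The following is a minimal sketch, with simulated data and illustrative names, that extracts k composite variables from the leading right singular vectors of the centered data matrix.

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 10))                 # illustrative n x r data matrix
    Xc = X - X.mean(axis=0)                        # center each input variable

    # The right singular vectors of the centered data give the projection directions
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = 2
    Z = Xc @ Vt[:k].T                              # k-dimensional composite variables (scores)
    var_explained = s[:k]**2 / np.sum(s**2)        # share of total variance each direction captures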

8. Linear Discriminant Analysis

Abstract
Suppose we are given a learning set \(\mathcal{L}\) of multivariate observations (i.e., input values in \(\mathfrak{R}^r\)), and suppose each observation is known to have come from one of K predefined classes having similar characteristics. These classes may be identified, for example, as species of plants, levels of creditworthiness of customers, presence or absence of a specific medical condition, different types of tumors, views on Internet censorship, or whether an e-mail message is spam or non-spam.
Alan Julian Izenman

9. Recursive Partitioning and Tree-Based Methods

Abstract
An algorithm known as recursive partitioning is the key to the nonparametric statistical method of classification and regression trees (CART) (Breiman, Friedman, Olshen, and Stone, 1984). Recursive partitioning is the step-by-step process by which a decision tree is constructed by either splitting or not splitting each node on the tree into two daughter nodes. An attractive feature of the CART methodology (or the related C4.5 methodology; Quinlan, 1993) is that because the algorithm asks a sequence of hierarchical Boolean questions (e.g., is \(X_j \le \theta_j\)?, where \(\theta_j\) is a threshold value), it is relatively simple to understand and interpret the results.
Alan Julian Izenman
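
The heart of recursive partitioning is the search, at each node, over Boolean questions of the form "is \(X_j \le \theta\)?". The toy sketch below scores every candidate split of a node by the weighted Gini impurity of the two daughter nodes; it is only an illustration of that single step, not the full CART procedure (which adds stopping rules, pruning, surrogate splits, and more).

    import numpy as np

    def gini(y):
        # Gini impurity of a vector of class labels
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p**2)

    def best_split(X, y):
        # Try every question "is X_j <= theta?" and keep the one with the purest daughters
        n, r = X.shape
        best_j, best_theta, best_score = None, None, np.inf
        for j in range(r):
            for theta in np.unique(X[:, j])[:-1]:          # thresholds between observed values
                left = X[:, j] <= theta
                score = (left.sum() * gini(y[left]) +
                         (~left).sum() * gini(y[~left])) / n
                if score < best_score:
                    best_j, best_theta, best_score = j, theta, score
        return best_j, best_theta, best_score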

10. Artificial Neural Networks

Abstract
The learning technique of artificial neural networks (ANNs, or just neural networks or NNs) is the focus of this chapter. The development of ANNs evolved in periodic “waves” of research activity. ANNs were influenced by the fortunes of the fields of artificial intelligence and expert systems, which sought to answer questions such as: What makes the human brain such a formidable machine in processing cognitive thought? What is the nature of this thing called “intelligence”? And, how do humans solve problems?
Alan Julian Izenman

11. Support Vector Machines

Abstract
Fisher’s linear discriminant function (LDF) and related classifiers for binary and multiclass learning problems have performed well for many years and for many data sets. Recently, a brand-new learning methodology, support vector machines (SVMs), has emerged (Boser, Guyon, and Vapnik, 1992), which has matched the performance of the LDF and, in many instances, has proved to be superior to it.
Alan Julian Izenman

12. Cluster Analysis

Abstract
Cluster analysis, which is the most well-known example of unsupervised learning, is a very popular tool for analyzing unstructured multivariate data. Within the data-mining community, cluster analysis is also known as data segmentation, and within the machine-learning community, it is also known as class discovery. The methodology consists of various algorithms each of which seeks to organize a given data set into homogeneous subgroups, or “clusters.” There is no guarantee that more than one such group can be found; however, in any practical application, the underlying hypothesis is that the data form a heterogeneous set that should separate into natural groups familiar to the domain experts.
Alan Julian Izenman
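
As one concrete example of such an algorithm, the sketch below implements k-means (Lloyd's algorithm) in a few lines: points are assigned to their nearest cluster center, and the centers are then recomputed, repeatedly. It is only a bare-bones illustration; the names and the fixed iteration count are arbitrary.

    import numpy as np

    def kmeans(X, k, n_iter=20, seed=0):
        # Lloyd's algorithm: alternate nearest-center assignment and center updates
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        return labels, centers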

13. Multidimensional Scaling and Distance Geometry

Abstract
Imagine you have a map of a particular geographical region, which includes a number of cities and towns. Usually, such a map will be accompanied by a two-way table displaying how close a selected number of those towns and cities are to each other. Each cell of that table will show the degree of “closeness” (or proximity) of the row city to the column city that identifies that cell. The notion of proximity between two geographical locations is easy to understand, even though it could have different meanings: for example, proximity could be defined as straight-line distance or as shortest traveling distance.
Alan Julian Izenman
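
The simplest member of this family is classical (metric) scaling, which turns a table of pairwise distances directly into coordinates. A minimal sketch, assuming the input D is a symmetric matrix of distances such as the road distances accompanying a map:

    import numpy as np

    def classical_mds(D, k=2):
        # Double-center the squared distances, then keep the top-k eigenvectors as coordinates
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n
        B = -0.5 * J @ (D**2) @ J
        eigval, eigvec = np.linalg.eigh(B)                 # eigenvalues in ascending order
        order = np.argsort(eigval)[::-1][:k]
        scale = np.sqrt(np.clip(eigval[order], 0.0, None))
        return eigvec[:, order] * scale                    # n x k configuration of points

For straight-line distances between cities, the recovered two-dimensional configuration reproduces the layout of the map up to rotation and reflection.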

14. Committee Machines

Abstract
One of the most important research topics in machine learning is the problem of how to lower the generalization error of a learning algorithm, either by reducing the bias or the variance (or both). A major complication of any attempt to reduce variance or bias (or both) is that the definitions of “bias” and “variance” of a classification rule are not as obvious as they are in regression. In fact, there have been several conflicting suggestions for the bias-variance decomposition for classification problems.
Alan Julian Izenman
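
For reference, the decomposition that is unambiguous in regression under squared-error loss (with \(f\) the true regression function, \(\hat{f}\) its estimate from a random learning set, and \(\sigma^2\) the irreducible noise variance) is
\[
\mathrm{E}\big[(Y - \hat{f}(x))^2\big]
 = \sigma^2
 + \big(\mathrm{E}[\hat{f}(x)] - f(x)\big)^2
 + \mathrm{E}\Big[\big(\hat{f}(x) - \mathrm{E}[\hat{f}(x)]\big)^2\Big],
\]
that is, noise plus squared bias plus variance. It is this clean additive split that has no single agreed-upon analogue under 0-1 classification loss, which is why several competing decompositions have been proposed.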

15. Latent Variable Models for Blind Source Separation

Abstract
Alan Julian Izenman

16. Nonlinear Dimensionality Reduction and Manifold Learning

Abstract
We have little visual guidance to help us identify any meaningful low-dimensional structure hidden in high-dimensional data. The linear projection methods of Chapter 7 can be extremely useful in discovering low-dimensional structure when the data actually lie in a linear (or approximately linear) lower-dimensional subspace (called a manifold) \(\mathcal{M}\) of input space \(\mathfrak{R}^r\). But what can we do if we know or suspect that the data actually lie on a low-dimensional nonlinear manifold, whose structure and dimensionality are both assumed unknown? Our goal of dimensionality reduction then becomes one of identifying the nonlinear manifold in question. The problem of recovering that manifold is known as nonlinear manifold learning.
Alan Julian Izenman
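
As a quick illustration of what such an algorithm does (assuming scikit-learn is available; Isomap is one widely used nonlinear manifold-learning method), the snippet below unrolls the classic "Swiss roll", a two-dimensional sheet curled up in three dimensions.

    # Assumes scikit-learn is installed; purely illustrative
    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import Isomap

    X, t = make_swiss_roll(n_samples=1000, random_state=0)        # 3-D points on a rolled-up 2-D sheet
    Z = Isomap(n_neighbors=10, n_components=2).fit_transform(X)   # recovered 2-D coordinates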

17. Correspondence Analysis

Abstract
Correspondence analysis is an exploratory multivariate technique for simultaneously displaying scores representing the row categories and column categories of a two-way contingency table as the coordinates of points in a low-dimensional (two- or possibly three-dimensional) vector space. The objective is to clarify the relationship between the row and column variates of the table and to discover a low-dimensional explanation for possible deviations from independence of those variates. The methodology has its own nomenclature, and its approach is decidedly geometric, especially for interpreting the resulting graphical displays.
Alan Julian Izenman
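
A compact sketch of the core computation, using the standard formulation (the contingency table below is invented for illustration): rescale the table to a correspondence matrix, form the matrix of standardized residuals from independence, and read the row and column coordinates off its singular value decomposition.

    import numpy as np

    N = np.array([[20, 30, 10],
                  [15, 25, 40],
                  [35, 10, 15]], dtype=float)        # illustrative two-way contingency table

    P = N / N.sum()                                  # correspondence matrix
    r = P.sum(axis=1)                                # row masses
    c = P.sum(axis=0)                                # column masses
    S = np.diag(r**-0.5) @ (P - np.outer(r, c)) @ np.diag(c**-0.5)   # residuals from independence
    U, d, Vt = np.linalg.svd(S, full_matrices=False)

    F = np.diag(r**-0.5) @ U * d                     # principal coordinates of the row categories
    G = np.diag(c**-0.5) @ Vt.T * d                  # principal coordinates of the column categories

Plotting the first two columns of F and G in the same display gives the kind of low-dimensional map of row and column categories that the abstract describes.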

Backmatter
