
2012 | Book

Applied Multivariate Statistical Analysis

Authors: Wolfgang Karl Härdle, Léopold Simar

Publisher: Springer Berlin Heidelberg


About this book

Most of the observable phenomena in the empirical sciences are of a multivariate nature. In financial studies, assets are observed simultaneously and their joint development is analysed to better understand general risk and to track indices. In medicine, recorded observations of subjects in different locations are the basis of reliable diagnoses and medication. In quantitative marketing, consumer preferences are collected in order to construct models of consumer behavior. The underlying data structure of these and many other quantitative studies of applied sciences is multivariate. Focusing on applications, this book presents the tools and concepts of multivariate data analysis in a way that is understandable for non-mathematicians and practitioners who need to analyze statistical data. The book surveys the basic principles of multivariate statistical data analysis and emphasizes both exploratory and inferential statistics. All chapters include exercises that highlight applications in different fields.

The third edition of this book on Applied Multivariate Statistical Analysis offers the following new features:

  • A new chapter on Regression Models has been added.
  • All numerical examples have been redone, updated and made reproducible in MATLAB or R; see www.quantlet.org for a repository of quantlets.

Table of Contents

Frontmatter

Descriptive Techniques

Frontmatter
Chapter 1. Comparison of Batches
Abstract
Multivariate statistical analysis is concerned with analysing and understanding data in high dimensions. We suppose that we are given a set \(\{x_{i}\}^{n}_{i=1}\) of n observations of a variable vector X in \(\mathbb {R}^{p}\). That is, we suppose that each observation \(x_{i}\) has p dimensions:
$$x_i = (x_{i1}, x_{i2}, \ldots , x_{ip}),$$
and that it is an observed value of a variable vector \(X \in \mathbb {R}^{p}\). Therefore, X is composed of p random variables:
$$X = (X_{1}, X_{2}, \ldots , X_{p}),$$
where \(X_{j}\), for \(j=1,\ldots,p\), is a one-dimensional random variable. How do we begin to analyse this kind of data? Before we investigate questions on what inferences we can reach from the data, we should think about how to look at the data. This involves descriptive techniques. Questions that we could answer by descriptive techniques are:
  • Are there components of X that are more spread out than others?
  • Are there some elements of X that indicate sub-groups of the data?
  • Are there outliers in the components of X?
  • How “normal” is the distribution of the data?
  • Are there “low-dimensional” linear combinations of X that show “non-normal” behaviour?
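A minimal R sketch of how such questions can be approached descriptively (using R's built-in iris data as a stand-in for the book's own data sets, which are available via www.quantlet.org):
```r
# First descriptive look at a multivariate data set (sketch;
# iris serves as a stand-in example with n = 150, p = 4).
X <- iris[, 1:4]                         # n x p data matrix

summary(X)                               # spread of each component of X
boxplot(scale(X),                        # parallel boxplots on a common scale:
        main = "Comparison of batches")  # which components spread most?
pairs(X, col = iris$Species)             # hints at sub-groups and outliers
```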
Wolfgang Karl Härdle, Léopold Simar

Multivariate Random Variables

Frontmatter
Chapter 2. A Short Excursion into Matrix Algebra
Abstract
This chapter serves as a reminder of basic concepts of matrix algebra, which are particularly useful in multivariate analysis. It also introduces the notations used in this book for vectors and matrices. Eigenvalues and eigenvectors play an important role in multivariate techniques. In Sections 2.2 and 2.3, we present the spectral decomposition of matrices and consider the maximisation (minimisation) of quadratic forms given some constraints.
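As a minimal sketch (not the book's own code), the spectral decomposition \(A = \Gamma \Lambda \Gamma^{\top}\) and the link to constrained quadratic forms can be checked numerically with base R's eigen:
```r
# Spectral decomposition of a symmetric matrix (minimal sketch).
A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)

e      <- eigen(A)         # eigenvalues and orthonormal eigenvectors
Lambda <- diag(e$values)
Gamma  <- e$vectors

# Check A = Gamma Lambda Gamma' up to rounding error
max(abs(A - Gamma %*% Lambda %*% t(Gamma)))
e$values[1]                # max of x'Ax subject to x'x = 1
```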
Wolfgang Karl Härdle, Léopold Simar
Chapter 3. Moving to Higher Dimensions
Abstract
We have seen in the previous chapters how very simple graphical devices can help in understanding the structure and dependency of data. The graphical tools were based on either univariate (bivariate) data representations or on “slick” transformations of multivariate information perceivable by the human eye. Most of the tools are extremely useful in a modelling step, but unfortunately, do not give the full picture of the data set. One reason for this is that the graphical tools presented capture only certain dimensions of the data and do not necessarily concentrate on those dimensions or sub-parts of the data under analysis that carry the maximum structural information. In Part III of this book, powerful tools for reducing the dimension of a data set will be presented. In this chapter, as a starting point, simple and basic tools are used to describe dependency. They are constructed from elementary facts of probability theory and introductory statistics (for example, the covariance and correlation between two variables).
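A minimal base-R sketch of these elementary dependence measures, on simulated data:
```r
# Covariance and correlation as elementary dependence measures (sketch).
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100, sd = 0.5)

cov(x, y)       # covariance between two variables
cor(x, y)       # scale-free correlation
cor.test(x, y)  # simple test of a linear relationship
```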
Wolfgang Karl Härdle, Léopold Simar
Chapter 4. Multivariate Distributions
Abstract
The preceding chapter showed that by using the first two moments of a multivariate distribution (the mean and the covariance matrix), a lot of information on the relationship between the variables can be made available. Only basic statistical theory was used to derive tests of independence or of linear relationships. In this chapter we give an introduction to the basic probability tools useful in statistical multivariate analysis.
Wolfgang Karl Härdle, Léopold Simar
Chapter 5. Theory of the Multinormal
Abstract
In the preceding chapter we saw how the multivariate normal distribution comes into play in many applications. It is useful to know more about this distribution, since it is often a good approximate distribution in many situations. Another reason for considering the multinormal distribution relies on the fact that it has many appealing properties: it is stable under linear transforms, zero correlation corresponds to independence, the marginals and all the conditionals are also multivariate normal variates, etc. The mathematical properties of the multinormal make analyses much simpler.
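A minimal sketch of the stability under linear transforms, assuming the MASS package for sampling from a multinormal (not the book's own code):
```r
# If X ~ N_p(mu, Sigma), then Y = A X ~ N(A mu, A Sigma A') (sketch).
library(MASS)
set.seed(1)
mu    <- c(0, 0)
Sigma <- matrix(c(1.0, 0.5,
                  0.5, 2.0), nrow = 2)
X <- mvrnorm(n = 1000, mu = mu, Sigma = Sigma)  # draws from N_2(mu, Sigma)

A <- matrix(c(1,  1,
              1, -1), nrow = 2)                 # a linear transform
Y <- X %*% t(A)

# Compare empirical covariance of Y with the theoretical A Sigma A'
cov(Y)
A %*% Sigma %*% t(A)
```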
Wolfgang Karl Härdle, Léopold Simar
Chapter 6. Theory of Estimation
Abstract
We know from our basic knowledge of statistics that one of the objectives in statistics is to better understand and model the underlying process which generates data. This is known as statistical inference: we infer, from the information contained in a sample, properties of the population from which the observations are taken. In multivariate statistical inference, we do exactly the same. The basic ideas were introduced in Section 4.5 on sampling theory: we observed the values of a multivariate random variable X and obtained a sample \({\mathcal{X}}=\{x_{i}\}_{i=1}^{n}\). Under random sampling, these observations are considered to be realisations of a sequence of i.i.d. random variables \(X_{1},\ldots,X_{n}\), where each \(X_{i}\) is a p-variate random variable which replicates the parent or population random variable X. In this chapter, for notational convenience, we will no longer differentiate between a random variable \(X_{i}\) and an observation of it, \(x_{i}\), in our notation. We will simply write \(x_{i}\), and it should be clear from the context whether a random variable or an observed value is meant.
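As an illustrative sketch (with iris as a stand-in sample), the maximum likelihood estimators of the mean vector and covariance matrix of a multinormal are the sample mean and the rescaled empirical covariance:
```r
# ML estimates of mu and Sigma from a sample {x_i} (sketch).
X <- as.matrix(iris[, 1:4])
n <- nrow(X)

mu_hat    <- colMeans(X)           # MLE of the mean vector
Sigma_mle <- cov(X) * (n - 1) / n  # MLE rescales the unbiased cov(X)

mu_hat
Sigma_mle
```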
Wolfgang Karl Härdle, Léopold Simar
Chapter 7. Hypothesis Testing
Abstract
In the preceding chapter, the theoretical basis of estimation theory was presented. Now we turn our interest towards testing issues: we want to test the hypothesis \(H_{0}\) that the unknown parameter \(\theta\) belongs to some subspace of \(\mathbb {R}^{q}\). This subspace is called the null set and will be denoted by \(\Omega_{0} \subset \mathbb {R}^{q}\).
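A minimal sketch of a one-sample Hotelling \(T^{2}\) test in base R; the F transformation of \(T^{2}\) is standard, but the hypothesised mean mu0 below is an arbitrary illustrative value, not taken from the book:
```r
# One-sample Hotelling T^2 test of H0: mu = mu0 (sketch).
X   <- as.matrix(iris[1:50, 1:4])  # one homogeneous group only
n   <- nrow(X); p <- ncol(X)
mu0 <- c(5.0, 3.4, 1.5, 0.2)       # hypothesised mean (illustrative)

d    <- colMeans(X) - mu0
T2   <- n * t(d) %*% solve(cov(X)) %*% d
Fobs <- (n - p) / (p * (n - 1)) * T2   # ~ F(p, n - p) under H0
pval <- 1 - pf(Fobs, p, n - p)
c(T2 = T2, F = Fobs, p.value = pval)
```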
Wolfgang Karl Härdle, Léopold Simar

Multivariate Techniques

Frontmatter
Chapter 8. Regression Models
Abstract
The aim of regression models is to model the variation of a quantitative response variable y in terms of the variation of one or several explanatory variables \((x_{1},\ldots,x_{p})\). We have already introduced such models in Chapters 3 and 7, where linear models were written in (3.50) as
$$y={\mathcal{X}} \beta + \varepsilon,$$
where y (n×1) is the vector of observations for the response variable, \({\mathcal{X}} (n\times p)\) is the data matrix of the p explanatory variables and ε is the vector of errors. Linear models are not restricted to handling only linear relationships between y and x. Curvature is allowed by including appropriate higher-order terms in the design matrix \({\mathcal{X}}\).
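A minimal sketch in R: a higher-order term enters the design matrix through I(x^2), here on simulated data:
```r
# Linear model y = X beta + eps with curvature via a quadratic term (sketch).
set.seed(1)
x <- runif(100, 0, 3)
y <- 1 + 2 * x - 0.5 * x^2 + rnorm(100, sd = 0.3)

fit <- lm(y ~ x + I(x^2))  # design matrix contains 1, x, x^2
summary(fit)$coefficients  # estimates of beta with standard errors
```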
Wolfgang Karl Härdle, Léopold Simar
Chapter 9. Decomposition of Data Matrices by Factors
Abstract
In Chapter 1 basic descriptive techniques were developed which provided tools for “looking” at multivariate data. They were based on adaptations of bivariate or univariate devices used to reduce the dimensions of the observations. In the following three chapters, issues of reducing the dimension of a multivariate data set will be discussed. The perspectives will be different but the tools will be related.
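A minimal sketch (iris used as a stand-in): the singular value decomposition of a centred data matrix, which underlies the factorial decompositions of this chapter:
```r
# Decomposition of a centred data matrix by the SVD: X = U D V' (sketch).
X <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)

s <- svd(X)                                  # u, d, v
# Best rank-2 approximation of X (projection onto 2 factors):
X2 <- s$u[, 1:2] %*% diag(s$d[1:2]) %*% t(s$v[, 1:2])
# Full reconstruction is exact up to rounding error:
max(abs(X - s$u %*% diag(s$d) %*% t(s$v)))
```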
Wolfgang Karl Härdle, Léopold Simar
Chapter 10. Principal Components Analysis
Abstract
Chapter 9 presented the basic geometric tools needed to produce a lower-dimensional description of the rows and columns of a multivariate data matrix. Principal components analysis has the same objective, with the exception that the rows of the data matrix \({{\mathcal{X}}}\) will now be considered as observations from a p-variate random variable X. The principal idea of reducing the dimension of X is achieved through linear combinations. Low-dimensional linear combinations are often easier to interpret and serve as an intermediate step in a more complex data analysis. More precisely, one looks for linear combinations which create the largest spread among the values of X. In other words, one is searching for linear combinations with the largest variances.
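A minimal sketch with base R's prcomp (iris as a stand-in data set):
```r
# Principal components as variance-maximising linear combinations (sketch).
pca <- prcomp(iris[, 1:4], scale. = TRUE)

summary(pca)  # share of total variance per component
pca$rotation  # loadings: the linear combinations themselves
biplot(pca)   # low-dimensional view of observations and variables
```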
Wolfgang Karl Härdle, Léopold Simar
Chapter 11. Factor Analysis
Abstract
A frequently applied paradigm in analyzing data from multivariate observations is to model the relevant information (represented in a multivariate variable X) as coming from a limited number of latent factors. In a survey on household consumption, for example, the consumption levels, X, of p different goods during one month could be observed. The variations and covariations of the p components of X throughout the survey might in fact be explained by two or three main social behavior factors of the household. For instance, a basic desire of comfort or the willingness to achieve a certain social level or other social latent concepts might explain most of the consumption behavior. These unobserved factors are much more interesting to the social scientist than the observed quantitative measures (X) themselves, because they give a better understanding of the behavior of households. As shown in the examples below, the same kind of factor analysis is of interest in many fields such as psychology, marketing, economics, political science, etc.
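A minimal sketch with base R's factanal on simulated data, where two latent factors drive six observed variables by construction:
```r
# Recovering latent factors behind observed correlations (sketch).
set.seed(1)
n  <- 300
f1 <- rnorm(n); f2 <- rnorm(n)  # two latent factors
X  <- cbind(f1 + rnorm(n, sd = 0.5), f1 + rnorm(n, sd = 0.5),
            f1 + rnorm(n, sd = 0.5), f2 + rnorm(n, sd = 0.5),
            f2 + rnorm(n, sd = 0.5), f2 + rnorm(n, sd = 0.5))

fa <- factanal(X, factors = 2, rotation = "varimax")
fa$loadings  # each block of variables loads on one factor
```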
Wolfgang Karl Härdle, Léopold Simar
Chapter 12. Cluster Analysis
Abstract
The next two chapters address classification issues from two different perspectives. When considering groups of objects in a multivariate data set, two situations can arise. Given a data set containing measurements on individuals, in some cases we want to see if some natural groups or classes of individuals exist, and in other cases, we want to classify the individuals according to a set of existing groups. Cluster analysis develops tools and methods concerning the former case, that is, given a data matrix containing multivariate measurements on a large number of individuals (or objects), the objective is to build some natural subgroups or clusters of individuals. This is done by grouping individuals that are “similar” according to some appropriate criterion. Once the clusters are obtained, it is generally useful to describe each group using some descriptive tool from Chapters 1, 9 or 10 to create a better understanding of the differences that exist among the groups thus formed.
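A minimal base-R sketch on the built-in USArrests data; Ward's criterion is one of several possible merging rules:
```r
# Hierarchical clustering on a distance matrix (sketch).
X <- scale(USArrests)        # standardised built-in data
d <- dist(X)                 # Euclidean distances between objects

hc <- hclust(d, method = "ward.D2")
plot(hc)                     # dendrogram of the merging process
groups <- cutree(hc, k = 4)  # cut the tree into 4 clusters
table(groups)
```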
Wolfgang Karl Härdle, Léopold Simar
Chapter 13. Discriminant Analysis
Abstract
Discriminant analysis is used in situations where the clusters are known a priori. The aim of discriminant analysis is to classify an observation, or several observations, into these known groups. For instance, in credit scoring, a bank knows from past experience that there are good customers (who repay their loan without any problems) and bad customers (who showed difficulties in repaying their loan). When a new customer asks for a loan, the bank has to decide whether or not to grant the loan. The past records of the bank provide two data sets: multivariate observations \(x_{i}\) on the two categories of customers (including, for example, age, salary, marital status, the amount of the loan, etc.). The new customer is a new observation x with the same variables. The discrimination rule has to classify the customer into one of the two existing groups, and the discriminant analysis should evaluate the risk of a possible “bad decision”.
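A minimal sketch with MASS::lda, using iris species as the known groups in place of the credit-scoring example:
```r
# Linear discriminant analysis with groups known a priori (sketch).
library(MASS)
train <- iris[, c("Sepal.Length", "Sepal.Width", "Species")]

rule <- lda(Species ~ ., data = train)  # the discrimination rule
pred <- predict(rule, train)$class      # classify the observations
table(truth = train$Species, pred)      # misclassification risk
```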
Wolfgang Karl Härdle, Léopold Simar
Chapter 14. Correspondence Analysis
Abstract
Correspondence analysis provides tools for analyzing the associations between rows and columns of contingency tables. A contingency table is a two-way frequency table where the joint frequencies of two qualitative variables are reported. For instance, a (2×2) table could be formed by observing from a sample of n individuals two qualitative variables: the individual’s sex and whether the individual smokes. The table reports the observed joint frequencies. In general (n×p) tables may be considered.
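A minimal sketch with MASS::corresp; the sex-by-smoking counts below are invented for illustration:
```r
# Correspondence analysis of a small contingency table (sketch).
library(MASS)
tab <- matrix(c(40, 25,
                15, 30), nrow = 2, byrow = TRUE,
              dimnames = list(sex = c("male", "female"),
                              smoker = c("yes", "no")))

ca <- corresp(tab, nf = 1)  # a (2x2) table has a single dimension
ca$rscore; ca$cscore        # row and column scores of the association
```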
Wolfgang Karl Härdle, Léopold Simar
Chapter 15. Canonical Correlation Analysis
Abstract
Complex multivariate data structures are better understood by studying low-dimensional projections. For a joint study of two data sets, we may ask what type of low-dimensional projection helps in finding possible joint structures for the two samples. Canonical correlation analysis is a standard tool of multivariate statistical analysis for the discovery and quantification of associations between two sets of variables.
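A minimal sketch with base R's cancor, splitting the built-in LifeCycleSavings data into two variable sets:
```r
# Canonical correlations between two sets of variables (sketch).
X <- as.matrix(LifeCycleSavings[, c("pop15", "pop75")])
Y <- as.matrix(LifeCycleSavings[, c("sr", "dpi", "ddpi")])

cc <- cancor(X, Y)
cc$cor    # canonical correlations
cc$xcoef  # projection directions for the first set
```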
Wolfgang Karl Härdle, Léopold Simar
Chapter 16. Multidimensional Scaling
Abstract
One major aim of multivariate data analysis is dimension reduction. For data measured in Euclidean coordinates, Factor Analysis and Principal Component Analysis are the dominant tools. In many applied sciences, however, data is recorded as ranked information. For example, in marketing, one may record “product A is better than product B”. High-dimensional observations therefore often have mixed data characteristics and contain relative information (w.r.t. a defined standard) rather than absolute coordinates that would enable us to employ one of the multivariate techniques presented so far.
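A minimal sketch with base R's cmdscale on the built-in eurodist road distances between European cities:
```r
# Classical multidimensional scaling from pairwise distances (sketch).
mds <- cmdscale(eurodist, k = 2)  # 2-dimensional configuration

plot(mds[, 1], -mds[, 2], type = "n", xlab = "", ylab = "")
text(mds[, 1], -mds[, 2], labels = rownames(mds), cex = 0.7)
```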
Wolfgang Karl Härdle, Léopold Simar
Chapter 17. Conjoint Measurement Analysis
Abstract
Conjoint Measurement Analysis plays an important role in marketing. In the design of new products it is valuable to know which components carry what kind of utility for the customer. Marketing and advertisement strategies are based on the perception of the new product’s overall utility. It can be valuable information for a car producer to know whether a change in sportiness or a change in safety or comfort equipment is perceived as a higher increase in overall utility. Conjoint Measurement Analysis is a method for attributing utilities to the components (part worths) on the basis of ranks given to different outcomes (stimuli) of the product. An important assumption is that the overall utility is decomposed as a sum of the utilities of the components.
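A minimal sketch of part-worth estimation via a linear model on ranks; the attributes and ranks below are invented for illustration:
```r
# Part worths from ranked stimuli, assuming additive utilities (sketch).
design <- expand.grid(safety     = c("basic", "extra"),
                      sportiness = c("low", "high"))
design$rank <- c(4, 2, 3, 1)  # hypothetical preference ranks

# Overall utility assumed additive in the component utilities:
fit <- lm(rank ~ safety + sportiness, data = design)
coef(fit)                     # estimated part worths
```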
Wolfgang Karl Härdle, Léopold Simar
Chapter 18. Applications in Finance
Abstract
A portfolio is a linear combination of assets. Each asset contributes with a weight \(c_{j}\) to the portfolio. The performance of such a portfolio is a function of the various returns of the assets and of the weights \(c=(c_{1},\ldots,c_{p})\). In this chapter we investigate the “optimal choice” of the portfolio weights c. The optimality criterion is the mean-variance efficiency of the portfolio. Usually investors are risk-averse; therefore, we can define a mean-variance efficient portfolio to be a portfolio that has a minimal variance for a given desired mean return. Equivalently, we could optimise the weights for the portfolio with maximal mean return for a given variance (risk structure). We develop this methodology for the situations of (non)existence of riskless assets and discuss relations with the Capital Asset Pricing Model (CAPM).
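A minimal sketch of the minimum-variance weights \(c = \Sigma^{-1}1_{p}/(1_{p}^{\top}\Sigma^{-1}1_{p})\) in the absence of a riskless asset, on simulated returns (the variable names are illustrative):
```r
# Minimum-variance portfolio weights from estimated return covariance (sketch).
set.seed(1)
R     <- matrix(rnorm(500 * 3, mean = 0.01, sd = 0.05), ncol = 3)
Sigma <- cov(R)               # estimated covariance of asset returns
ones  <- rep(1, ncol(R))

c_opt <- solve(Sigma, ones) / sum(solve(Sigma, ones))
c_opt                         # weights sum to one
t(c_opt) %*% Sigma %*% c_opt  # resulting portfolio variance
```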
Wolfgang Karl Härdle, Léopold Simar
Chapter 19. Computationally Intensive Techniques
Abstract
It is generally accepted that training in statistics must include some exposure to the mechanics of computational statistics. This exposure to computational methods is of an essential nature when we consider extremely high-dimensional data. Computer-aided techniques can help us to discover dependencies in high dimensions without complicated mathematical tools. A draftsman's plot (i.e. a matrix of pairwise scatterplots as in Figure 1.14) may lead us immediately to a theoretical hypothesis (on a lower-dimensional space) on the relationship of the variables. Computer-aided techniques are therefore at the heart of multivariate statistical analysis.
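A minimal sketch of a draftsman's plot with base R's pairs, on the built-in swiss data:
```r
# Draftsman's plot: all pairwise scatterplots at once (sketch).
pairs(swiss,                  # built-in socio-economic data
      pch = 19, cex = 0.5,
      main = "Draftsman's plot of the swiss data")
```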
Wolfgang Karl Härdle, Léopold Simar

Appendix

Frontmatter
Appendix A: Symbols and Notations
Abstract
In this appendix, we summarize the symbols and notations used in the book. We also give a brief description of the data sets; all references are collected at the end.
Wolfgang Karl Härdle, Léopold Simar
Appendix B: Data
Abstract
All data sets are available on the Springer webpage or at the authors’ home pages. More detailed information on the data sets may be found there.
Wolfgang Karl Härdle, Léopold Simar
Backmatter
Metadata
Title
Applied Multivariate Statistical Analysis
Authors
Wolfgang Karl Härdle
Léopold Simar
Copyright Year
2012
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-17229-8
Print ISBN
978-3-642-17228-1
DOI
https://doi.org/10.1007/978-3-642-17229-8