2015 | Book

# Applied Multivariate Statistics with R

Author: Daniel Zelterman

Publisher: Springer International Publishing

Book Series: Statistics for Biology and Health

This book brings the power of multivariate statistics to graduate-level practitioners, making these analytical methods accessible without lengthy mathematical derivations. Using the open-source program R, Professor Zelterman demonstrates the process and outcomes for a wide array of multivariate statistical applications. Chapters cover graphical displays, linear algebra, univariate, bivariate and multivariate normal distributions, factor methods, linear regression, discrimination and classification, clustering, time series models, and additional methods. Zelterman uses practical examples from diverse disciplines to welcome readers from a variety of academic specialties. Those with backgrounds in statistics will learn new methods while they review more familiar topics. Chapters include exercises, real data sets, and R implementations. The data sets cover interesting, real-world topics, particularly from health and biology-related contexts. As an example of the approach, the text examines a sample from the Behavioral Risk Factor Surveillance System, discussing both the shortcomings of the data and useful analyses. The text avoids theoretical derivations beyond those needed to fully appreciate the methods. Prior experience with R is not necessary.

Abstract

WE ARE SURROUNDED by data. How is multivariate data analysis different from more familiar univariate methods? This chapter provides a summary of most of the major topics covered in this book. We also want to advocate for the multivariate methods developed.

Abstract

THE SOFTWARE PACKAGE “`R`” has become very popular over the past decade for good reason. It is open source, meaning you can examine exactly what steps the program is performing. Compare this to the “black box” approach adopted by so many other software packages that hope you will just push the `Enter` button and accept the results that appear. Another feature of open software is that if you identify a problem, you can fix it, or, at least, publicize the error until it gets fixed. Finally, once you become proficient at `R` you can contribute to it. One of the great features of `R` is the availability of packages of programs written by `R` users for other `R` users.

Abstract

THERE ARE MANY GRAPHICAL METHODS that can be demonstrated with little or no explanation. You are likely familiar with histograms and scatterplots. Many options in `R` have improved on these in interesting and useful ways. The ability to produce statistical graphics is a clear strength of `R`. Graphics commands can produce files in a variety of formats. All of the figures in this book were produced in this manner, for example. We begin with a discussion of the basics. Later sections of this chapter demonstrate a variety of more complex procedures available to us.

Abstract

MANY OPERATIONS performed on multivariate data are facilitated using vector and matrix notation. In this chapter we introduce the basic operations and properties of these and then show how to perform them in `R`.

Abstract

THE NORMAL DISTRIBUTION is central to much of statistics. In this chapter and the two following, we develop the normal model from the univariate, bivariate, and then, finally, the more general distribution with an arbitrary number of dimensions.

Abstract

THE BIVARIATE NORMAL DISTRIBUTION helps us make the important leap from the univariate normal to the more general multivariate normal distribution. To accomplish this, we need to make the transition from the scalar univariate notation of the previous chapter to the matrix notation of the following chapter.

Abstract

IN THIS CHAPTER, we generalize the bivariate normal distribution from the previous chapter to an arbitrary number of dimensions. We also make use of the matrix notation. The mathematics is generally more dense and relies on the linear algebra notation covered in Chap. 4. In Sect. 4.5 we pointed out there is a limit on what computations we can reasonably perform by hand. For this reason, we illustrate these various operations with the help of `R`.

Abstract

THE PREVIOUS CHAPTER described inference on the multivariate normal distribution. Sometimes this is more than we actually need. The multivariate distribution is used as a basis of modeling means and covariances. The covariances describe the multivariate relationship between pairs of individual attributes. In this chapter we go further and describe methods for identifying relationships between several variables concurrently. In the following chapter we will use regression methods to model the means.

Abstract

LINEAR REGRESSION is probably one of the most powerful and useful tools available to the applied statistician. This method uses one or more variables to explain the values of another. Statistics alone cannot prove a cause and effect relationship, but we can show how changes in one set of measurements are associated with changes of the average values in another.

Abstract

IF WE HAVE multivariate observations from two or more identified populations, how can we characterize them? Is there a combination of measurements that can be used to clearly distinguish between these groups? To solve this problem, it is not good enough to simply say that the mean of one variable is statistically higher in one group: the histograms of the groups may overlap considerably, making the discriminatory process only a little better than guesswork. To think in multivariate terms, we do not use only one variable at a time to distinguish between groups of individuals, but, rather, we use a combination of explanatory variables.

Abstract

CLUSTERING is a nonparametric method of arranging similar observations together, often in a graphical display that can be used to detect patterns of grouping and outliers. The approach is usually considered nonparametric because there is no specified underlying distribution or model that we need to assume. `R` offers great flexibility in graphical capability that makes these methods possible. The largest difference between these methods and those considered in the previous chapter is that in this chapter we do not know group membership a priori, or whether in fact there are different groups at all. Accordingly, some of the methods discussed here include estimates of the number of dissimilar groups present in the data.

Abstract

THE MODELS for data described so far have been concerned with independent observations on multivariate values. The data examined in this chapter are from settings where the successive observations are also correlated. This type of data appears frequently in environmental and economic studies where a sequence of observations is taken at evenly spaced time intervals.

Abstract

THIS FINAL CHAPTER provides a collection of useful multivariate methods that do not fit into any of the previous chapters. The Bradley–Terry model gives a way to rank a set of objects that are subjected to pairwise comparisons. Such examples include sports teams that play against each other. Canonical correlations generalize the definition of correlation of a pair of scalar-valued variates to two groups of several variables that are considered jointly. The study of extremes allows us to examine several of the largest values in a collection of data.
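
As a hedged illustration of the first of these methods: in its standard form, the Bradley–Terry model assigns each object $i$ a positive strength parameter $\pi_i$ and models the outcome of a pairwise comparison as shown here. The symbols $\pi_i$ and $\pi_j$ are notation assumed for this sketch and are not taken from the book.

$$
P(i \text{ beats } j) = \frac{\pi_i}{\pi_i + \pi_j}, \qquad \pi_i,\ \pi_j > 0
$$

Equivalently, the log-odds that $i$ beats $j$ equal $\log \pi_i - \log \pi_j$, so ranking the objects amounts to ordering the estimated $\pi_i$.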