2022 | Book

# Applied Multivariate Statistics with R

Author: Daniel Zelterman

Publisher: Springer International Publishing

Book Series: Statistics for Biology and Health

Now in its second edition, this book brings multivariate statistics to graduate-level practitioners, making these analytical methods accessible without lengthy mathematical derivations. Using the free, open source program R, Dr. Zelterman demonstrates the process and outcomes for a wide array of multivariate statistical applications. Chapters cover graphical displays; linear algebra; univariate, bivariate and multivariate normal distributions; factor methods; linear regression; discrimination and classification; clustering; time series models; and additional methods. He uses practical examples from diverse disciplines to welcome readers from a variety of academic specialties. Each chapter includes exercises, real data sets, and R implementations. The book avoids theoretical derivations beyond those needed to fully appreciate the methods. Prior experience with R is not necessary.

New to this edition are chapters devoted to longitudinal studies and the clustering of large data. It is an excellent resource for students of multivariate statistics, as well as practitioners in the health and life sciences who are looking to integrate statistics into their work.

Abstract

WE ARE SURROUNDED by data. How is multivariate data analysis different from more familiar univariate methods? This chapter provides a summary of most of the major topics covered in this book. We also want to provide advocacy for the multivariate methods developed.

Abstract

THE SOFTWARE PACKAGE `R` has become very popular over the past decade for good reason. It is open source, meaning you can examine exactly what steps the program is performing. Compare this to the “black box” approach adopted by so many other software packages whose authors hope you will just push the `Enter` button and accept the results. Another feature of open software is that if you identify a problem, you can fix it, or at least publicize the error until it gets fixed. Finally, once you become proficient at `R`, you can contribute to it. One of the great features of `R` is the availability of packages of programs written by `R` users for other `R` users. Best of all, `R` is free for the asking and easy to install on your computer.

Abstract

THERE ARE MANY GRAPHICAL METHODS to be demonstrated with little or no explanation. You are likely familiar with histograms and scatterplots. Many options in `R` have improved on these in interesting and useful ways. The ability to produce statistical graphics is a clear strength of `R`.

Abstract

MANY OPERATIONS performed on multivariate data are facilitated using vector and matrix notation. In this chapter, we introduce the basic operations and properties of these and then show how to perform them in `R`.

Abstract

THE NORMAL DISTRIBUTION is central to much of statistics. In this chapter and the two following, we develop the normal model from the univariate, bivariate, and then, finally, the more general distribution with an arbitrary number of dimensions.

Abstract

THE BIVARIATE NORMAL DISTRIBUTION helps us make the important leap from the univariate normal to the more general multivariate normal distribution. To accomplish this, we need to make the transition from the scalar univariate notation of the previous chapter to the matrix notation of the following chapter.

Abstract

IN THIS CHAPTER, we generalize the bivariate normal distribution from the previous chapter to an arbitrary number of dimensions. We also make use of the matrix notation. The mathematics is generally more dense and relies on the linear algebra notation covered in Chapter 4. In Sect. 4.5, we pointed out there is a limit on what computations we can reasonably perform by hand. For this reason, we illustrate these various operations with the help of `R`.

Abstract

THE PREVIOUS CHAPTER described inference on the multivariate normal distribution. Sometimes this is more than we actually need. The multivariate distribution is used as a basis of modeling means and covariances. The covariances describe the multivariate relationship between pairs of individual attributes. In this chapter, we go further and describe methods for identifying relationships between several variables concurrently. In the following chapter, we will use regression methods to model the means.

Abstract

LINEAR REGRESSION is probably one of the most powerful and useful tools available to the applied statistician. This method uses one or more variables to explain the values of another. Statistics alone cannot prove a cause and effect relationship, but we can show how changes in one set of measurements are associated with changes of the average values in another.
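As a rough illustration of using several variables to explain the average value of another, the sketch below fits a linear regression in `R` with the base function `lm()`. The data are simulated here and are not taken from the book; the variable names `x1`, `x2`, and `y` and the true coefficients are assumptions made only for this example.

```r
# Simulated data (hypothetical; not one of the book's data sets)
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n, sd = 0.5)

# Least-squares fit of the mean of y as a linear function of x1 and x2
fit <- lm(y ~ x1 + x2)

coef(fit)                # estimated intercept and slopes
summary(fit)$r.squared   # proportion of the variance in y explained
```

The fitted slopes describe how the average of `y` changes with each explanatory variable while the other is held fixed; they describe association and, on their own, say nothing about cause and effect.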

Abstract

IF WE HAVE multivariate observations from two or more identified populations, how can we characterize them? Is there a combination of measurements to clearly distinguish between these groups? It is not good enough to simply say the mean of one variable is statistically higher in one group in order to solve this problem, because the histograms of the groups may have considerable overlap making the discriminatory process only a little better than guesswork.

Abstract

CLUSTERING is a nonparametric method of arranging similar observations together, often in a graphical display used to detect patterns of grouping and outliers. The approach is usually considered nonparametric because there is no specified underlying distribution or model we need to assume. `R` offers great flexibility in graphical capability, making these methods possible. The largest difference between these methods and those considered in the previous chapter is that in this chapter we do not know group membership a priori, or whether in fact there are different groups at all. Similarly, part of the methods discussed here includes estimates of the number of dissimilar groups present in the data.

Abstract

Longitudinal studies are a common study design. They include clinical trials in which the treatment effect is visible only after several measurements are made on the same individual over a period of time. This chapter begins with an example of a randomized trial of an experimental medication.

Abstract

THE MODELS for data described so far have been concerned with independent observations on multivariate values. The data examined in this chapter are for settings where successive observations are also correlated. The subject matter is not usually associated with multivariate methods but our choice of applications makes these methods more relevant.
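To make the idea of correlated successive observations concrete, the following sketch simulates a first-order autoregressive series and then estimates its dependence, using the base `R` functions `arima.sim()`, `acf()`, and `arima()`. The AR(1) model and the coefficient 0.7 are assumptions chosen for illustration and are not taken from the book.

```r
# Simulate a series in which each value depends on the one before it
# (hypothetical AR(1) model with coefficient 0.7)
set.seed(2)
x <- arima.sim(model = list(ar = 0.7), n = 500)

# Sample autocorrelations; the second element is the lag-1 correlation
r <- acf(x, plot = FALSE)
r$acf[2]

# Fit an AR(1) model and recover the autoregressive coefficient
fit <- arima(x, order = c(1, 0, 0))
coef(fit)[["ar1"]]
```

A lag-1 autocorrelation well above zero indicates that neighboring observations are not independent, which is the setting this chapter addresses.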

Abstract

THIS FINAL CHAPTER provides a collection of useful multivariate methods not fitting into any of the previous chapters. The Bradley–Terry model gives us a way to rank a set of objects examined by pairwise comparisons. Such examples include sports teams playing against each other. Canonical correlations generalize the definition of correlation of a pair of scalar-valued variates to two groups of several variables considered jointly. The study of extremes allows us to examine several of the largest values in a collection of data.
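One common way to fit a Bradley–Terry model is as a logistic regression without an intercept, where each team has an "ability" and the log-odds that one team beats another equal the difference of their abilities. The sketch below does this with base `R`'s `glm()`. The teams and win counts are invented for illustration, and this is not necessarily the approach taken in the book.

```r
# Hypothetical win counts among three teams (invented for illustration)
games <- data.frame(
  team1 = c("A", "A", "B"),
  team2 = c("B", "C", "C"),
  wins1 = c(7, 8, 6),   # games won by team1
  wins2 = c(3, 2, 4)    # games won by team2
)
teams <- c("A", "B", "C")

# Design matrix: +1 in the column of team1, -1 in the column of team2
X <- matrix(0, nrow(games), length(teams), dimnames = list(NULL, teams))
X[cbind(seq_len(nrow(games)), match(games$team1, teams))] <-  1
X[cbind(seq_len(nrow(games)), match(games$team2, teams))] <- -1

# Drop team A's column so its ability is fixed at 0 (reference team)
X <- X[, -1, drop = FALSE]

# Logistic regression with no intercept: logit P(team1 wins) = ability1 - ability2
fit <- glm(cbind(wins1, wins2) ~ X - 1, family = binomial, data = games)

ability <- c(0, coef(fit))
names(ability) <- teams
sort(ability, decreasing = TRUE)   # ranking from strongest to weakest
```

Only differences between abilities are identified, which is why one team is used as a reference with ability zero.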