2020 | Book

Chemometrics with R

Multivariate Data Analysis in the Natural and Life Sciences

Author: Ron Wehrens

Publisher: Springer Berlin Heidelberg

Book series: Use R!

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

This book offers readers an accessible introduction to the world of multivariate statistics in the life sciences, providing a comprehensive description of the general data analysis paradigm, from exploratory analysis (principal component analysis, self-organizing maps and clustering) to modeling (classification, regression) and validation (including variable selection). It also includes a special section discussing several more specific topics in the area of chemometrics, such as outlier detection and biomarker identification. The corresponding R code is provided for all the examples in the book, and scripts, functions and data are available in a separate R package. This second revised edition features not only updates on many of the topics covered, but also several sections of new material (e.g., on handling missing values in PCA, multivariate process monitoring and batch correction).

Table of contents

Frontmatter

Chapter 1. Introduction

Abstract

In the last twenty years, the life sciences have seen a dramatic increase in the size and number of data sets. Simple sensing devices in many cases offer real-time data streaming, but more complicated data types such as spectra or images can also be acquired at much higher rates than was thought possible not too long ago.

Ron Wehrens

Preliminaries

Frontmatter

Chapter 2. Data

Abstract

In this chapter some data sets are presented that will be used throughout the book. In a couple of places (in particular in Chap. 11) other sets will be discussed focusing on particular analysis aspects. All data sets are accessible, either through one of the packages mentioned in the text, or in the ChemometricsWithR package. In addition to a short description, the data will be visualized to get an idea of their form and characteristics—one cannot stress enough how important it is to eyeball the data, not only through convenient summaries but also in their raw form!

Ron Wehrens

Chapter 3. Preprocessing

Abstract

Textbook examples typically use clean, perfect data, allowing the techniques of interest to be explained and illustrated. However, in real life data are messy, noisy, incomplete, downright faulty, or a combination of these. The first step in any data analysis often consists of preprocessing to assess and possibly improve data quality. This step may actually take more time than the analysis itself, and more often than not the process consists of an iterative procedure where data preprocessing steps are alternated with data analysis steps.

Ron Wehrens

Exploratory Analysis

Frontmatter

Chapter 4. Principal Component Analysis

Abstract

Principal Component Analysis or PCA (Jackson 1991; Jolliffe 1986) is a technique which, quite literally, takes a different viewpoint of multivariate data. It has many uses, perhaps the most important of which is the possibility to provide simple two-dimensional plots of high-dimensional data. This way, one can easily assess the presence of grouping or outliers, and more generally obtain an idea of how samples and variables relate to each other.
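
The two-dimensional plots mentioned in this abstract can be written down compactly. The notation below is an assumption (the abstract introduces no symbols): \(\mathbf{X}_c\) is the mean-centered \(n \times p\) data matrix, the columns of \(\mathbf{V}\) are the loadings, and \(\mathbf{T}\) holds the scores.

```latex
\[
\mathbf{X}_c = \mathbf{U}\mathbf{D}\mathbf{V}^{T}, \qquad
\mathbf{T} = \mathbf{X}_c\mathbf{V} = \mathbf{U}\mathbf{D}, \qquad
\mathbf{T}_{(2)} = \mathbf{X}_c\,[\,\mathbf{v}_1 \;\; \mathbf{v}_2\,]
\]
```

Plotting the two columns of \(\mathbf{T}_{(2)}\) against each other gives the usual score plot, in which groupings and outliers among the \(n\) samples can be inspected.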

Ron Wehrens

Chapter 5. Self-Organizing Maps

Abstract

In PCA, the most outlying data points determine the direction of the PCs—these are the ones contributing most to the variance. This often results in score plots showing a large group of points close to the center.

Ron Wehrens

Chapter 6. Clustering

Abstract

As we saw earlier in the visualizations provided by methods like PCA and SOM, it is often interesting to look for structure, or groupings, in the data. However, these methods do not explicitly define clusters; that is left to the pattern recognition capabilities of the scientist studying the plot.

Ron Wehrens

Modelling

Frontmatter

Chapter 7. Classification

Abstract

The goal of classification, also known as supervised pattern recognition, is to provide a model that yields the optimal discrimination between several classes in terms of predictive performance.

Ron Wehrens

Chapter 8. Multivariate Regression

Abstract

In Chaps. 6 and 7 we have concentrated on finding groups in data, or, given a grouping, creating a predictive model for new data. The latter situation is “supervised” in the sense that we use a set of examples with known class labels, the training set, to build the model. In this chapter we will do something similar, except that now we are not predicting a discrete class property but rather a continuous variable. Put differently: given a set of independent real-valued variables (matrix \(\mathbf{X}\)), we want to build a model that allows prediction of \(\mathbf{Y}\), consisting of one, or possibly more, real-valued dependent variables. As in almost all regression cases, we here assume that errors, normally distributed with constant variance, are only present in the dependent variables, or at least are so much larger in the dependent variables that errors in the independent variables can be ignored. Of course, we also would like to have an estimate of the expected error in predictions for future data.
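
As a sketch of the setting described in this abstract, and assuming a linear model (which the abstract does not state explicitly), the regression problem can be written as follows. The coefficient matrix \(\mathbf{B}\), the error matrix \(\mathbf{E}\) with elements \(\varepsilon_{ij}\), and the variance \(\sigma^2\) are symbols introduced here for illustration.

```latex
\[
\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}, \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^{2})
\]
\[
\hat{\mathbf{B}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{Y}
\quad \text{(ordinary least squares, when } \mathbf{X}^{T}\mathbf{X} \text{ is invertible)}
\]
```

All error sits in \(\mathbf{E}\), on the side of the dependent variables, while \(\mathbf{X}\) is treated as error-free, which matches the assumption stated in the abstract.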

Ron Wehrens

Model Inspection

Frontmatter

Chapter 9. Validation

Abstract

Validation is the assessment of the quality of a predictive model, in accordance with the scientific paradigm in the natural sciences: a model that is able to make accurate predictions (the position of a planet in two weeks’ time) is, in some sense, a “correct” description of reality. In many applications in the natural sciences, unfortunately, validation is hard to do: chemical and biological processes often exhibit quite significant variation unrelated to the model parameters. An example is the circadian rhythm: metabolomic samples, be it from animals or plants, will show very different characteristics when taken at different time points. When the experimental meta-data on the exact time point of sampling are missing, it will be very hard to ascribe differences in metabolite levels to differences between patients and controls, or different varieties of the same plant. Only a rigorous and consistent experimental design will be able to prevent this kind of fluctuation. Moreover, biological variation between individuals often dominates measurement variation. The bigger the variation, the more important it is to have enough samples for validation. Only in this way can reliable error estimates be obtained.
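
One common way to obtain such error estimates is k-fold cross-validation. The sketch below is not taken from the book (which works in R); it is a minimal, language-neutral illustration in Python, using a hypothetical straight-line model and synthetic data. The function names `fit_line`, `kfold_indices` and `cv_rmse` are made up for this example.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_rmse(xs, ys, k=5, seed=0):
    """Root mean squared error of prediction, estimated by k-fold CV."""
    sq_errors = []
    for fold in kfold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        # Fit on the training part only, then predict the held-out samples
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_errors.extend((ys[i] - (a + b * xs[i])) ** 2 for i in fold)
    return math.sqrt(sum(sq_errors) / len(sq_errors))


# Synthetic data (hypothetical): a straight line plus Gaussian noise
rng = random.Random(1)
xs = [i / 2 for i in range(40)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 0.5) for x in xs]
print(f"5-fold CV RMSE: {cv_rmse(xs, ys):.3f}")
```

Every sample is predicted exactly once by a model that never saw it, so the resulting error estimate is not flattered by the fit to the training data. With few samples or large biological variation the folds become small and the estimate itself becomes noisy, which is the point the abstract makes about needing enough samples.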

Ron Wehrens

Chapter 10. Variable Selection

Abstract

Variable selection is an important topic in many types of multivariate modelling: the choice of which variables to take into account determines the result to a large degree.

Ron Wehrens

Applications

Frontmatter

Chapter 11. Chemometric Applications

Abstract

This chapter highlights some typical examples of research themes in the chemometrics community. Up to now we have concentrated on fairly general techniques, found in many textbooks and applicable in a wide range of fields. The topics in this chapter are more specific to the field of chemometrics, combining elements from the previous chapters. In particular, latent-variable approaches like PCA and PLS exhibit a wide range of applications (some people have criticized the field of chemometrics for being too preoccupied with latent-variable methods, and not without reason; on the other hand, such tools are extremely handy in many different situations).

Ron Wehrens

Backmatter

Title: Chemometrics with R
Author: Ron Wehrens
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-662-62027-4
Print ISBN: 978-3-662-62026-7
DOI: https://doi.org/10.1007/978-3-662-62027-4