
2020 | Book

Random Forests with R


About this book

This book offers an application-oriented guide to random forests: a statistical learning method used extensively in many fields thanks to its excellent predictive performance and its flexibility, which places few restrictions on the nature of the data. Indeed, random forests can be applied to both supervised classification problems and regression problems, and they handle qualitative and quantitative explanatory variables together, without pre-processing. Moreover, they can process standard data, where the number of observations exceeds the number of variables, while also performing very well in the high-dimensional case, where the number of variables is large compared to the number of observations. Consequently, they are now among the preferred methods in the toolbox of statisticians and data scientists.

The book is primarily intended for students in statistics and related fields, but also for practitioners in statistics and machine learning. An undergraduate degree in a scientific discipline is sufficient to take full advantage of the concepts, methods, and tools discussed. Little computer science background is required, though an introduction to the R language is recommended.

Random forests belong to the family of tree-based methods; accordingly, after an introductory chapter, Chapter 2 presents CART trees. The next three chapters are devoted to random forests: their presentation (Chapter 3), the variable importance tool (Chapter 4), and the variable selection problem (Chapter 5). After discussing the concepts and methods, we illustrate their implementation on a running example. Various complements are then provided before examining additional examples. Throughout the book, each result is given together with the R code that can be used to reproduce it. The book thus offers readers the essential information and concepts, together with the examples and software tools needed to analyze data using random forests.

Table of Contents

Frontmatter
Chapter 1. Introduction to Random Forests with R
Abstract
The two algorithms discussed in this book were proposed by Leo Breiman: CART trees, which were introduced in the mid-1980s, and random forests, which emerged just under 20 years later in the early 2000s. This chapter offers an introduction to the subject matter, beginning with a historical overview. Some notations, used to define the various statistical objectives addressed in the book, are also introduced: classification, regression, prediction, and variable selection. In turn, the three R packages used in the book are listed, and some competitors are mentioned. Lastly, the four datasets used to illustrate the methods’ application are presented: the running example (spam), a genomic dataset, and two pollution datasets (ozone and dust).
Robin Genuer, Jean-Michel Poggi
Chapter 2. CART
Abstract
CART stands for Classification And Regression Trees, and refers to a statistical method for constructing tree predictors (also called decision trees) for both regression and classification problems. This chapter focuses on CART trees, analyzing in detail the two steps involved in their construction: the maximal tree growing algorithm, which produces a large family of models, and the pruning algorithm, which is used to select an optimal, or at least suitable, final tree. The construction is illustrated on the spam dataset using the rpart package. The chapter then addresses interpretability issues and how to use competing and surrogate splits. In the final section, trees are applied to two examples: predicting ozone concentration and analyzing genomic data.
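The two-step construction described above can be sketched with the rpart package. This is a minimal illustration, assuming the spam dataset is available from the kernlab package (the book's running example); the exact options used in the book may differ.

```r
# Sketch of CART's two steps with rpart, assuming the spam
# dataset from the kernlab package.
library(rpart)
data("spam", package = "kernlab")

# Step 1: grow a large (maximal) tree by turning off the complexity penalty
tree_max <- rpart(type ~ ., data = spam,
                  control = rpart.control(cp = 0))

# Step 2: prune back to the subtree with the smallest cross-validated error
cp_table <- tree_max$cptable
cp_opt <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
tree_pruned <- prune(tree_max, cp = cp_opt)
```

rpart's cptable stores, for each nested subtree, its complexity parameter and cross-validated error, so pruning amounts to picking the row with the smallest xerror.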
Robin Genuer, Jean-Michel Poggi
Chapter 3. Random Forests
Abstract
The general principle of random forests is to aggregate a collection of random decision trees. Instead of seeking to optimize a single predictor "at once", as with a CART tree, the goal is to pool a set of (not necessarily optimal) predictors. Since individual trees are randomly perturbed, the forest benefits from a more extensive exploration of the space of all possible tree predictors, which, in practice, results in better predictive performance. Focusing on random forests, this chapter begins by addressing the instability of a tree and subsequently introduces readers to two random forest variants: Bagging and Random Forests-Random Inputs. The construction of random forests is illustrated on the spam dataset using the randomForest package. The Out-Of-Bag error, a clever built-in estimate of the prediction error, is also presented. In turn, the chapter assesses the sensitivity of prediction performance to the two main parameters: the number of trees and the number of variables picked at each node. In the final section, random forests are applied to three examples: predicting ozone concentration, analyzing genomic data, and analyzing dust pollution.
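A minimal sketch of fitting a forest with the randomForest package, again assuming the spam dataset from the kernlab package; the parameter values shown are illustrative, not the book's.

```r
# Sketch of a random forest fit, assuming the spam dataset
# from the kernlab package.
library(randomForest)
data("spam", package = "kernlab")

set.seed(2020)  # tree construction is random; fix the seed to reproduce
rf <- randomForest(type ~ ., data = spam,
                   ntree = 500,                          # number of trees
                   mtry  = floor(sqrt(ncol(spam) - 1)))  # variables tried per node

# OOB error: each tree is tested on the observations left out
# of its bootstrap sample, giving a built-in error estimate
rf$err.rate[rf$ntree, "OOB"]
```

The two parameters varied in the chapter, ntree and mtry, are exactly the arguments set here; for classification, randomForest's default mtry is the square root of the number of variables.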
Robin Genuer, Jean-Michel Poggi
Chapter 4. Variable Importance
Abstract
Here, the focus is on creating a hierarchy of input variables, based on a quantification of the importance of their effects on the response variable. Such an index of importance provides a ranking of variables. Random forests offer an ideal framework, as they do not make any assumptions regarding the underlying model. This chapter introduces permutation variable importance using random forests and illustrates its use on the spam dataset. The behavior of the variable importance index is first studied with regard to data-related aspects: the number of observations, number of variables, and presence of groups of correlated variables. Then, its behavior with regard to random forest parameters is addressed. In the final section, the use of variable importance is first illustrated by simulation in regression, and then in three examples: predicting ozone concentration, analyzing genomic data, and determining the local level of dust pollution.
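The permutation importance index described above can be computed directly with the randomForest package. A minimal sketch, assuming the spam dataset from the kernlab package:

```r
# Sketch of permutation variable importance, assuming the spam
# dataset from the kernlab package.
library(randomForest)
data("spam", package = "kernlab")

set.seed(2020)
rf <- randomForest(type ~ ., data = spam, importance = TRUE)

# type = 1: mean decrease in accuracy when each variable is permuted
imp <- importance(rf, type = 1)
head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE])

varImpPlot(rf, type = 1)  # graphical ranking of the variables
```

Note that importance = TRUE must be set when fitting, since the permutation-based index is not computed by default.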
Robin Genuer, Jean-Michel Poggi
Chapter 5. Variable Selection
Abstract
This chapter is dedicated to variable selection using random forests: an automatic three-step procedure involving first a fairly coarse elimination of a large number of useless variables, followed by a finer, ascending sequential introduction of variables into random forest models, first for interpretation and then for prediction. The principle and the procedure implemented in the VSURF package are presented on the spam dataset. The choice of VSURF parameters suitable for selection is then studied. In the final section, the variable selection procedure is applied to two real examples: predicting ozone concentration and analyzing genomic data.
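The three steps of the procedure map onto three components of a VSURF result object. A minimal sketch, assuming the spam dataset from the kernlab package (note that VSURF can take a long time on data of this size):

```r
# Sketch of the three-step VSURF selection procedure, assuming
# the spam dataset from the kernlab package.
library(VSURF)
data("spam", package = "kernlab")

set.seed(2020)
res <- VSURF(type ~ ., data = spam)

res$varselect.thres   # step 1: variables kept after coarse thresholding
res$varselect.interp  # step 2: variables selected for interpretation
res$varselect.pred    # step 3: variables selected for prediction
```

The three sets are nested: the prediction set is a subset of the interpretation set, which is itself a subset of the thresholding set.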
Robin Genuer, Jean-Michel Poggi
Backmatter
Metadata
Title
Random Forests with R
Authors
Robin Genuer
Jean-Michel Poggi
Copyright year
2020
Electronic ISBN
978-3-030-56485-8
Print ISBN
978-3-030-56484-1
DOI
https://doi.org/10.1007/978-3-030-56485-8