2013 | Book

Applied Predictive Modeling


About this book

This text is intended for a broad audience, both as an introduction to predictive models and as a guide to applying them. Non-mathematical readers will appreciate the intuitive explanations of the techniques, while the emphasis on problem-solving with real data across a wide variety of applications will aid practitioners who wish to extend their expertise. Readers should have knowledge of basic statistical ideas, such as correlation and linear regression analysis. While the text avoids complex equations, a mathematical background is needed for the advanced topics.

Dr. Kuhn is a Director of Non-Clinical Statistics at Pfizer Global R&D in Groton, Connecticut. He has been applying predictive models in the pharmaceutical and diagnostic industries for over 15 years and is the author of a number of R packages.

Dr. Johnson has more than a decade of statistical consulting and predictive modeling experience in pharmaceutical research and development. He is a co-founder of Arbor Analytics, a firm specializing in predictive modeling, and a former Director of Statistics at Pfizer Global R&D. His scholarly work centers on the application and development of statistical methodology and learning algorithms.

Applied Predictive Modeling covers the overall predictive modeling process, beginning with the crucial steps of data preprocessing, data splitting, and the foundations of model tuning. The text then provides intuitive explanations of numerous common and modern regression and classification techniques, always with an emphasis on illustrating and solving real data problems. Its treatment of practical concerns extends beyond model fitting to topics such as handling class imbalance, selecting predictors, and pinpointing causes of poor model performance, all of which occur frequently in practice.

The text illustrates all parts of the modeling process through many hands-on, real-life examples, and every chapter contains extensive R code for each step of the process. The data sets and corresponding code are available in the book’s companion AppliedPredictiveModeling R package, which is freely available on CRAN.
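For readers who want to follow along, a minimal sketch of getting started with the companion package (object names such as solTrainX follow the package’s documentation and should be verified against it):

# Install the companion package from CRAN, then load one of the
# book's data sets (here, the solubility data used in Chapters 6-9).
install.packages("AppliedPredictiveModeling")
library(AppliedPredictiveModeling)

data(solubility)   # loads solTrainX, solTrainXtrans, solTrainY, solTestX, ...
dim(solTrainX)     # training-set predictors
length(solTrainY)  # numeric outcome: log solubility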

This multi-purpose text can be used as an introduction to predictive models and the overall modeling process, a practitioner’s reference handbook, or a text for advanced undergraduate or graduate-level predictive modeling courses. To that end, each chapter contains problem sets to help solidify the covered concepts, using data available in the book’s R package.

Readers and students interested in implementing the methods should have some basic knowledge of R, and a handful of the more advanced topics require some mathematical knowledge.

Table of Contents

Frontmatter
Chapter 1. Introduction
Abstract
Every day people are faced with questions such as “What route should I take to work today?” “Should I switch to a different cell phone carrier?” “How should I invest my money?” or “Will I get cancer?” These questions indicate our desire to know future events, and we earnestly want to make the best decisions towards that future. In this chapter we explore the contrast between the competing modeling objectives of prediction and interpretation (Section 1.1), outline the foundational components for developing predictive models (Section 1.2) and define common terminology (Section 1.3), and provide summaries of data sets that will be used throughout the book (Section 1.4). The chapter ends with an overview of the four parts of the book (Section 1.5) and the notation used throughout the text (Section 1.6).
Max Kuhn, Kjell Johnson

General Strategies

Frontmatter
Chapter 2. A Short Tour of the Predictive Modeling Process
Abstract
To begin Part I of this work, we present a simple example that illustrates the broad concepts of model building. Section 2.1 provides an overview of a fuel economy data set for which the objective is to predict vehicles' fuel economy based on standard vehicle predictors such as engine displacement, number of cylinders, type of transmission, and manufacturer. In the context of this example, we explain the concepts of “spending” data, estimating model performance, building candidate models, and selecting the optimal model (Section 2.2).
Max Kuhn, Kjell Johnson
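As a hedged sketch of the “spending data” idea in this chapter: the fuel economy data ship with the companion package, and one model year can serve as the test set for another (object and column names such as cars2010 and FE are taken from the package documentation; treat them as assumptions to verify):

library(AppliedPredictiveModeling)

data(FuelEconomy)   # cars2010, cars2011, cars2012, by model year
# "Spend" the data: fit a candidate model on the 2010 cars and estimate
# its performance on the held-out 2011 cars.
fit <- lm(FE ~ EngDispl, data = cars2010)
pred <- predict(fit, cars2011)
sqrt(mean((cars2011$FE - pred)^2))   # test-set RMSE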
Chapter 3. Data Pre-processing
Abstract
Data preprocessing techniques generally refer to the addition, deletion, or transformation of the training set data. Preprocessing data is a crucial step prior to modeling since data preparation can make or break a model’s predictive ability. To illustrate general preprocessing techniques, we begin by introducing a cell segmentation data set (Section 3.1). This data set contains common predictor problems such as skewness, outliers, and missing values. Sections 3.2 and 3.3 review predictor transformations for single predictors and multiple predictors, respectively. In Section 3.4 we discuss several approaches for handling missing data. Other preprocessing steps may include removing (Section 3.5), adding (Section 3.6), or binning (Section 3.7) predictors, all of which must be done carefully so that predictive information is not lost or erroneous information is added to the data. The computing section (3.8) provides R syntax for the previously described preprocessing steps. Exercises are provided at the end of the chapter to solidify concepts.
Max Kuhn, Kjell Johnson
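A minimal sketch of several of these preprocessing steps using the caret package (the toy predictors below are stand-ins, not the cell segmentation data):

library(caret)

# Toy predictors with skewness and very different scales.
set.seed(100)
x <- data.frame(a = rlnorm(50), b = rnorm(50, mean = 500, sd = 80))

# Estimate transformations from the training data only, then apply them:
# Box-Cox to reduce skewness, followed by centering and scaling.
pp <- preProcess(x, method = c("BoxCox", "center", "scale"))
x_trans <- predict(pp, x)

# Screening helpers: degenerate and highly correlated predictors.
nearZeroVar(x_trans)                         # indices of near-zero-variance columns
findCorrelation(cor(x_trans), cutoff = .75)  # candidates for removal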
Chapter 4. Over-Fitting and Model Tuning
Abstract
Many modern classification and regression models are highly adaptable; they are capable of modeling complex relationships. Each model’s adaptability is typically governed by a set of tuning parameters, which can allow the model to pinpoint predictive patterns and structures within the data. However, these tuning parameters can also identify predictive patterns that are not reproducible. This is known as “over-fitting.” Models that are over-fit generally have excellent predictivity for the samples on which they were built, but poor predictivity for new samples. Without a methodological approach to building and evaluating models, the modeler will not know whether the model is over-fit until the next set of samples is predicted. In Section 4.1 we use a simple example to illustrate the problem of over-fitting. We then describe a systematic process for tuning models (Section 4.2), which is foundational to the remaining parts of the book. Core to model tuning are appropriate ways of splitting (or spending) the data, which is covered in Section 4.3. Resampling techniques (Section 4.4) are an alternative or complementary approach to data splitting. Recommendations for approaches to data splitting are provided in Section 4.7. After evaluating a number of tuning parameters via data splitting or resampling, we must choose the final tuning parameters (Section 4.6). We also discuss how to choose the optimal model across several tuned models (Section 4.8). We illustrate how to implement the recommended techniques discussed in this chapter in the Computing Section (4.9). Exercises are provided at the end of the chapter to solidify concepts.
Max Kuhn, Kjell Johnson
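A minimal sketch of tuning via resampling with caret (simulated data from caret’s twoClassSim(); the model and grid size are arbitrary illustrative choices):

library(caret)

set.seed(1)
dat <- twoClassSim(200)   # simulated two-class data

# Repeated 10-fold cross-validation estimates performance for each
# candidate value of the tuning parameter.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

set.seed(2)
knn_fit <- train(Class ~ ., data = dat,
                 method = "knn",    # tuning parameter: k neighbors
                 tuneLength = 10,   # evaluate 10 values of k
                 preProcess = c("center", "scale"),
                 trControl = ctrl)
knn_fit   # resampled accuracy profile; the final k is chosen from this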

Regression Models

Frontmatter
Chapter 5. Measuring Performance in Regression Models
Abstract
When predicting a numeric outcome, some measure of accuracy is typically used to evaluate the model’s effectiveness. However, there are different ways to measure accuracy, each with its own nuance. In Section 5.1 we define common measures for evaluating quantitative performance. We also discuss the concept of variance-bias trade-off (Section 5.2), and the implication of this principle for predictive modeling. In Section 5.3, we demonstrate how measures of predictive performance can be generated in R.
Max Kuhn, Kjell Johnson
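A minimal sketch of computing these measures with caret (the observed and predicted values below are hypothetical):

library(caret)

# Hypothetical observed outcomes and model predictions for a test set.
observed  <- c(0.22, 0.83, -0.12, 0.89, -0.23, -1.30, -0.15, -1.46)
predicted <- c(0.24, 0.78, -0.66, 0.53, 0.70, -0.75, -0.41, -0.43)

postResample(pred = predicted, obs = observed)   # RMSE, R^2, and MAE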
Chapter 6. Linear Regression and Its Cousins
Abstract
In this chapter we discuss several models, all of which are akin to linear regression in that each can directly or indirectly be written in the widely known multiple linear regression form. We begin this chapter by describing a chemistry case study data set (Section 6.1), which will be used to illustrate models throughout this chapter as well as in Chapters 7-9. As a foundational model, we discuss ordinary linear regression (Section 6.2). Section 6.3 defines and illustrates partial least squares and its algorithmic and computational variations. Penalized models such as ridge regression, the lasso, and the elastic net are presented in Section 6.4. In the Computing Section (6.5) we demonstrate how to train each of these models in R. Finally, exercises are provided at the end of the chapter to solidify the concepts.
Max Kuhn, Kjell Johnson
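A minimal sketch of fitting two of these models with caret on the solubility data (object names such as solTrainXtrans follow the companion package’s documentation; treat them as assumptions to verify):

library(AppliedPredictiveModeling)
library(caret)

data(solubility)
ctrl <- trainControl(method = "cv", number = 10)

# Partial least squares: tune the number of components.
set.seed(100)
pls_fit <- train(x = solTrainXtrans, y = solTrainY,
                 method = "pls", tuneLength = 20,
                 preProcess = c("center", "scale"), trControl = ctrl)

# Elastic net via the glmnet engine: tunes the mixing and penalty values.
set.seed(100)
enet_fit <- train(x = solTrainXtrans, y = solTrainY,
                  method = "glmnet", tuneLength = 5,
                  preProcess = c("center", "scale"), trControl = ctrl)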
Chapter 7. Nonlinear Regression Models
Abstract
Chapter 6 discussed regression models that were intrinsically linear. In this chapter we present regression models that are inherently nonlinear. When using these models, the exact form of the nonlinearity does not need to be known explicitly or specified prior to model training. These models include neural networks (Section 7.1), multivariate adaptive regression splines (Section 7.2), support vector machines (Section 7.3), and K-nearest neighbors (Section 7.4). In the Computing Section (7.5) we demonstrate how to train each of these models in R. Finally, exercises are provided at the end of the chapter to solidify the concepts.
Max Kuhn, Kjell Johnson
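A minimal sketch of two of these models on a simple simulated nonlinear signal (the data and tuning choices are illustrative only):

library(caret)

# A nonlinear signal: y = sin(x) plus noise.
set.seed(1)
x <- data.frame(x = runif(200, 0, 2 * pi))
y <- sin(x$x) + rnorm(200, sd = 0.2)

ctrl <- trainControl(method = "cv", number = 10)

# MARS (earth package): tunes the degree and number of retained terms.
mars_fit <- train(x, y, method = "earth", tuneLength = 10, trControl = ctrl)

# Radial-basis SVM (kernlab): tunes cost; sigma is estimated analytically.
svm_fit <- train(x, y, method = "svmRadial", tuneLength = 8,
                 preProcess = c("center", "scale"), trControl = ctrl)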
Chapter 8. Regression Trees and Rule-Based Models
Abstract
Tree-based models consist of one or more nested if-then statements for the predictors that partition the data. Within these partitions, a model is used to predict the outcome. Regression trees and regression model trees are basic partitioning models and are covered in Sections 8.1 and 8.2, respectively. In Section 8.3, we present rule-based models, which are models governed by if-then conditions (possibly created by a tree) that have been collapsed into independent conditions. Rules can be simplified or pruned such that samples may be covered by multiple rules. Ensemble methods combine many trees (or rule-based models) into one model and tend to have much better predictive performance than a single tree- or rule-based model. Popular ensemble techniques are bagging (Section 8.4), random forests (Section 8.5), boosting (Section 8.6), and Cubist (Section 8.7). In the Computing Section (8.8), we demonstrate how to train each of these models in R. Finally, exercises are provided at the end of the chapter to solidify the concepts.
Max Kuhn, Kjell Johnson
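A minimal sketch of a single tree and one ensemble on the solubility data (object names per the companion package; random forests with cross-validation can be slow on this many predictors):

library(caret)

data(solubility, package = "AppliedPredictiveModeling")
ctrl <- trainControl(method = "cv", number = 10)

# A single CART regression tree, tuned over the complexity parameter.
set.seed(100)
cart_fit <- train(x = solTrainXtrans, y = solTrainY,
                  method = "rpart", tuneLength = 10, trControl = ctrl)

# A random forest ensemble, tuned over mtry (randomForest engine).
set.seed(100)
rf_fit <- train(x = solTrainXtrans, y = solTrainY,
                method = "rf", tuneLength = 4, ntree = 500,
                trControl = ctrl)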
Chapter 9. A Summary of Solubility Models
Abstract
In Chapters 6-8, we developed a number of models to predict compounds’ solubility. In this chapter we compare and contrast the models’ performance and demonstrate how to select the optimal final model.
Max Kuhn, Kjell Johnson
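A minimal sketch of this comparison with caret’s resamples(), assuming pls_fit, mars_fit, and rf_fit are train() objects built with identical resampling indices (for example, the same seed and trControl as in the sketches above):

library(caret)

# Collect resampled performance from models fit on the same folds.
resamp <- resamples(list(PLS = pls_fit, MARS = mars_fit, RF = rf_fit))
summary(resamp)                    # RMSE and R^2 distributions per model
summary(diff(resamp))              # paired differences between models
dotplot(resamp, metric = "RMSE")   # visual comparison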
Chapter 10. Case Study: Compressive Strength of Concrete Mixtures
Abstract
The data set used in Chapters 6-9 to illustrate the model building process was based on observational data: the samples were selected from a predefined population and the predictors and response were observed. The case study in this chapter explains the model building process for data that emanate from a designed experiment. In a designed experiment, the predictors and their desired values are prespecified. The specific combinations of the predictor values are also prespecified, which determine the samples that will be collected for the data set. The experiment is then conducted and the response is observed. In the context of model building for a designed experiment, we present a strategy (Section 10.1), recommendations for evaluating model performance (Section 10.2), an approach for identifying predictor combinations that produce an optimal response (Section 10.3), and syntax for building and evaluating models for this illustration (Section 10.4).
Max Kuhn, Kjell Johnson

Classification Models

Frontmatter
Chapter 11. Measuring Performance in Classification Models
Abstract
When predicting a categorical outcome, some measure of classification accuracy is typically used to evaluate the model’s effectiveness. However, there are different ways to measure classification accuracy, depending on the modeler’s primary objectives. Most classification models can produce both a continuous and categorical prediction output. In Section 11.1, we review these outputs, demonstrate how to adjust probabilities based on calibration plots, recommend ways for displaying class predictions, and define equivocal or indeterminate zones of prediction. In Section 11.2, we review common metrics for assessing classification predictions such as accuracy, kappa, sensitivity, specificity, and positive and negative predictive values. This section also addresses model evaluation when costs are applied to making false positive or false negative mistakes. Classification models may also produce predicted classification probabilities. Evaluating this type of output is addressed in Section 11.3, and includes a discussion of receiver operating characteristic curves as well as lift charts. In Section 11.4, we demonstrate how measures of classification performance can be generated in R.
Max Kuhn, Kjell Johnson
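A minimal sketch of these measures on hypothetical two-class output, using caret’s confusionMatrix() and the pROC package for the ROC curve:

library(caret)
library(pROC)

# Hypothetical observed classes and class probabilities.
set.seed(1)
obs  <- factor(sample(c("event", "no_event"), 100, replace = TRUE,
                      prob = c(.3, .7)), levels = c("event", "no_event"))
prob <- ifelse(obs == "event", rbeta(100, 4, 2), rbeta(100, 2, 4))
pred <- factor(ifelse(prob > .5, "event", "no_event"),
               levels = c("event", "no_event"))

# Accuracy, kappa, sensitivity, specificity, and predictive values.
confusionMatrix(data = pred, reference = obs, positive = "event")

# ROC curve and its area; levels gives controls first, then cases.
roc_obj <- roc(response = obs, predictor = prob,
               levels = c("no_event", "event"))
auc(roc_obj)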
Chapter 12. Discriminant Analysis and Other Linear Classification Models
Abstract
In this chapter we discuss models that classify samples using linear classification boundaries. We begin this chapter by describing a grant applications case study data set (Section 12.1), which will be used to illustrate models throughout this chapter as well as in Chapters 13-15. As foundational models, we discuss logistic regression (Section 12.2) and linear discriminant analysis (Section 12.3). In Section 12.4 we define and illustrate partial least squares discriminant analysis and its fundamental connection to linear discriminant analysis. Penalized models, such as logistic regression with a ridge penalty, glmnet, and penalized linear discriminant analysis, are discussed in Section 12.5. Nearest shrunken centroids, an approach tailored towards high-dimensional data, is presented in Section 12.6. We demonstrate in the Computing Section (12.7) how to train each of these models in R. Finally, exercises are provided at the end of the chapter to solidify the concepts.
Max Kuhn, Kjell Johnson
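A minimal sketch of two of these linear classifiers with caret (simulated data; the ROC metric requires class probabilities):

library(caret)

set.seed(1)
dat <- twoClassSim(300)

ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# Logistic regression as the baseline linear classifier.
set.seed(2)
lr_fit <- train(Class ~ ., data = dat, method = "glm",
                metric = "ROC", trControl = ctrl)

# Penalized logistic regression (glmnet): tunes the mixing and penalty.
set.seed(2)
glmnet_fit <- train(Class ~ ., data = dat, method = "glmnet",
                    metric = "ROC", tuneLength = 5,
                    preProcess = c("center", "scale"), trControl = ctrl)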
Chapter 13. Nonlinear Classification Models
Abstract
Chapter 12 discussed classification models that defined linear classification boundaries. In this chapter we present models that generate nonlinear boundaries. We begin by explaining several generalizations of the linear discriminant analysis framework, such as quadratic discriminant analysis, regularized discriminant analysis, and mixture discriminant analysis (Section 13.1). Other nonlinear classification models include neural networks (Section 13.2), flexible discriminant analysis (Section 13.3), support vector machines (Section 13.4), K-nearest neighbors (Section 13.5), and naive Bayes (Section 13.6). In the Computing Section (13.7) we demonstrate how to train each of these models in R. Finally, exercises are provided at the end of the chapter to solidify the concepts.
Max Kuhn, Kjell Johnson
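A minimal sketch of two nonlinear classifiers (QDA via the MASS package, naive Bayes via caret’s klaR-backed method; the data are simulated):

library(caret)
library(MASS)

set.seed(1)
dat <- twoClassSim(300)

# Quadratic discriminant analysis: class-specific covariance matrices
# yield curved decision boundaries.
qda_fit <- qda(Class ~ ., data = dat)

# Naive Bayes (klaR engine), tuned by cross-validation.
ctrl <- trainControl(method = "cv", number = 10)
set.seed(2)
nb_fit <- train(Class ~ ., data = dat, method = "nb", trControl = ctrl)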
Chapter 14. Classification Trees and Rule-Based Models
Abstract
Classification trees fall within the family of tree-based models and, similar to regression trees (Chapter 8), consist of nested if-then statements. Classification trees and rules are basic partitioning models and are covered in Sections 14.1 and 14.2, respectively. Ensemble methods combine many trees (or rules) into one model and tend to have much better predictive performance than a single tree- or rule-based model. Popular ensemble techniques are bagging (Section 14.3), random forests (Section 14.4), boosting (Section 14.5), and C5.0 (Section 14.6). In Section 14.7 we compare the model results from two different encodings for the categorical predictors. Then in Section 14.8, we demonstrate how to train each of these models in R. Finally, exercises are provided at the end of the chapter to solidify the concepts.
Max Kuhn, Kjell Johnson
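A minimal sketch of a single classification tree and a C5.0 ensemble with caret (simulated data; tuning ranges are illustrative):

library(caret)

set.seed(1)
dat <- twoClassSim(300)
ctrl <- trainControl(method = "cv", number = 10)

# A single classification tree, tuned over the complexity parameter cp.
set.seed(2)
cart_fit <- train(Class ~ ., data = dat, method = "rpart",
                  tuneLength = 10, trControl = ctrl)

# C5.0: tunes boosting iterations, model type (tree/rules), and winnowing.
set.seed(2)
c50_fit <- train(Class ~ ., data = dat, method = "C5.0",
                 tuneLength = 5, trControl = ctrl)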
Chapter 15. A Summary of Grant Application Models
Abstract
Chapters 12-14 used a variety of philosophies and techniques to predict grant-funding success. In this chapter we compare and contrast the models’ performance on a specific test set and demonstrate how to select the optimal final model.
Max Kuhn, Kjell Johnson
Chapter 16. Remedies for Severe Class Imbalance
Abstract
When modeling discrete classes, the relative frequencies of the classes can have a significant impact on the effectiveness of the model. An imbalance occurs when one or more classes have very low proportions in the training data as compared to the other classes. Imbalance can be present in any data set or application, and hence, the practitioner should be aware of the implications of modeling this type of data. To illustrate the impacts of and remedies for severe class imbalance, we present a case study (Section 16.1) and examine the impact of class imbalance on performance measures (Section 16.2). Sections 16.3-16.6 describe approaches for handling imbalance using the existing data, such as maximizing minority class accuracy, adjusting classification cut-offs or prior probabilities, or adjusting sample weights prior to model tuning. Handling imbalance can also be done through sophisticated up- or down-sampling methods (Section 16.7) or by applying costs to the classification errors (Section 16.8). In the Computing Section (16.9) we demonstrate how to implement these remedies in R. Finally, exercises are provided at the end of the chapter to solidify the concepts.
Max Kuhn, Kjell Johnson
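A minimal sketch of caret’s sampling remedies (shifting the simulation intercept below is just one way to manufacture imbalance):

library(caret)

# Simulate an imbalanced two-class data set by shifting the intercept.
set.seed(1)
dat <- twoClassSim(1000, intercept = -12)
table(dat$Class)   # one class should be much rarer

predictors <- dat[, names(dat) != "Class"]

# Balance the training data before tuning: drop majority-class samples
# (down-sampling) or replicate minority-class samples (up-sampling).
down_dat <- downSample(x = predictors, y = dat$Class, yname = "Class")
up_dat   <- upSample(x = predictors, y = dat$Class, yname = "Class")
table(down_dat$Class)
table(up_dat$Class)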
Chapter 17. Case Study: Job Scheduling
Abstract
High-performance computing (HPC) environments are used by many technology and research organizations to facilitate large-scale computations. HPC systems typically use job scheduling software that prioritizes jobs for submission, manages the computational resources, and initiates submitted jobs to maximize efficiency. To assist the scheduler, data on execution times were collected and are used to classify new jobs into one of four classes (very fast, fast, moderate, or long). In this chapter we illustrate the model tuning and evaluation process in this context. Here we present the data splitting and modeling strategy (Section 17.1), the model results (Section 17.2), and the corresponding computing code (Section 17.3).
Max Kuhn, Kjell Johnson

Other Considerations

Frontmatter
Chapter 18. Measuring Predictor Importance
Abstract
Many of the predictive models discussed in previous chapters have built-in or intrinsic measurements of predictor importance. For example, multivariate adaptive regression splines and many tree-based models monitor the increase in performance that occurs when adding each predictor to the model. Others, such as linear or logistic regression, can use quantifications based on the model coefficients or statistical measures. The methodologies discussed in this chapter are not specific to any predictive model and can be used with numeric (Section 18.1) or categorical (Section 18.2) outcomes. Other modern importance algorithms, such as Relief and MIC, are presented in Section 18.3. In the Computing Section (18.4) we demonstrate how to implement these methods in R. Finally, exercises are provided at the end of the chapter to solidify the concepts.
Max Kuhn, Kjell Johnson
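A minimal sketch of model-free importance scores with caret’s filterVarImp() (simulated data; for a two-class outcome the scores are per-predictor ROC AUCs, and the returned column names are assumed to match the class levels):

library(caret)

set.seed(1)
dat <- twoClassSim(300)

# Model-free, one-predictor-at-a-time importance for a categorical
# outcome: each predictor is scored by the area under its ROC curve.
roc_imp <- filterVarImp(x = dat[, names(dat) != "Class"], y = dat$Class)
head(roc_imp[order(-roc_imp$Class1), ])

# For a fitted train() object, varImp(fit) extracts the model's own
# intrinsic importance measure instead.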
Chapter 19. An Introduction to Feature Selection
Abstract
Determining which predictors should be included in a model is becoming one of the most critical questions as data become increasingly high-dimensional. This chapter demonstrates the negative effect of extra predictors on a number of models (Section 19.1) and discusses typical approaches to supervised feature selection, such as wrapper and filter methods (Sections 19.2-19.4). The modeler should also be aware of the danger of selection bias and how to avoid it (Section 19.5). In Section 19.6 we present a case study to illustrate the feature selection methods. In the Computing Section (19.7) we demonstrate how to implement feature selection methodologies in R. Finally, exercises are provided at the end of the chapter to solidify the concepts.
Max Kuhn, Kjell Johnson
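A minimal sketch of one wrapper method, recursive feature elimination with random forests, where the selection step is nested inside resampling to guard against selection bias:

library(caret)

# Simulated data padded with 30 uninformative predictors.
set.seed(1)
dat <- twoClassSim(200, noiseVars = 30)
x <- dat[, names(dat) != "Class"]
y <- dat$Class

# Nest the elimination step inside 10-fold cross-validation.
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
set.seed(2)
rfe_fit <- rfe(x, y, sizes = c(5, 10, 15, 20), rfeControl = ctrl)
rfe_fit$optVariables   # predictors retained at the chosen subset size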
Chapter 20. Factors That Can Affect Model Performance
Abstract
Several of the preceding chapters have focused on technical pitfalls of predictive models, such as over-fitting and class imbalances. Often, true success may depend on aspects of the problem that are not directly related to the model itself. This chapter discusses topics such as Type III errors (answering the wrong question, Section 20.1), the effect of unwanted noise in the response (Section 20.2) and in the predictors (Section 20.3), the impact of discretizing continuous outcomes (Section 20.4), extrapolation (Section 20.5), and the impact of a large number of samples (Section 20.6). In the Computing Section (20.7) we illustrate the implementation of an algorithm for determining samples’ similarity to the training set. Finally, exercises are provided at the end of the chapter to solidify the concepts.
Max Kuhn, Kjell Johnson
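As a hedged sketch of the extrapolation idea (one simple similarity heuristic, not necessarily the algorithm used in the chapter): project new samples onto the training set’s principal components and compare their distances to the training-set center against the distribution of training distances.

library(caret)

set.seed(1)
train_x <- as.data.frame(matrix(rnorm(200 * 5), ncol = 5))
new_x   <- as.data.frame(matrix(rnorm(20 * 5, mean = 3), ncol = 5))
names(new_x) <- names(train_x)

# PCA estimated on the training set only, then applied to new samples.
pp <- preProcess(train_x, method = c("center", "scale", "pca"))
train_scores <- predict(pp, train_x)
new_scores   <- predict(pp, new_x)

dist_to_center <- function(scores) sqrt(rowSums(scores^2))
ref <- dist_to_center(train_scores)

# New samples whose distances fall far in the upper tail of the training
# distribution are likely extrapolations.
ecdf(ref)(dist_to_center(new_scores))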
Backmatter
Metadata
Title
Applied Predictive Modeling
Authors
Max Kuhn
Kjell Johnson
Copyright Year
2013
Publisher
Springer New York
Electronic ISBN
978-1-4614-6849-3
Print ISBN
978-1-4614-6848-6
DOI
https://doi.org/10.1007/978-1-4614-6849-3
