
2015 | Book

Regression Modeling Strategies

With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis


About this book

This highly anticipated second edition features new chapters and sections, 225 new references, and comprehensive R software. In keeping with the previous edition, this book is about the art and science of data analysis and predictive modeling, which entails choosing and using multiple tools. Instead of presenting isolated techniques, this text emphasizes problem solving strategies that address the many issues arising when developing multivariable models using real data and not standard textbook examples. It includes imputation methods for dealing with missing data effectively, methods for fitting nonlinear relationships and for making the estimation of transformations a formal part of the modeling process, methods for dealing with "too many variables to analyze and not enough observations," and powerful model validation techniques based on the bootstrap. The reader will gain a keen understanding of predictive accuracy and the harm of categorizing continuous predictors or outcomes. This text realistically deals with model uncertainty and its effects on inference, to achieve "safe data mining." It also presents many graphical methods for communicating complex regression models to non-statisticians.

Regression Modeling Strategies presents full-scale case studies of non-trivial datasets instead of over-simplified illustrations of each method. These case studies use freely available R functions that make the multiple imputation, model building, validation, and interpretation tasks described in the book relatively easy to do. Most of the methods in this text apply to all regression models, but special emphasis is given to multiple regression using generalized least squares for longitudinal data, the binary logistic model, models for ordinal responses, parametric survival regression models, and the Cox semiparametric survival model. A new emphasis is given to the robust analysis of continuous dependent variables using ordinal regression.

As in the first edition, this text is intended for master's- or Ph.D.-level graduate students who have had a general introductory probability and statistics course and who are well versed in ordinary multiple regression and intermediate algebra. The book will also serve as a reference for data analysts and statistical methodologists, as it contains an up-to-date survey and bibliography of modern statistical modeling techniques. Examples used in the text mostly come from biomedical research, but the methods are applicable anywhere predictive models ("analytics") are useful, including economics, epidemiology, sociology, psychology, engineering, and marketing.

Table of contents

Frontmatter
Chapter 1. Introduction
Abstract
Statistics comprises, among other areas, study design, hypothesis testing, estimation, and prediction. This text aims at the last area, by presenting methods that enable an analyst to develop models that will make accurate predictions of responses for future observations. Prediction could be considered a superset of hypothesis testing and estimation, so the methods presented here will also assist the analyst in those areas. It is worth pausing to explain how this is so.
Frank E. Harrell Jr.
Chapter 2. General Aspects of Fitting Regression Models
Abstract
The ordinary multiple linear regression model is frequently used and has parameters that are easily interpreted. In this chapter we study a general class of regression models, those stated in terms of a weighted sum of a set of independent or predictor variables. It is shown that after linearizing the model with respect to the predictor variables, the parameters in such regression models are also readily interpreted. Also, all the designs used in ordinary linear regression can be used in this general setting. These designs include analysis of variance (ANOVA) setups, interaction effects, and nonlinear effects. Besides describing and interpreting general regression models, this chapter also describes, in general terms, how the three types of assumptions of regression models can be examined.
Frank E. Harrell Jr.
Chapter 3. Missing Data
Abstract
There are missing data in the majority of datasets one is likely to encounter. Before discussing some of the problems of analyzing data in which some variables are missing for some subjects, we define some nomenclature.
Frank E. Harrell Jr.
Chapter 4. Multivariable Modeling Strategies
Abstract
Chapter 2 dealt with aspects of modeling such as transformations of predictors, relaxing linearity assumptions, modeling interactions, and examining lack of fit. Chapter 3 dealt with missing data, focusing on utilization of incomplete predictor information. All of these areas are important in the overall scheme of model development, and they cannot be separated from what is to follow. In this chapter we concern ourselves with issues related to the whole model, with emphasis on deciding on the amount of complexity to allow in the model and on dealing with large numbers of predictors. The chapter concludes with three default modeling strategies depending on whether the goal is prediction, estimation, or hypothesis testing.
Frank E. Harrell Jr.
Chapter 5. Describing, Resampling, Validating, and Simplifying the Model
Abstract
Before addressing issues related to describing and interpreting the model and its coefficients, a caution: one can never be too careful in attempting to interpret results in a causal manner. Regression models are excellent tools for estimating and inferring associations between an X and Y, given that the "right" variables are in the model. Any ability of a model to provide causal inference rests entirely on the analyst's faith in the experimental design, the completeness of the set of variables thought to measure confounding and used for adjustment when the experiment is not randomized, the absence of important measurement error, and, lastly, the goodness of fit of the model.
Frank E. Harrell Jr.
Chapter 6. R Software
Abstract
The methods described in this book are useful in any regression model that involves a linear combination of regression parameters. The software that is described below is useful in the same situations. Functions in R [520] allow interaction spline functions as well as a wide variety of predictor parameterizations for any regression function, and facilitate model validation by resampling.
R is the most comprehensive tool for general regression models for the following reasons.
Frank E. Harrell Jr.
Chapter 7. Modeling Longitudinal Responses using Generalized Least Squares
Abstract
In this chapter we consider models for a multivariate response variable represented by serial measurements over time within subject. This setup induces correlations between measurements on the same subject that must be taken into account to have optimal model fits and honest inference. Full likelihood model-based approaches have advantages including (1) optimal handling of imbalanced data and (2) robustness to missing data (dropouts) that are not missing completely at random. The three most popular model-based full likelihood approaches are mixed effects models, generalized least squares, and Bayesian hierarchical models.
Frank E. Harrell Jr.
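The generalized least squares estimator mentioned in the Chapter 7 abstract has the closed form β̂ = (X′Ω⁻¹X)⁻¹X′Ω⁻¹y, where Ω is the assumed within-subject covariance matrix. A minimal Python sketch with a made-up AR(1) working covariance (the book's own software is R; all data and parameter values here are simulated purely for illustration):

```python
import numpy as np

# GLS estimator: beta = (X' W X)^{-1} X' W y, with W the inverse of the
# assumed covariance matrix of the errors.
def gls_fit(X, y, cov):
    W = np.linalg.inv(cov)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

def ar1_cov(n, rho, sigma2=1.0):
    # AR(1) working covariance for serial measurements: sigma^2 * rho^|i-j|
    idx = np.arange(n)
    return sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])

# Simulated data (hypothetical): straight-line trend plus noise.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), np.linspace(0, 1, n)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

beta_gls = gls_fit(X, y, ar1_cov(n, rho=0.5))
```

When Ω is the identity matrix, the estimator reduces to ordinary least squares.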
Chapter 8. Case Study in Data Reduction
Abstract
Recall that the aim of data reduction is to reduce (without using the outcome) the number of parameters needed in the outcome model.
Frank E. Harrell Jr.
Chapter 9. Overview of Maximum Likelihood Estimation
Abstract
In ordinary least squares multiple regression, the objective in fitting a model is to find the values of the unknown parameters that minimize the sum of squared errors of prediction. When the response variable is non-normal, polytomous, or not observed completely, one needs a more general objective function to optimize.
Frank E. Harrell Jr.
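As a toy illustration of optimizing a more general objective function than the sum of squared errors, the sketch below applies Newton-Raphson to the log-likelihood of an exponential model, whose maximizer has the closed form n/Σtᵢ against which the iteration can be checked (hypothetical data; Python rather than the book's R):

```python
# Newton-Raphson maximization of an exponential-model log-likelihood:
#   l(lam) = n*log(lam) - lam*sum(t)
#   score U(lam) = n/lam - sum(t);  information I(lam) = n/lam^2
def exp_mle_newton(t, lam=1.0, tol=1e-10, max_iter=100):
    n, s = len(t), sum(t)
    for _ in range(max_iter):
        step = (n / lam - s) / (n / lam**2)  # U / I
        lam += step
        if abs(step) < tol:
            break
    return lam

t = [0.5, 1.2, 0.8, 2.0, 1.5]   # made-up observations
lam_hat = exp_mle_newton(t)     # closed-form answer: len(t) / sum(t)
```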
Chapter 10. Binary Logistic Regression
Abstract
Binary responses are commonly studied in many fields. Examples include the presence or absence of a particular disease, death during surgery, or a consumer purchasing a product. Often one wishes to study how a set of predictor variables X is related to a dichotomous response variable Y. The predictors may describe such quantities as treatment assignment, dosage, risk factors, and calendar time. For convenience we define the response to be Y = 0 or 1, with Y = 1 denoting the occurrence of the event of interest. Often a dichotomous outcome can be studied by calculating certain proportions, for example, the proportion of deaths among females and the proportion among males. However, in many situations, there are multiple descriptors, or one or more of the descriptors are continuous. Without a statistical model, studying patterns such as the relationship between age and occurrence of a disease, for example, would require the creation of arbitrary age groups to allow estimation of disease prevalence as a function of age.
Frank E. Harrell Jr.
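The binary logistic model referred to above sets Pr(Y = 1 | X) = 1/(1 + exp(−Xβ)). A small Python sketch fits it by Newton-Raphson (iteratively reweighted least squares) on made-up data; this illustrates only the model's form, not the book's R-based workflow:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Newton-Raphson (IRLS) for the binary logistic model
# Pr(Y = 1 | X) = sigmoid(X beta).
def logistic_fit(X, y, iters=25):
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ beta)
        W = p * (1 - p)                      # variance weights
        grad = X.T @ (y - p)                 # score vector
        H = X.T @ (X * W[:, None])           # observed information
        beta += np.linalg.solve(H, grad)
    return beta

# Hypothetical data: probability of an event rises with age.
age = np.array([30, 35, 40, 45, 50, 55, 60, 65, 70, 75], float)
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1], float)
X = np.column_stack([np.ones_like(age), (age - age.mean()) / age.std()])
beta = logistic_fit(X, y)
```

Because the model has an intercept, the fitted probabilities average exactly to the observed event proportion at convergence.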
Chapter 11. Case Study in Binary Logistic Regression, Model Selection and Approximation: Predicting Cause of Death
Abstract
This chapter contains a case study on developing, describing, and validating a binary logistic regression model. In addition, the following methods are exemplified:
Frank E. Harrell Jr.
Chapter 12. Logistic Model Case Study 2: Survival of Titanic Passengers
Abstract
This case study demonstrates the development of a binary logistic regression model to describe patterns of survival in passengers on the Titanic, based on passenger age, sex, ticket class, and the number of family members accompanying each passenger. Nonparametric regression is also used. Since many of the passengers had missing ages, multiple imputation is used so that the complete information on the other variables can be efficiently utilized. Titanic passenger data were gathered by many researchers. Primary references are the Encyclopedia Titanica at www.encyclopedia-titanica.org and Eaton and Haas [169]. Titanic survival patterns have been analyzed previously [151, 296, 571], but without incorporation of individual passenger ages. Thomas Cason, while a University of Virginia student, compiled and interpreted the data from the World Wide Web. One thousand three hundred nine of the passengers are represented in the dataset, which is available from this text's Web site under the name titanic3. An early analysis of Titanic data may be found in Bron [75].
Frank E. Harrell Jr.
Chapter 13. Ordinal Logistic Regression
Abstract
Many medical and epidemiologic studies incorporate an ordinal response variable. In some cases an ordinal response Y represents levels of a standard measurement scale such as severity of pain (none, mild, moderate, severe). In other cases, ordinal responses are constructed by specifying a hierarchy of separate endpoints. For example, clinicians may specify an ordering of the severity of several component events and assign patients to the worst event present from among none, heart attack, disabling stroke, and death. Still another use of ordinal response methods is the application of rank-based methods to continuous responses so as to obtain robust inferences. For example, the proportional odds model described later allows for a continuous Y and is really a generalization of the Wilcoxon–Mann–Whitney rank test. Thus the semiparametric proportional odds model is a direct competitor of ordinary linear models.
Frank E. Harrell Jr.
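The connection to the Wilcoxon–Mann–Whitney test can be made concrete: the Wilcoxon statistic estimates the concordance probability c = Pr(a randomly chosen response from one group exceeds one from the other), the same quantity a two-group proportional odds model targets through its odds ratio. A minimal Python sketch on made-up data (ties count as 1/2):

```python
# Concordance probability c = Pr(Y1 > Y0), with ties counted as 1/2.
# This is the quantity behind the Wilcoxon-Mann-Whitney rank test.
def concordance(y0, y1):
    wins = sum((a > b) + 0.5 * (a == b) for a in y1 for b in y0)
    return wins / (len(y0) * len(y1))

y0 = [3.1, 2.4, 4.0, 3.3]   # hypothetical group 0 responses
y1 = [4.2, 5.1, 3.9, 4.8]   # hypothetical group 1 responses
c = concordance(y0, y1)      # 15 of 16 pairs are concordant -> 0.9375
```

A value of c = 0.5 corresponds to no group difference, matching a proportional odds ratio of 1.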
Chapter 14. Case Study in Ordinal Regression, Data Reduction, and Penalization
Abstract
This case study is taken from Harrell et al. [272], which described a World Health Organization study [439] in which vital signs and a large number of clinical signs and symptoms were used to develop a predictive model for an ordinal response. This response consists of laboratory assessments of diagnosis and severity of illness related to pneumonia, meningitis, and sepsis.
Frank E. Harrell Jr.
Chapter 15. Regression Models for Continuous Y and Case Study in Ordinal Regression
Abstract
This chapter concerns univariate continuous Y. There are many multivariable models for predicting such response variables, such as
Frank E. Harrell Jr.
Chapter 16. Transform-Both-Sides Regression
Abstract
Fitting multiple regression models by the method of least squares is one of the most commonly used methods in statistics. There are a number of challenges to the use of least squares, even when it is only used for estimation and not inference, including the following.
1.
How should continuous predictors be transformed so as to get a good fit?
 
2.
Is it better to transform the response variable? How does one find a good transformation that simplifies the right-hand side of the equation?
 
3.
What if Y needs to be transformed non-monotonically (e.g., | Y − 100 | ) before it will have any correlation with X?
 
Frank E. Harrell Jr.
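One classical answer to questions 1 and 2 is the Box–Cox family, in which a transformation parameter λ is estimated as a formal part of the fit by maximizing a profile log-likelihood. A rough Python sketch on simulated data whose true λ is 0 (a log transformation); this illustrates the idea only and is not the book's implementation:

```python
import numpy as np

# Box-Cox transformation: g_lam(y) = (y^lam - 1)/lam, with g_0(y) = log(y).
def boxcox(y, lam):
    return np.log(y) if abs(lam) < 1e-12 else (y**lam - 1) / lam

# Profile log-likelihood for the model g_lam(y) = a + b*x + error.
def profile_loglik(x, y, lam):
    z = boxcox(y, lam)
    X = np.column_stack([np.ones_like(x), x])
    resid = z - X @ np.linalg.lstsq(X, z, rcond=None)[0]
    n = len(y)
    # The (lam - 1) * sum(log y) Jacobian term makes lambdas comparable.
    return -n / 2 * np.log(resid @ resid / n) + (lam - 1) * np.log(y).sum()

# Simulated data with true lambda = 0: log(y) is linear in x.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 60)
y = np.exp(1 + 2 * x + rng.normal(scale=0.05, size=60))

grid = np.linspace(-1, 1, 81)
lam_best = grid[np.argmax([profile_loglik(x, y, l) for l in grid])]
```

With low-noise data the profile peaks sharply near the true λ; a likelihood-ratio interval for λ can be read off the same profile.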
Chapter 17. Introduction to Survival Analysis
Abstract
Suppose that one wished to study the occurrence of some event in a population of subjects. If the time until the occurrence of the event were unimportant, the event could be analyzed as a binary outcome using the logistic regression model. For example, in analyzing mortality associated with open heart surgery, it may not matter whether a patient dies during the procedure or dies after being in a coma for two months. For other outcomes, especially those concerned with chronic conditions, the time until the event is important. In a study of emphysema, death at eight years after onset of symptoms is different from death at six months. An analysis that simply counted the number of deaths would be discarding valuable information and sacrificing statistical power.
Frank E. Harrell Jr.
Chapter 18. Parametric Survival Models
Abstract
The nonparametric estimator of S(t) is a very good descriptive statistic for displaying survival data. For many purposes, however, one may want to make more assumptions to allow the data to be modeled in more detail. By specifying a functional form for S(t) and estimating any unknown parameters in this function, one can
1.
easily compute selected quantiles of the survival distribution;
 
2.
estimate (usually by extrapolation) the expected failure time;
 
3.
derive a concise equation and smooth function for estimating S(t), Λ(t), and λ(t); and
 
4.
estimate S(t) more precisely than S_KM(t) or S_Λ(t) if the parametric form is correctly specified.
 
Frank E. Harrell Jr.
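For a concrete instance of points 1 and 3 above, a Weibull form S(t) = exp(−(t/α)^γ) yields closed-form quantiles and a smooth hazard function. A brief Python sketch (the parameter values are arbitrary illustrations):

```python
import math

# Weibull survival function: S(t) = exp(-(t/alpha)^gamma).

def weibull_quantile(p, alpha, gamma):
    # Time t at which a fraction p has failed, i.e. S(t) = 1 - p.
    return alpha * (-math.log(1 - p)) ** (1 / gamma)

def weibull_hazard(t, alpha, gamma):
    # lambda(t) = (gamma/alpha) * (t/alpha)^(gamma - 1);
    # gamma = 1 gives the constant-hazard (exponential) special case.
    return (gamma / alpha) * (t / alpha) ** (gamma - 1)

median = weibull_quantile(0.5, alpha=2.0, gamma=1.5)
```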
Chapter 19. Case Study in Parametric Survival Modeling and Model Approximation
Abstract
Consider the random sample of 1000 patients from the SUPPORT study [352] described in Section 3.12. In this case study we develop a parametric survival time model (accelerated failure time model) for time until death for the acute disease subset of SUPPORT (acute respiratory failure, multiple organ system failure, coma). We eliminate the chronic disease categories because the shapes of the survival curves differ between acute and chronic disease categories. To fit both acute and chronic disease classes would require a log-normal model with a disease-specific σ parameter.
Frank E. Harrell Jr.
Chapter 20. Cox Proportional Hazards Regression Model
Abstract
The Cox proportional hazards model [132] is the most popular model for the analysis of survival data. It is a semiparametric model; it makes a parametric assumption concerning the effect of the predictors on the hazard function, but makes no assumption regarding the nature of the hazard function λ(t) itself. The Cox PH model assumes that predictors act multiplicatively on the hazard function but does not assume that the hazard function is constant (i.e., exponential model), Weibull, or any other particular form. The regression portion of the model is fully parametric; that is, the regressors are linearly related to log hazard or log cumulative hazard. In many situations, either the form of the true hazard function is unknown or it is complex, so the Cox model has definite advantages. Also, one is usually more interested in the effects of the predictors than in the shape of λ(t), and the Cox approach allows the analyst to essentially ignore λ(t), which is often not of primary interest.
Frank E. Harrell Jr.
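The way the Cox model lets the analyst ignore λ(t) is visible in its partial likelihood: each event contributes exp(xᵢβ) divided by the sum of exp(xⱼβ) over the subjects still at risk, so the baseline hazard cancels and never appears. A small Python sketch with made-up, untied survival times, maximized over a coarse grid:

```python
import math

# Cox partial log-likelihood for untied event times: for each event i,
#   log( exp(x_i * b) / sum_{j in risk set at t_i} exp(x_j * b) ).
# lambda_0(t) has cancelled out of every term.
def cox_partial_loglik(times, events, x, b):
    ll = 0.0
    for i, (ti, di) in enumerate(zip(times, events)):
        if not di:
            continue  # censored observations contribute via risk sets only
        risk = [math.exp(b * xj) for tj, xj in zip(times, x) if tj >= ti]
        ll += b * x[i] - math.log(sum(risk))
    return ll

# Hypothetical data: times, event indicators, and a binary covariate.
times = [5, 8, 12, 16, 20]
events = [1, 1, 0, 1, 0]
x = [1, 0, 1, 0, 1]

# Crude grid maximization of the partial likelihood in b.
grid = [i / 100 for i in range(-300, 301)]
b_hat = max(grid, key=lambda b: cox_partial_loglik(times, events, x, b))
```

The fitted hazard ratio exp(b_hat) is interpretable without ever estimating λ₀(t).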
Chapter 21. Case Study in Cox Regression
Abstract
Consider the randomized trial of estrogen for treatment of prostate cancer [87] described in Chapter 8. Let us now develop a model for time until death (of any cause). There are 354 deaths among the 502 patients. To be able to efficiently estimate treatment benefit, to test for differential treatment effect, or to estimate prognosis or absolute treatment benefit for individual patients, we need a multivariable survival model. In this case study we do not make use of data reductions obtained in Chapter 8 but show simpler (partial) approaches to data reduction. We do use the transcan results for imputation.
Frank E. Harrell Jr.
Backmatter
Metadata
Title
Regression Modeling Strategies
Author
Frank E. Harrell, Jr.
Copyright year
2015
Electronic ISBN
978-3-319-19425-7
Print ISBN
978-3-319-19424-0
DOI
https://doi.org/10.1007/978-3-319-19425-7
