
2001 | Book

Regression Modeling Strategies

With Applications to Linear Models, Logistic Regression, and Survival Analysis

Author: Frank E. Harrell Jr.

Publisher: Springer New York

Book series: Springer Series in Statistics


About this book

Many texts are excellent sources of knowledge about individual statistical tools, but the art of data analysis is about choosing and using multiple tools. Instead of presenting isolated techniques, this text emphasizes problem solving strategies that address the many issues arising when developing multivariable models using real data and not standard textbook examples. It includes imputation methods for dealing with missing data effectively, methods for dealing with nonlinear relationships and for making the estimation of transformations a formal part of the modeling process, methods for dealing with "too many variables to analyze and not enough observations," and powerful model validation techniques based on the bootstrap. This text realistically deals with model uncertainty and its effects on inference to achieve "safe data mining".

Table of contents

Frontmatter
Chapter 1. Introduction
Abstract
Statistics comprises among other areas study design, hypothesis testing, estimation, and prediction. This text aims at the last area, by presenting methods that enable an analyst to develop models that will make accurate predictions of responses for future observations. Prediction could be considered a superset of hypothesis testing and estimation, so the methods presented here will also assist the analyst in those areas. It is worth pausing to explain how this is so.
Frank E. Harrell Jr.
Chapter 2. General Aspects of Fitting Regression Models
Abstract
The ordinary multiple linear regression model is frequently used and has parameters that are easily interpreted. In this chapter we study a general class of regression models, those stated in terms of a weighted sum of a set of independent or predictor variables. It is shown that after linearizing the model with respect to the predictor variables, the parameters in such regression models are also readily interpreted. Also, all the designs used in ordinary linear regression can be used in this general setting. These designs include analysis of variance (ANOVA) setups, interaction effects, and nonlinear effects. Besides describing and interpreting general regression models, this chapter also describes, in general terms, how the three types of assumptions of regression models can be examined.
Frank E. Harrell Jr.
Chapter 3. Missing Data
Abstract
There are missing data in the majority of datasets one is likely to encounter. Before discussing some of the problems of analyzing data in which some variables are missing for some subjects, we define some nomenclature.
Frank E. Harrell Jr.
Chapter 4. Multivariable Modeling Strategies
Abstract
Chapter 2 dealt with aspects of modeling such as transformations of predictors, relaxing linearity assumptions, modeling interactions, and examining lack of fit. Chapter 3 dealt with missing data, focusing on utilization of incomplete predictor information. All of these areas are important in the overall scheme of model development, and they cannot be separated from what is to follow. In this chapter we concern ourselves with issues related to the whole model, with emphasis on deciding on the amount of complexity to allow in the model and on dealing with large numbers of predictors. The chapter concludes with three default modeling strategies depending on whether the goal is prediction, estimation, or hypothesis testing.
Frank E. Harrell Jr.
Chapter 5. Resampling, Validating, Describing, and Simplifying the Model
Abstract
When one assumes that a random variable Y has a certain population distribution, one can use simulation or analytic derivations to study how a statistical estimator computed from samples from this distribution behaves. For example, when Y has a log-normal distribution, the variance of the sample median for a sample of size n from that distribution can be derived analytically. Alternatively, one can simulate 500 samples of size n from the log-normal distribution, compute the sample median for each sample, and then compute the sample variance of the 500 sample medians. Either case requires knowledge of the population distribution function.
Frank E. Harrell Jr.
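
The simulation described in the Chapter 5 abstract can be sketched in a few lines. This is only an illustration, not the book's own code: the function name, the log-normal parameters (mu = 0, sigma = 1), the sample sizes, and the seed are assumptions chosen for the example. The figure of 500 simulated samples comes from the abstract.

```python
import random
import statistics


def simulate_median_variance(n, n_sims=500, mu=0.0, sigma=1.0, seed=0):
    """Estimate the variance of the sample median by simulation.

    Draws n_sims samples of size n from a log-normal distribution
    (parameters mu and sigma are assumptions for this sketch),
    computes the median of each sample, and returns the sample
    variance of those medians.
    """
    rng = random.Random(seed)  # seeded so the run is reproducible
    medians = []
    for _ in range(n_sims):
        sample = [rng.lognormvariate(mu, sigma) for _ in range(n)]
        medians.append(statistics.median(sample))
    return statistics.variance(medians)


if __name__ == "__main__":
    # The estimated variance should shrink as the sample size grows.
    for n in (25, 100, 400):
        print(n, simulate_median_variance(n))
```

As the abstract notes, this approach only works because the population distribution is known, which is why the sampler can draw from it directly.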
Chapter 6. S-Plus Software
Abstract
The methods described in this book are useful in any regression model that involves a linear combination of regression parameters. The software that is described below is useful in the same situations. S [30, 308] functions allow interaction spline functions as well as a wide variety of predictor parameterizations for any regression function, and facilitate model validation by resampling.
Frank E. Harrell Jr.
Chapter 7. Case Study in Least Squares Fitting and Interpretation of a Linear Model
Abstract
This chapter presents some of the stages of modeling, using a linear multiple regression model whose coefficients are estimated using ordinary least squares. The data are taken from the 1994 version of the City and County Databook compiled by the Geospatial and Statistical Data Center of the University of Virginia Library and available at fisher.lib.virginia.edu/ccdb. Most of the variables come from the U.S. Census. Variables related to the 1992 U.S. presidential election were originally provided and copyrighted by the Elections Research Center and are taken from [365], with permission from the Copyright Clearance Center. The data extract analyzed here is available from this text’s Web site (see Appendix). The data did not contain election results from the 25 counties of Alaska. In addition, two other counties had zero voters in 1992. For these, the percent voting for each of the candidates was also set to missing. The 27 counties with missing percent votes were excluded when fitting the multivariable model.
Frank E. Harrell Jr.
Chapter 8. Case Study in Imputation and Data Reduction
Abstract
The following case study illustrates these techniques:
1. missing data imputation using mean substitution, recursive partitioning, and customized regressions;
2. variable clustering;
3. data reduction using principal components analysis and pretransformations;
4. restricted cubic spline fitting using ordinary least squares, in the context of scaling; and
5. scaling/variable transformations using canonical variates and nonparametric additive regression.
Frank E. Harrell Jr.
Chapter 9. Overview of Maximum Likelihood Estimation
Abstract
In ordinary least squares multiple regression, the objective in fitting a model is to find the values of the unknown parameters that minimize the sum of squared errors of prediction. When the response variable is polytomous or is not observed completely, a more general objective to optimize is needed.
Frank E. Harrell Jr.
Chapter 10. Binary Logistic Regression
Abstract
Binary responses are commonly studied in medical and epidemiologic research, for example, the presence or absence of a particular disease, death during surgery, and occurrence of ventricular fibrillation in a dog. Often one wishes to study how a set of predictor variables X is related to a dichotomous response variable Y. The predictors may describe such quantities as treatment assignment, dosage, risk factors, and year of treatment.
Frank E. Harrell Jr.
Chapter 11. Logistic Model Case Study 1: Predicting Cause of Death
Abstract
Consider the randomized trial of estrogen for treatment of prostate cancer [60] described in Chapter 8. In this trial, larger doses of estrogen reduced the effect of prostate cancer but at the cost of increased risk of cardiovascular death. Kay [233] did a formal analysis of the competing risks for cancer, cardiovascular, and other deaths. It can also be quite informative to study how treatment and baseline variables relate to the cause of death for those patients who died [258]. We subset the original dataset of those patients dying from prostate cancer (n = 130), heart or vascular disease (n = 96), or cerebrovascular disease (n = 31). Our goal is to predict cardiovascular death (cvd, n = 127) given the patient died from either cvd or prostate cancer. Of interest is whether the time to death has an effect on the cause of death, and whether the importance of certain variables depends on the time of death. We also need to formally test whether the data reductions and pretransformations in Chapter 8 are adequate for predicting cause of death.
Frank E. Harrell Jr.
Chapter 12. Logistic Model Case Study 2: Survival of Titanic Passengers
Abstract
This case study demonstrates the development of a binary logistic regression model to describe patterns of survival in passengers on the Titanic, based on passenger age, sex, ticket class, and the number of family members accompanying each passenger. Nonparametric regression is also used. Since many of the passengers had missing ages, multiple imputation is used so that the complete information on the other variables can be efficiently utilized.
Frank E. Harrell Jr.
Chapter 13. Ordinal Logistic Regression
Abstract
Many medical and epidemiologic studies incorporate an ordinal response variable. In some cases an ordinal response Y represents levels of a standard measurement scale such as severity of pain (none, mild, moderate, severe). In other cases, ordinal responses are constructed by specifying a hierarchy of separate endpoints. For example, clinicians may specify an ordering of the severity of several component events and assign patients to the worst event present from among none, heart attack, disabling stroke, and death. Still another use of ordinal response methods is the application of rank-based methods to continuous responses so as to obtain robust inferences. For example, the proportional odds model described later allows for a continuous Y and is really a generalization of the Wilcoxon-Mann-Whitney rank test.
Frank E. Harrell Jr.
Chapter 14. Case Study in Ordinal Regression, Data Reduction, and Penalization
Abstract
This case study is taken from Harrell et al. [188], which described a World Health Organization study [303] in which vital signs and a large number of clinical signs and symptoms were used to develop a predictive model for an ordinal response. This response consists of laboratory assessments of diagnosis and severity of illness related to pneumonia, meningitis, and sepsis. Much of the modeling strategy given in Chapter 4 was used to develop the model, with additional emphasis on penalized maximum likelihood estimation (Section 9.10). The following laboratory data are used in the response: cerebrospinal fluid (CSF) culture from a lumbar puncture (LP), blood culture (BC), arterial oxygen saturation (SaO2, a measure of lung dysfunction), and chest X-ray (CXR). The sample consisted of 4552 infants aged 90 days or less.
Frank E. Harrell Jr.
Chapter 15. Models Using Nonparametric Transformations of X and Y
Abstract
Fitting multiple regression models by the method of least squares is one of the most commonly used methods in statistics. There are a number of challenges to the use of least squares, even when it is only used for estimation and not inference, including the following.
Frank E. Harrell Jr.
Chapter 16. Introduction to Survival Analysis
Abstract
Suppose that one wished to study the occurrence of some event in a population of subjects. If the time until the occurrence of the event were unimportant, the event could be analyzed as a binary outcome using the logistic regression model. For example, in analyzing mortality associated with open heart surgery, it may not matter whether a patient dies during the procedure or he dies after being in a coma for two months. For other outcomes, especially those concerned with chronic conditions, the time until the event is important. In a study of emphysema, death at eight years after onset of symptoms is different from death at six months. An analysis that simply counted the number of deaths would be discarding valuable information and sacrificing statistical power.
Frank E. Harrell Jr.
Chapter 17. Parametric Survival Models
Abstract
The nonparametric estimator of S(t) is a very good descriptive statistic for displaying survival data. For many purposes, however, one may want to make more assumptions to allow the data to be modeled in more detail. By specifying a functional form for S(t) and estimating any unknown parameters in this function, one can
1. easily compute selected quantiles of the survival distribution;
2. estimate (usually by extrapolation) the expected failure time;
3. derive a concise equation and smooth function for estimating S(t), Λ(t), and λ(t); and
4. estimate S(t) more precisely than S_KM(t) or S_Λ(t) if the parametric form is correctly specified.
Frank E. Harrell Jr.
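
To make the Chapter 17 list concrete, here is a minimal sketch of what a fully specified functional form for S(t) provides. The Weibull form is an assumption chosen for illustration, since the abstract does not name a distribution. The function names and the shape and scale values are hypothetical, and in practice the parameters would be estimated from data rather than fixed by hand.

```python
import math


def weibull_survival(t, shape, scale):
    """S(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def weibull_cum_hazard(t, shape, scale):
    """Cumulative hazard: Lambda(t) = -log S(t) = (t / scale) ** shape."""
    return (t / scale) ** shape


def weibull_hazard(t, shape, scale):
    """Hazard lambda(t), the derivative of Lambda(t) with respect to t."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_quantile(q, shape, scale):
    """Time by which a fraction q of subjects has failed, i.e. S(t) = 1 - q."""
    return scale * (-math.log(1.0 - q)) ** (1.0 / shape)


def weibull_mean(shape, scale):
    """Expected failure time: scale * Gamma(1 + 1 / shape)."""
    return scale * math.gamma(1.0 + 1.0 / shape)


if __name__ == "__main__":
    shape, scale = 1.5, 10.0  # hypothetical parameter values
    print("median survival time:", weibull_quantile(0.5, shape, scale))
    print("expected failure time:", weibull_mean(shape, scale))
    print("S(5):", weibull_survival(5.0, shape, scale))
```

With shape = 1 the Weibull form reduces to the exponential model, whose hazard is constant at 1 / scale.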
Chapter 18. Case Study in Parametric Survival Modeling and Model Approximation
Abstract
Consider the random sample of 1000 patients from the SUPPORT study [241] described in Section 3.10. In this case study we develop a parametric survival time model (accelerated failure time model) for time until death for the acute disease subset of SUPPORT (acute respiratory failure, multiple organ system failure, coma). We eliminate the chronic disease categories because the shapes of the survival curves are different between acute and chronic disease categories. To fit both acute and chronic disease classes would require a log-normal model with σ parameter that is disease-specific.
Frank E. Harrell Jr.
Chapter 19. Cox Proportional Hazards Regression Model
Abstract
The Cox proportional hazards model [92] is the most popular model for the analysis of survival data. It is a semiparametric model; it makes a parametric assumption concerning the effect of the predictors on the hazard function, but makes no assumption regarding the nature of the hazard function λ(t) itself. The Cox PH model assumes that predictors act multiplicatively on the hazard function but does not assume that the hazard function is constant (i.e., exponential model), Weibull, or any other particular form. The regression portion of the model is fully parametric; that is, the regressors are linearly related to log hazard or log cumulative hazard. In many situations, either the form of the true hazard function is unknown or it is complex, so the Cox model has definite advantages. Also, one is usually more interested in the effects of the predictors than in the shape of λ(t), and the Cox approach allows the analyst to essentially ignore λ(t), which is often not of primary interest.
Frank E. Harrell Jr.
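
The multiplicative assumption in the Chapter 19 abstract can be illustrated with a small sketch. It takes the hazard for a subject with predictors X to be a baseline hazard times exp(Xβ), the usual statement of the proportional hazards model. The two baseline functions, the coefficient values, and all names are hypothetical and serve only to show that the hazard ratio between two subjects does not depend on time or on the baseline. The sketch does not fit a Cox model.

```python
import math


def hazard(t, x, beta, baseline):
    """Hazard at time t for predictors x: baseline(t) * exp(x . beta)."""
    linear_predictor = sum(b * xi for b, xi in zip(beta, x))
    return baseline(t) * math.exp(linear_predictor)


def hazard_ratio(x1, x0, beta):
    """Hazard ratio of subject x1 relative to x0; the baseline cancels."""
    return math.exp(sum(b * (a - c) for b, a, c in zip(beta, x1, x0)))


def constant_baseline(t):
    """Hypothetical constant baseline hazard (exponential model)."""
    return 0.1


def increasing_baseline(t):
    """Hypothetical baseline hazard that rises linearly with time."""
    return 0.05 * t


if __name__ == "__main__":
    beta = [0.7, -0.2]     # hypothetical regression coefficients
    treated = [1.0, 60.0]  # hypothetical predictor values
    control = [0.0, 60.0]
    print("hazard ratio:", hazard_ratio(treated, control, beta))
    for base in (constant_baseline, increasing_baseline):
        for t in (0.5, 2.0, 10.0):
            ratio = hazard(t, treated, beta, base) / hazard(t, control, beta, base)
            print(base.__name__, t, ratio)
```

Every printed ratio matches the hazard ratio on the first line, which is why the analyst can study predictor effects without specifying λ(t).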
Chapter 20. Case Study in Cox Regression
Abstract
Consider the randomized trial of estrogen for treatment of prostate cancer [60] described in Chapter 8. Let us now develop a model for time until death (of any cause). There are 354 deaths among the 502 patients. If we only wanted to test for a drug effect on survival time, a simple rank-based analysis would suffice. To be able to test for differential treatment effect or to estimate prognosis or expected absolute treatment benefit for individual patients, however, we need a multivariable survival model.
Frank E. Harrell Jr.
Backmatter
Metadata
Title
Regression Modeling Strategies
Author
Frank E. Harrell Jr.
Copyright year
2001
Publisher
Springer New York
Electronic ISBN
978-1-4757-3462-1
Print ISBN
978-1-4419-2918-1
DOI
https://doi.org/10.1007/978-1-4757-3462-1