2016 | Book

# Adaptive Regression for Modeling Nonlinear Relationships

Authors: George J. Knafl, Kai Ding

Publisher: Springer International Publishing

Book Series: Statistics for Biology and Health

This book presents methods for investigating whether relationships are linear or nonlinear and for adaptively fitting appropriate models when they are nonlinear. Data analysts will learn how to incorporate nonlinearity in one or more predictor variables into regression models for different types of outcome variables. Such nonlinear dependence is often not considered in applied research, yet nonlinear relationships are common and so need to be addressed. A standard linear analysis can produce misleading conclusions, while a nonlinear analysis can provide novel insights into data, not otherwise possible.

A variety of examples of the benefits of modeling nonlinear relationships are presented throughout the book. Methods are covered using what are called fractional polynomials based on real-valued power transformations of primary predictor variables combined with model selection based on likelihood cross-validation. The book covers how to formulate and conduct such adaptive fractional polynomial modeling in the standard, logistic, and Poisson regression contexts with continuous, discrete, and count outcomes, respectively, either univariate or multivariate. The book also provides a comparison of adaptive modeling to generalized additive modeling (GAM) and multivariate adaptive regression splines (MARS) for univariate outcomes.

The authors have created customized SAS macros for use in conducting adaptive regression modeling. These macros and code for conducting the analyses discussed in the book are available through the first author's website and online via the book’s Springer website. Detailed descriptions of how to use these macros and interpret their output appear throughout the book. These methods can be implemented using other programs.

Abstract

Nonlinearity in predictor (or explanatory or independent) variables in regression models for different types of outcome (or response or dependent) variables is often not considered in applied research. While relationships can reasonably be treated as linear in some cases, it is not unusual for them to be distinctly nonlinear. A standard linear analysis in the latter cases can produce misleading conclusions while a nonlinear analysis can provide novel insights into data not otherwise possible. Methods are needed for deciding whether relationships are linear or nonlinear and for fitting appropriate models when they are nonlinear. Methods for these purposes are covered in this book using what are called fractional polynomials based on power transformations of primary predictor variables with real valued powers. An adaptive approach is used to construct fractional polynomial models based on heuristic (or rule-based) searches through power transforms of primary predictor variables. The book covers how to formulate and conduct such adaptive fractional polynomial modeling in a variety of contexts including adaptive regression of continuous outcomes, adaptive logistic regression of dichotomous and polytomous outcomes with two or more values, and adaptive Poisson regression of count/rate outcomes. Power transformation of positive valued continuous outcomes is covered as well as modeling of variances/dispersions with fractional polynomials. The book also covers alternative approaches for modeling of nonlinear relationships including standard polynomials, generalized additive models (GAMs) computed using local regression (loess) and spline smoothing approaches (through SAS PROC GAM), and multivariate adaptive regression splines (MARS) models (through SAS PROC ADAPTIVEREG).
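As a rough illustration of what a fractional polynomial in one predictor looks like, the following minimal sketch evaluates a mean function built from real-valued power transforms. The function names, the coefficients, and the convention of treating power 0 as the natural log are assumptions made for illustration; they are not taken from the book or its SAS macros.

```python
import math


def power_transform(x, p):
    """Return x**p for a positive x; power 0 is treated as log(x).

    Treating power 0 as the natural log is a common fractional polynomial
    convention and is an assumption here, not a detail taken from the book.
    """
    if x <= 0:
        raise ValueError("this sketch assumes positive predictor values")
    return math.log(x) if p == 0 else x ** p


def fractional_polynomial(x, intercept, terms):
    """Evaluate intercept + sum of slope * x**power over (slope, power) pairs."""
    return intercept + sum(slope * power_transform(x, p) for slope, p in terms)


# Hypothetical model: mean = 2 + 3 * x**0.5 - 1 * x**(-1), evaluated at x = 4
value = fractional_polynomial(4.0, 2.0, [(3.0, 0.5), (-1.0, -1.0)])
print(value)
```

An adaptive search would vary the powers (here 0.5 and -1) rather than fix them in advance; how that search proceeds is specific to the book's heuristics and is not reproduced here.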

Abstract

This chapter presents a series of analyses of data on death rates per 100,000 for 60 metropolitan statistical areas, addressing how these death rates depend on the primary predictors: the nitric oxide pollution index, the sulfur dioxide pollution index, and the average annual precipitation in inches. These analyses demonstrate adaptive regression modeling of univariate continuous outcomes using fractional polynomials, including how to set the number k of folds for computing k-fold likelihood cross-validation (LCV) scores, how to compare alternative models using LCV ratio tests analogous to likelihood ratio tests, and how to adaptively model variances as well as means. The analyses demonstrate the benefits of considering nonlinearity over linearity in primary predictors and of fractional polynomial modeling over standard polynomial modeling.

The chapter also provides a formulation for multiple regression models of univariate continuous outcomes and for k-fold LCV scores. Other alternatives for conducting cross-validation are defined as well. Penalized likelihood criteria (PLCs), including the Akaike information criterion (AIC), Bayesian information criterion (BIC) and Takeuchi information criterion (TIC), are defined and their use in model selection compared to the use of LCV.
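A minimal sketch of how a k-fold LCV score might be computed, assuming a simple linear regression with normally distributed errors and a score formed by summing held-out log-likelihoods across folds, dividing by the number of observations, and exponentiating. The function names, the deterministic fold assignment, and the synthetic data are illustrative assumptions; the book's own formulation and the genreg macro may differ in detail.

```python
import math


def fit_line(xs, ys):
    """Maximum likelihood fit of y = a + b*x with constant normal variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    variance = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys)) / n
    return intercept, slope, variance


def normal_loglik(xs, ys, intercept, slope, variance):
    """Normal log-likelihood of (xs, ys) under the given fitted parameters."""
    return sum(
        -0.5 * math.log(2 * math.pi * variance)
        - (y - intercept - slope * x) ** 2 / (2 * variance)
        for x, y in zip(xs, ys)
    )


def lcv_score(xs, ys, k, power=1.0):
    """k-fold LCV score for a model linear in x**power (assumed form)."""
    n = len(xs)
    zs = [x ** power for x in xs]
    total = 0.0
    for fold in range(k):
        # Deterministic fold assignment by index, for reproducibility only.
        held_out = [i for i in range(n) if i % k == fold]
        training = [i for i in range(n) if i % k != fold]
        params = fit_line([zs[i] for i in training], [ys[i] for i in training])
        total += normal_loglik(
            [zs[i] for i in held_out], [ys[i] for i in held_out], *params
        )
    return math.exp(total / n)


# Synthetic data whose mean follows a square-root curve (made up for this sketch).
xs = [float(i + 1) for i in range(20)]
ys = [1.0 + 2.0 * math.sqrt(x) + (0.1 if i % 2 == 0 else -0.1)
      for i, x in enumerate(xs)]

lcv_linear = lcv_score(xs, ys, k=5, power=1.0)
lcv_sqrt = lcv_score(xs, ys, k=5, power=0.5)
print(lcv_linear, lcv_sqrt)
```

Larger scores indicate better held-out fit, so on this synthetic data the square-root transform scores higher than the untransformed predictor.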

Abstract

This chapter describes how to use the genreg (for general regression) macro for adaptive regression modeling, with models for the means linear in their intercept and slope parameters, and its generated output in the special case of univariate continuous outcomes as also covered in Chap. 2. Example code and output are provided addressing analyses of death rates per 100,000 for 60 metropolitan statistical areas in terms of the nitric oxide pollution index, the sulfur dioxide pollution index, and the average annual precipitation. Issues covered include loading the data; setting the number k of folds for computing k-fold likelihood cross-validation (LCV) scores; generating standard polynomial models, fractional polynomial models, monotonic models, and zero-intercept models; incorporating log transforms and multiple primary predictors; model selection using penalized likelihood criteria (PLCs) rather than LCV; bounding primary predictors; residual analyses; and modeling variances as well as means. Practice exercises are also provided for conducting analyses similar to those presented in Chaps. 2 and 3.

Abstract

This chapter formulates and demonstrates adaptive regression modeling of means and variances for repeatedly measured continuous outcomes treated as multivariate normal. Analyses are presented of dental measurements of the distance in mm from the center of the pituitary to the pterygomaxillary fissure in terms of the age and gender of the child while accounting for dependence of dental measurements for the same child. These are example analyses of data with no missing outcome values. Analyses are also presented of strength in terms of time and type of weightlifting program while accounting for dependence of strength measurements for the same subject. These are example analyses of data with missing outcome values. Analyses of these data sets use marginal models based on order 1 autoregressive (AR1) correlations and exchangeable correlations (EC) and estimated with maximum likelihood (ML) or generalized estimating equations (GEE). They also use transition models, with the current outcome value a function of prior outcome values, and general conditional models, with the current outcome value a function of other, past as well as future, outcome values. The issue of moderation is addressed, that is, how the effect of a predictor on an outcome can change with values of a moderator variable; for example, how the effect of age on the child’s dental measurements can change with the gender of the child. Moderation analyses are commonly based on interactions, but can be more generally based on geometric combinations, that is, products of power transforms of primary predictors using possibly different powers.
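In symbols, a geometric combination of two primary predictors can be sketched as follows, where \( x_1 \), \( x_2 \), \( p_1 \), and \( p_2 \) are illustrative names rather than the book's notation:

```latex
\[
  \mathrm{GC}(x_1, x_2) = x_1^{\,p_1} \, x_2^{\,p_2},
  \qquad p_1, p_2 \in \mathbb{R}.
\]
```

The usual interaction term \( x_1 x_2 \) is the special case \( p_1 = p_2 = 1 \).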

Abstract

This chapter provides a description of how to use the genreg macro for adaptive regression modeling in the case of multivariate continuous outcomes treated as multivariate normally distributed. Data in wide format are often used in multivariate outcome modeling with outcome measurements under different conditions (for example, ages for the dental measurement data analyzed in Chap. 4) in separate variables (columns) and with one observation (row) per matched set of related outcome measurements (for example, the matched sets of the dental measurement data correspond to dental measurements for different children). However, mixed modeling as used in this chapter to analyze multivariate outcome data requires that the data be converted to long format with all outcome measurements in the same variable, an extra variable to identify the measurement condition, and one observation for each outcome measurement, and so an example of such a conversion is presented. Example analyses are provided of marginal modeling of means and variances for multivariate outcomes using either order 1 autoregressive or exchangeable correlations with parameter estimates based on either maximum likelihood (ML) or on generalized estimating equations (GEE). Example analyses are also presented of transition modeling and general conditional modeling of means and variances for multivariate outcomes. Example residual analyses are presented as well, together with sensitivity analyses to assess the impact of outlying observations. Practice exercises are provided for conducting analyses similar to those presented in Chaps. 4 and 5.
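The wide-to-long conversion described above can be sketched as follows. This is a minimal illustration in plain Python rather than the SAS code used in the book; the column names, the measurement ages, and the data values are all hypothetical.

```python
def wide_to_long(rows, id_columns, condition_columns, condition_name, outcome_name):
    """Convert one-row-per-subject records to one-row-per-measurement records.

    condition_columns maps each wide outcome column to its measurement
    condition (for example, a column name to an age). Missing outcomes,
    represented here as None, produce no long-format row.
    """
    long_rows = []
    for row in rows:
        for column, condition in condition_columns.items():
            value = row.get(column)
            if value is None:
                continue
            record = {key: row[key] for key in id_columns}
            record[condition_name] = condition
            record[outcome_name] = value
            long_rows.append(record)
    return long_rows


# Hypothetical wide-format data: one row per child, one column per age.
wide = [
    {"child": 1, "male": 0, "dist8": 20.0, "dist10": 21.0, "dist12": 22.0, "dist14": 23.0},
    {"child": 2, "male": 1, "dist8": 24.0, "dist10": 25.0, "dist12": 26.0, "dist14": 27.0},
]
ages = {"dist8": 8, "dist10": 10, "dist12": 12, "dist14": 14}

long_rows = wide_to_long(wide, ["child", "male"], ages, "age", "dist")
for record in long_rows:
    print(record)
```

Each child contributes four long-format rows, with the extra `age` variable identifying the measurement condition.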

Abstract

This chapter presents analyses of several data sets with positive valued univariate or multivariate continuous outcomes addressing the need for power transformation of those outcomes along with power transformation of predictors for those outcomes. The outcome variables include those analyzed in Chaps. 2–5 as well as a new data set on plasma levels of beta-carotene in humans in terms of their fiber intake and vitamin usage. The chapter also provides a formulation for power-adjusted likelihood cross-validation (LCV) scores that can be maximized to choose a real valued power for transforming an outcome.

Abstract

This chapter describes how to use the ypower macro for adaptive regression modeling accounting for fractional polynomial transformation of positive valued univariate and multivariate continuous outcomes as well as their predictors as also covered in Chap. 6. Example code and output are provided for analyzing the univariate outcome plasma beta-carotene levels for 314 subjects in terms of their fiber intake and vitamin usage and the multivariate outcome dental measurements for 27 children in terms of their age and gender. Practice exercises are also provided for conducting analyses similar to those presented in Chaps. 6 and 7.

Abstract

This chapter presents adaptive analyses of mercury level data, addressing how mercury level categorized as low and high (with cutoff 1.0 ppm near its median) and as low, medium, and high (with cutoffs 0.72 and 1.3 ppm near its tertiles) for \( \mathrm{n}=169 \) largemouth bass caught in the Lumber and Waccamaw Rivers of North Carolina depends on weight, length, and river. These analyses demonstrate adaptive logistic regression using fractional polynomials, including modeling of dichotomous and polytomous outcomes, extensions to adaptive multinomial and ordinal regression for polytomous outcomes, and how to model dispersions as well as means. Formulations are also provided for these alternative regression models, for associated k-fold LCV scores for unit dispersions models, extended quasi-likelihood cross-validation (\( {\mathrm{QLCV}}^{+} \)) scores for non-unit dispersions models based on extended quasi-likelihoods, for odds ratio (OR) functions generalizing the OR used in standard logistic regression, and for residuals and standardized or Pearson residuals. The example analyses demonstrate assessing whether the logits (or log odds) for an outcome are nonlinear in individual predictors, whether those relationships are better addressed with multiple predictors in combination compared to using singleton predictors, whether those relationships are additive in predictors, whether the predictors interact using geometric combinations, whether ordinal polytomous outcomes are better modeled with ordinal or multinomial regression, and whether there is a benefit to considering non-unit dispersions.

Abstract

This chapter describes how to use the genreg macro for adaptive logistic regression modeling as described in Chap. 8 and its generated output in the special case of univariate dichotomous or polytomous outcomes. Example analyses are provided for modeling means and dispersions for mercury in fish categorized into the dichotomous levels of high and low and into the polytomous levels of high, medium, and low in terms of weight and length of the fish and the river in which they were caught. Adaptive ordinal and multinomial regression for polytomous outcomes are demonstrated. Residual analyses based on continuous predictors, like weight and length, of dichotomous and polytomous outcomes are better conducted using grouped data. Formulations and example analyses are provided for grouped-data residual analyses of both dichotomous and polytomous outcomes.

Abstract

This chapter formulates and demonstrates adaptive fractional polynomial modeling of means and dispersions for repeatedly measured dichotomous and polytomous outcomes with two or more values. Marginal modeling extends from the multivariate normal outcome context to the multivariate dichotomous and polytomous outcome context. However, due to the complexity in general of computing likelihoods and quasi-likelihoods (as needed to account for non-unit dispersions) for general multivariate marginal modeling, generalized estimating equations (GEE) techniques are often used instead, thereby avoiding computation of likelihoods and quasi-likelihoods. This complicates the extension of adaptive modeling to the GEE context since it is based on cross-validation (CV) scores computed from likelihoods or likelihood-like functions, but a readily computed extended likelihood is formulated for use in adaptive GEE-based modeling of multivariate dichotomous and polytomous outcomes. Conditional modeling also extends to the multivariate dichotomous and polytomous outcome context, both transition modeling and general conditional modeling. In contrast to marginal GEE-based modeling, conditional modeling of means for multivariate dichotomous and polytomous outcomes with unit dispersions is based on pseudolikelihoods that can be used to compute pseudolikelihood CV (PLCV) scores on which to base adaptive transition and general conditional modeling of multivariate dichotomous and polytomous outcomes. These marginal and conditional models can be extended to model dispersions as well as means. Example analyses of these kinds are presented of post-baseline respiratory status over time for patients with respiratory disorder in terms of the baseline respiratory status, time, and being on an active as opposed to a placebo treatment.

Abstract

This chapter describes how to use the genreg macro for adaptive logistic regression modeling of multivariate dichotomous and polytomous outcomes as described in Chap. 10 as well as its generated output. Examples are provided for modeling means and dispersions for post-baseline respiratory status in terms of time, baseline respiratory status, and being on active treatment as opposed to taking a placebo. The analyses consider both dichotomous respiratory status, categorized as poor or good, and polytomous respiratory status, categorized as poor or good or excellent. Ordinal regression and multinomial regression models are considered for polytomous respiratory status. Examples are presented for transition modeling and GEE-based marginal modeling of dichotomous and polytomous respiratory status. An example residual analysis is presented for dichotomized respiratory status.

Abstract

This chapter presents adaptive analyses of data on the incidence of non-melanoma skin cancer for women in St. Paul, Minnesota and Fort Worth, Texas, addressing how skin cancer rates for women of varying ages in these two locations depend on age and location. These analyses demonstrate adaptive Poisson regression modeling of univariate count outcomes using fractional polynomials, including modeling means of univariate count outcomes, possibly adjusted to rate outcomes through offsets, and modeling their dispersions as well as means. Formulations are also provided for these alternative regression models, for associated k-fold LCV scores for unit dispersions models, extended quasi-likelihood cross-validation (\( {\mathrm{QLCV}}^{+} \)) scores for non-unit dispersions models based on extended quasi-likelihoods, and for residuals and standardized or Pearson residuals. The example analyses demonstrate assessing whether the log of the means of an outcome is nonlinear in individual predictors, whether those relationships are better addressed with multiple predictors in combination compared to using singleton predictors, whether those relationships are additive in predictors, whether the predictors interact using geometric combinations, and whether there is a benefit to considering constant dispersions compared to unit dispersions and non-constant dispersions compared to constant dispersions.
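The role of an offset in converting counts to rates can be sketched with a standard log-linear form. The symbols below are illustrative assumptions rather than the book's notation: \( \mu_i \) is the mean count for group \( i \), \( N_i \) is the corresponding population at risk, \( x_i \) is a single primary predictor such as age, and \( p \) is a real-valued power.

```latex
\[
  \log \mu_i = \log N_i + \beta_0 + \beta_1 x_i^{\,p}
  \qquad \Longleftrightarrow \qquad
  \log\!\left( \frac{\mu_i}{N_i} \right) = \beta_0 + \beta_1 x_i^{\,p}.
\]
```

Here \( \log N_i \) enters with a fixed coefficient of 1, so the fractional polynomial on the right describes the log of the rate rather than the log of the raw count.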

Abstract

This chapter describes how to use the genreg macro for adaptive Poisson regression modeling as described in Chap. 12 and its generated output in the special case of univariate count outcomes, possibly converted to rate outcomes through offsets. Example analyses are provided for modeling means and dispersions for non-melanoma skin cancer rates for women of varying ages residing in St. Paul, Minnesota and Fort Worth, Texas, addressing how these rates depend on age and location of residence. One of these analyses provides an example for which adaptive modeling distinctly outperforms recommended degree 1 and 2 fractional polynomials.

Abstract

This chapter formulates and demonstrates adaptive fractional polynomial modeling of means and dispersions for repeatedly measured count outcomes, possibly converted to rates using offsets. Marginal modeling extends from the multivariate normal outcome context to the multivariate count/rate outcome context. However, due to the complexity in general of computing likelihoods and quasi-likelihoods (as needed to account for non-unit dispersions) for general multivariate marginal modeling, generalized estimating equations (GEE) techniques are often used instead, thereby avoiding computation of likelihoods and quasi-likelihoods. This complicates the extension of adaptive modeling to the GEE context since it is based on cross-validation (CV) scores computed from likelihoods or likelihood-like functions, but a readily computed extended likelihood is formulated for use in adaptive GEE-based modeling of multivariate count/rate outcomes. Conditional modeling also extends to the multivariate count/rate outcome context, both transition modeling and general conditional modeling. In contrast to marginal GEE-based modeling, conditional modeling of means for multivariate count/rate outcomes with unit dispersions is based on pseudolikelihoods that can be used to compute pseudolikelihood CV (PLCV) scores on which to base adaptive transition and general conditional modeling of multivariate count/rate outcomes. These marginal and conditional models can be extended to model dispersions as well as means. Example analyses of these kinds are presented of the post-baseline seizure rates per day over time for patients with epilepsy in terms of the baseline seizure rate, clinic visit, and treatment group (prescribed the drug progabide versus a placebo).

Abstract

This chapter describes how to use the genreg macro for adaptive Poisson regression modeling of multivariate count outcomes, possibly converted to rates using offsets, as described in Chap. 14 as well as its generated output. Examples are provided for modeling means and dispersions for post-baseline seizure rates in terms of clinic visit, the baseline seizure rate, and being on the anti-epileptic drug progabide as opposed to a placebo. Examples are presented for transition modeling and GEE-based marginal modeling of seizure rates.

Abstract

This chapter formulates and demonstrates generalized additive models (GAMs) for means of continuous outcomes treated as independent and normally distributed with constant variances as in linear regression and for logits (log odds) of means of dichotomous discrete outcomes with unit dispersions as in logistic regression. GAMs provide an alternative to fractional polynomial models for modeling nonlinear relationships between univariate outcomes and predictors, and so GAMs for these two cases are also compared to adaptive fractional polynomial models. Poisson regression is not considered for brevity. Example analyses are provided of the univariate continuous outcome death rate per 100,000 in terms of available predictors as also addressed in Chaps. 2, 3, 6 and 7 as well as the univariate dichotomous outcome a high mercury level in fish over 1.0 ppm versus a lower level in terms of available predictors as also addressed in Chaps. 8 and 9.

Abstract

This chapter provides a description of how to use PROC GAM for generating generalized additive models (GAMs) for univariate continuous and dichotomous outcomes as well as how to evaluate and compare GAMs with likelihood cross-validation (LCV) scores. Comparison of GAMs to adaptive fractional polynomial models on the basis of LCV scores is also covered. Example code is provided for generating models for predicting the univariate continuous outcome death rate per 100,000 in terms of available predictors as also addressed in Chaps. 2, 3, 6, 7 and 16 as well as models for predicting the univariate dichotomous outcome a high mercury level in fish over 1.0 ppm versus a lower level in terms of available predictors as also addressed in Chaps. 8, 9 and 16.

Abstract

This chapter demonstrates multivariate adaptive regression splines (MARS) for modeling of means of continuous outcomes treated as independent and normally distributed with constant variances as in linear regression and of logits (log odds) of means of dichotomous discrete outcomes with unit dispersions as in logistic regression. MARS models provide an alternative to fractional polynomial models for modeling nonlinear relationships between univariate outcomes and predictors, and so MARS models for these two cases are compared to adaptive fractional polynomial models. Poisson regression is not considered for brevity. MARS models can also be adjusted by adaptively power transforming their splines. Example analyses are provided of the univariate continuous outcome death rate per 100,000 in terms of available predictors as also addressed in Chaps. 2, 3, 16 and 17 and the univariate dichotomous outcome a high mercury level in fish over 1.0 ppm versus a lower level in terms of available predictors as also addressed in Chaps. 8, 9, 16 and 17.

Abstract

This chapter provides a description of how to use PROC ADAPTIVEREG for generating multivariate adaptive regression splines (MARS) models for univariate continuous and dichotomous outcomes as well as how to evaluate and compare MARS models with likelihood cross-validation (LCV) scores. Comparison of MARS models to adaptive fractional polynomial models on the basis of LCV scores is also covered as well as how to adaptively transform MARS models. Example code is provided for generating models for predicting the univariate continuous outcome death rate per 100,000 in terms of available predictors as also addressed in Chaps. 2, 3, 16 and 17 as well as models for predicting the univariate dichotomous outcome a high mercury level in fish over 1.0 ppm versus a lower level in terms of available predictors as also addressed in Chaps. 8, 9, 16 and 17.

Abstract

This chapter provides a general formulation for adaptive regression modeling of nonlinear relationships. Since formulations for special cases have been provided earlier, only overviews are presented for alternative types of regression models and alternative cross-validation scoring approaches. A detailed formulation for the adaptive regression modeling process used by the genreg macro is provided, which has only been generally described earlier.