Traditional regression analysis consists of fitting an “a priori” specified model in which the predictors are (ideally) uncorrelated. In contrast to this approach, however, most applications in contemporary regression belong to the exploratory data analysis framework (e.g., Box, 1983): An initial, tentative model, possibly with correlated predictors, is proposed and evolves throughout an iterative process that concludes with a final chosen equation. The central issue in this process is how to select the predictors that will be included in the final model. Mitchell and Beauchamp (1988) have provided some of the reasons for selecting a best (in some sense) set of predictors: (1) to express the relationship between the criterion and the predictors as simply as possible; (2) to reduce future prediction cost; (3) to identify the important and the negligible predictors; or (4) to increase the precision of statistical estimates and predictions. However, the procedures that we shall discuss here are designed to select the best set of predictors and are not intended to address more complex issues such as the assessment of directional influences among the predictors, interaction effects, or suppressor effects. This may be considered to be a limitation (for a further discussion, see Bring, 1994, 1995). These issues are usually dealt with by using structural equation models.

Various procedures have been proposed for finding an optimal set of Q predictors from the P potential predictors—for example, the Akaike information criterion (Akaike, 1973), the C_p criterion (Mallows, 1973), or the Bayesian information criterion (Akaike, 1978; Schwarz, 1978). These procedures are based on a comparison of all 2^P possible sets. So, when P is large, the computational requirements can be prohibitive. As a practical solution, practitioners typically use heuristic methods to reduce the number of potential predictors: stepwise selection, forward selection, or backward elimination (see, e.g., Miller, 1990, for a detailed discussion). These methods sequentially include (or exclude) predictors based on the assessment of significant changes in R².
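To make the computational burden concrete, the sketch below (Python with NumPy; purely illustrative and not part of FIRE, whose code is SPSS MATRIX syntax) enumerates all 2^P nonempty subsets and scores each one with the Gaussian AIC. The function and variable names are hypothetical.

```python
from itertools import combinations
import numpy as np

def aic_of_subset(y, X, subset):
    """Gaussian AIC of an OLS fit that uses only the predictor columns in `subset`."""
    n = len(y)
    Xs = np.column_stack([np.ones(n), X[:, list(subset)]])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta) ** 2)
    k = Xs.shape[1] + 1                      # regression coefficients plus the error variance
    return n * np.log(rss / n) + 2 * k

def best_subset_aic(y, X):
    """Exhaustive search over all 2**P - 1 nonempty subsets; prohibitive when P is large."""
    P = X.shape[1]
    subsets = (s for r in range(1, P + 1) for s in combinations(range(P), r))
    return min(subsets, key=lambda s: aic_of_subset(y, X, s))
```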

An alternative and apparently simple approach to the selection problem is to choose the most important predictors. However, as Nunnally and Bernstein (1994, pp. 191–193) noted, the definition of importance is not unique, and different situations might require different definitions. One variable may explain the most variance when the remaining predictors are ignored (e.g., have the largest squared validity), a second may contribute the most unique variance (e.g., have the largest beta weight), and a third may increment R² the most relative to a given subset of predictors. Since these three types of indices answer different questions, the researcher should decide which of them should be inspected in each particular situation. Traditionally, however, researchers assess the relative importance (RI) of predictors simply by examining their standardized regression coefficients and squared semipartial correlations. These measures, however, are context dependent, and, when the predictors are correlated, they cannot be used to unambiguously determine the contribution of a predictor to the explained criterion variance (see, e.g., Budescu, 1993). To overcome this problem, several methods have been developed for assessing the RI of the predictors, among which are dominance analysis (Azen & Budescu, 2003; Budescu, 1993; Chevan & Sutherland, 1991) and Johnson’s epsilon (ε; J. W. Johnson, 2000), initially proposed by Fabbris (1980). These RI indices are generally used to help determine importance when a researcher has no theoretical ordering of predictor variables (Baltes, Parker, Young, Huff, & Altmann, 2004). However, they can also be used in a more confirmatory way to assess whether some “a priori” or theoretically derived ordering is supported by empirical data.
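To make the distinction between these definitions concrete, the following sketch (a hypothetical Python helper, not part of FIRE) computes the three indices from a predictor correlation matrix Rxx and the vector rxy of predictor–criterion correlations: squared validities, beta weights, and squared semipartial correlations (the R² increment when a predictor is entered last).

```python
import numpy as np

def importance_indices(Rxx, rxy):
    """Three 'importance' indices that answer different questions."""
    P = len(rxy)
    beta = np.linalg.solve(Rxx, rxy)                  # standardized regression weights
    R2_full = float(rxy @ beta)
    sq_validity = rxy ** 2                            # variance explained ignoring the other predictors
    sq_semipartial = np.empty(P)
    for j in range(P):                                # R² increment when predictor j is entered last
        keep = [k for k in range(P) if k != j]
        b = np.linalg.solve(Rxx[np.ix_(keep, keep)], rxy[keep])
        sq_semipartial[j] = R2_full - float(rxy[keep] @ b)
    return sq_validity, beta, sq_semipartial
```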

Dominance analysis has three criteria for determining dominance: complete, conditional, and general dominance (C_x). These criteria range from the strictest (complete) to the weakest (general) form of dominance. Johnson’s ε makes use of the suggestion made by Gibson (1962) and R. M. Johnson (1966) that the RI of a set of predictors can be approximated by creating a set of variables that are highly related to the original variables but are uncorrelated with each other. Theoretically, both approaches are well founded and provide meaningful and clearly interpretable results. In practice, Johnson’s ε provides essentially the same results as Budescu’s C_x (see J. W. Johnson, 2000). Two SPSS programs that are focused on Johnson’s approach and that compute currently recommended techniques and recent developments for assessing the relevance of predictors can be found in Lorenzo-Seva, Ferrando, and Chico (2010).
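A minimal sketch of Johnson's relative weights computed from correlations is given below (Python; illustrative only, since FIRE itself is written in SPSS MATRIX syntax). The predictors are replaced by their best-fitting orthogonal counterparts via the symmetric square root of the predictor correlation matrix, and each predictor's weight combines its squared loadings on those counterparts with their squared regression weights; the weights sum to R².

```python
import numpy as np

def relative_weights(Rxx, rxy):
    """Johnson's (2000) relative weights from the predictor correlation matrix Rxx
    and the vector rxy of predictor-criterion correlations."""
    evals, V = np.linalg.eigh(Rxx)
    Rxx_half = V @ np.diag(np.sqrt(evals)) @ V.T     # symmetric square root of Rxx
    beta_star = np.linalg.solve(Rxx_half, rxy)       # weights of the orthogonal counterparts
    eps = (Rxx_half ** 2) @ (beta_star ** 2)         # relative weights; they sum to R²
    return eps
```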

Tonidandel, LeBreton, and Johnson (2009) proposed a technique for testing the statistical significance of relative weights (RWs). They suggested that the regression model be extended with a randomly generated variable that is not related to the criterion in the population. The randomly generated variable represents a variable with zero importance in the population; due to sampling error, however, its RW will almost always be nonzero in the sample. To test the significance of the RW produced by a theoretically meaningful variable in a data set, they proposed comparing the corresponding RW with the RW produced by the randomly generated variable. The difference between the two RWs is computed repeatedly across a large number of bootstrap samples. The standard deviation of the differences across samples is an estimate of the standard error of the difference and can be used to compute a confidence interval: If the confidence interval does not include the zero value, the RWs are significantly different (i.e., the RW produced by the theoretically meaningful variable significantly differs from zero in the population). Finally, to obtain the confidence interval, they suggested using the bias-corrected and accelerated (BCa) bootstrap method (see Efron & Tibshirani, 1993, pp. 184–188).
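The following sketch illustrates the logic of this test under simplifying assumptions: it appends a randomly generated predictor, bootstraps the difference between each RW and the random variable's RW, and builds simple percentile intervals. Tonidandel et al. (and FIRE) use BCa intervals instead, which are omitted here for brevity; the function names are hypothetical.

```python
import numpy as np

def relative_weights(Rxx, rxy):
    """Same computation as in the earlier sketch."""
    evals, V = np.linalg.eigh(Rxx)
    half = V @ np.diag(np.sqrt(evals)) @ V.T
    return (half ** 2) @ (np.linalg.solve(half, rxy) ** 2)

def rw_significance(X, y, n_boot=1000, level=0.99, seed=0):
    """Percentile bootstrap intervals for RW(predictor) - RW(random variable)."""
    rng = np.random.default_rng(seed)
    n, P = X.shape
    Xr = np.column_stack([X, rng.standard_normal(n)])      # variable with zero population importance
    diffs = np.empty((n_boot, P))
    for b in range(n_boot):
        idx = rng.integers(0, n, n)                        # bootstrap resample
        R = np.corrcoef(np.column_stack([Xr[idx], y[idx]]), rowvar=False)
        eps = relative_weights(R[:-1, :-1], R[:-1, -1])
        diffs[b] = eps[:P] - eps[P]                        # each RW minus the random variable's RW
    tail = 100 * (1 - level) / 2
    lo, hi = np.percentile(diffs, [tail, 100 - tail], axis=0)
    return lo, hi            # an RW is deemed significant if its interval excludes zero
```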

Even when a sensible selection procedure is used, the final chosen equation can be misleading if the same data are used both to select the predictors and to estimate and assess the model. Rencher and Pun (1980) showed that R² is usually overestimated in a subset regression model (especially when the number of observations is small relative to the number of candidate predictors). Breiman (1988) showed that models selected by data-driven methods may yield biased estimates of the mean squared prediction error.

Hurvich and Tsai (1990) specifically warned against using the same set of data to select the predictors and to infer the true model in a population. They advised a cross-validation procedure based on splitting the data into two subsamples: one to select the predictors (i.e., the estimation subsample), and the other to infer the model in the population (i.e., the prediction subsample). Methods for optimally splitting the data are available (see Picard & Berk, 1990, for an overview and practical guidelines)—for example, Kennard and Stone’s (1969) algorithm and the duplex algorithm (Snee, 1977). Kennard and Stone’s algorithm selects a subsample of observations that covers the multidimensional space in a uniform manner by maximizing the Euclidean distances between the predictors. One of its shortcomings is that it does not take the prediction subsample into account, a problem that the duplex algorithm solves. The duplex algorithm starts by selecting the two elements in the sample that have the greatest Euclidean distance between them and putting them into the first subsample. Then, of the remaining candidates, the two elements farthest from each other are put into the second subsample. In the next step, consecutive elements are selected and placed alternately in the first and second subsamples, the element added being the one farthest from the elements already in the receiving subsample. To determine which object is the farthest, the same criterion as in the Kennard and Stone algorithm is used (i.e., the Euclidean distance). This selection method guarantees the representativeness of the subsamples (i.e., all possible sources of variance are enclosed in both subsamples). Finally, it must be stressed that if a data-splitting procedure is to be used, the sample must be large enough: Snee suggested that the number of observations N should be larger than 2P + 20 to allow data splitting. Although the duplex algorithm was proposed some years ago, it is still a useful method in multiple regression analysis (see, e.g., Capron, Walczak, de Noord, & Massart, 2005).
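A compact sketch of this splitting logic is given below (Python; illustrative, since FIRE implements the duplex algorithm in SPSS MATRIX syntax). Candidates are assigned alternately to the two subsamples, each time taking the point whose minimum Euclidean distance to the points already in the receiving subsample is largest.

```python
import numpy as np

def duplex_split(X):
    """Split the rows of X into two index lists following the duplex scheme."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise Euclidean distances
    remaining = set(range(len(X)))
    subsets = [[], []]
    for s in (0, 1):                                            # seed each subsample with the two
        i, j = max(((a, b) for a in remaining for b in remaining if a < b),
                   key=lambda p: D[p])                          # most distant remaining points
        subsets[s] += [i, j]
        remaining -= {i, j}
    s = 0
    while remaining:                                            # alternate between the two subsamples
        cand = max(remaining, key=lambda k: min(D[k, m] for m in subsets[s]))
        subsets[s].append(cand)
        remaining.remove(cand)
        s = 1 - s
    return subsets
```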

In certain types of research, predictors are considered to be fallible measures of hypothetical latent variables, and the relations of most interest are those that would be obtained if the predictors were totally free of error. This situation is usually addressed within the general framework of structural equation modeling. However, it can also be dealt with in the present framework, perhaps as a first step before more complex models are considered.

The presence of measurement error attenuates correlations. So, if the predictors have different degrees of reliability, the correlations among them and with the criterion are differentially attenuated with respect to the “true” correlations of interest. It then follows that the beta weights, the structure coefficients, and the RWs are all potentially affected by the presence of measurement error. If some sort of reliability estimates of the predictors’ scores are available (e.g., internal consistency or test–retest), the usual correction for attenuation formula (e.g., Nunnally, 1978) can be applied to correct the correlation matrix used as input.
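A minimal sketch of this correction is shown below, assuming the reliabilities are supplied as a vector aligned with the rows of the correlation matrix; variables that should not be corrected (e.g., the criterion) can simply be given a reliability of 1. This is a generic helper, not FIRE's own code.

```python
import numpy as np

def disattenuate(R, rel):
    """Classical correction for attenuation: r*_ij = r_ij / sqrt(rel_i * rel_j)."""
    rel = np.asarray(rel, dtype=float)
    Rc = R / np.sqrt(np.outer(rel, rel))
    np.fill_diagonal(Rc, 1.0)                 # keep a unit diagonal
    return Rc
```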

In this article, we propose a general heuristic procedure for variable selection in multiple linear regression analysis. Our approach assesses the RI of the predictors and uses a cross-validation scheme. The usefulness of the procedure is illustrated with real data.

FIRE: an SPSS program for variable selection

We created an SPSS program to implement the approaches described above. The program runs automatically from the SPSS (Norusis, 1988) syntax window, and the output can be configured in a variety of ways. Specifically, the program was developed on the basis of the MATRIX command language (see, e.g., Einspruch, 2003, pp. 137–149). It should be noted, however, that users do not need to know how to program in this language in order to run FIRE; they need only specify the values of some variables in order to adapt the syntax to the data at hand. The Appendix shows an extract of the code that the user can modify to adapt the syntax. The following computation parameters can be configured: (1) the reliability of each predictor in the data set; (2) the number of bootstrap samples to be used; (3) the proportion of cases to be reserved for the prediction sample; (4) whether predictors with nonsignificant RWs are removed from the prediction sample; (5) whether a graph of residuals is produced; (6) the level of detail presented in the output; and (7) the default path for saving temporary files.

To run FIRE, the user has to have an active SPSS data file containing the criterion (as the first variable in the data file) and the predictors (i.e., the rest of the variables in the data file). The computations can be summarized in the following four consecutive steps:

  1. Step 1

    The duplex algorithm is computed to optimally split data into two subsamples: One is used to select the predictors (estimation subsample), and the other is used to infer the model in the population (prediction subsample). If reliability estimates are provided, all the correlations in all the samples are corrected by taking into account the unreliability of the predictors. The descriptive statistics of the overall sample and both subsamples are printed.

  2. Step 2

    R² and its significance test are computed in the estimation subsample. If R² is nonsignificant, the analysis is halted and a warning is printed. Otherwise, the technique proposed by Tonidandel et al. (2009) for testing the statistical significance of RWs is computed, with bootstrap 99% confidence intervals obtained by the BCa method. If the user allows automatic variable selection (option 4 above), the variables with nonsignificant RWs are removed from the prediction subsample.

  3. Step 3

    The prediction subsample is used to compute the statistics that are usually reported in a standard multiple linear regression analysis: both point estimates and bootstrap 95% percentile confidence intervals are computed. Our aim is to provide a self-contained program, so that the user can obtain all the information needed to interpret the results without having to use other programs. The indices reported are R, R², the intercept, beta weights, unstandardized regression weights, structure coefficients, Johnson’s RWs, and Johnson’s significance test for comparing the RI of predictors.

  4. Step 4

    Residual statistics are computed for the whole sample, and the corresponding descriptive statistics are printed. The statistics computed are predicted values, standardized residuals, Studentized residuals, leverage values, and Cook’s D distances. These residuals are saved in a temporary file (TMP.SAV) in the folder defined by the user. If the user so chooses, a residual scatterplot (standardized residual values vs. criterion values) is produced.
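For reference, the residual statistics reported in Step 4 can be obtained with standard formulas based on the hat matrix. The sketch below (Python; illustrative only, not FIRE's SPSS code) computes predicted values, standardized and Studentized residuals, leverage values, and Cook's distances.

```python
import numpy as np

def residual_statistics(X, y):
    """Standard OLS diagnostics based on the hat matrix H = X(X'X)^{-1}X'."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])                 # design matrix with intercept
    H = Xd @ np.linalg.solve(Xd.T @ Xd, Xd.T)             # hat matrix
    y_hat = H @ y
    e = y - y_hat                                         # raw residuals
    p = Xd.shape[1]
    h = np.diag(H)                                        # leverage values
    s2 = np.sum(e ** 2) / (n - p)                         # residual variance estimate
    standardized = e / np.sqrt(s2)
    studentized = e / np.sqrt(s2 * (1 - h))               # internally Studentized residuals
    cooks_d = (studentized ** 2) * h / (p * (1 - h))      # Cook's D distances
    return y_hat, standardized, studentized, h, cooks_d
```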

Illustrative example

A sample of 562 undergraduates from a Spanish university completed a battery of six personality questionnaires. The aim of the study was to determine an optimal set of predictors to predict stress. The initial set consisted of five potential predictors: (1) two scales (Positive Affect [PA] and Negative Affect [NA]) of the PANAS (Sandín et al., 1999; Watson, Clark, & Tellegen, 1988); (2) the Neuroticism (N) scale of the EPQ–R (Aguilar, Tous, & Andrés, 1990; Eysenck, Eysenck, & Barrett, 1985); and (3) the two scales (Optimism and Pessimism) of the Life Orientation Test (Otero-López, Luengo, Romero, Gómez, & Castro, 1998; Scheier, Carver, & Bridges, 1994). The criterion was the Perceived Stress Scale (Cohen, Kamarck, & Mermelstein, 1983; Ferrando, Chico, & Tous, 2002).

We split the data into two subsamples of 281 individuals each. The correlations between the predictors were corrected for attenuation using the reliabilities of the Spanish versions of the scales.

We started the analysis using the estimation subsample. Since R² was .568 (F = 70.19, p < .001), we concluded that at least one of the predictors should be related to the criterion in the population. Table 1 shows the empirical RWs and the BCa 99% confidence interval of the difference between each RW and that of the randomly generated, uncorrelated predictor. Only three predictors (PA, NA, and N) turned out to be significant, so this set was taken as the optimal subset of predictors.

Table 1 Outcome of the method for selecting the optimal subset of predictors

Then we assessed the results in the prediction subsample with the chosen subset of optimal predictors. The new R² was .472 (F = 83.11, p < .001), and the corresponding 95% confidence interval ranged from .379 to .572. So, even though the value of R² in the prediction subsample was lower than the corresponding value in the estimation subsample, the difference cannot be considered statistically significant. The relative contributions to the multiple R were 41.9%, 19.0%, and 39.1% for NA, PA, and N, respectively. These values suggest that NA is the most influential predictor (as had already been observed in the estimation subsample). In addition, N seemed to be more influential than PA (which was not the case in the estimation subsample). We also computed Johnson’s significance test for comparing the RI of the predictors (Johnson, 2004). This test compares the bootstrap 95% confidence intervals for pairwise differences between predictors. We used 3,000 bootstrap samples, and the results are shown in Table 2. Since the 95% confidence intervals always included the zero value, the differences between predictors cannot be considered significant. We concluded that (1) the three predictors significantly contributed to explaining criterion variance and (2) the differences observed between them could not be expected to be found in the population.

Table 2 Johnson’s significance test for comparing the relative importance of predictors: Bootstrap 95% confidence intervals for pairwise differences between predictors

Discussion

In most contemporary applications, multiple regression is used in an exploratory way: A tentative model based on a set of potentially relevant predictors is initially proposed, and then the model evolves through a process that concludes with a final chosen regression equation. In practice, this general approach has two main potential problems: (1) incorrect selection of the predictors that form the final equation, and (2) biased and misleading results that arise because the same data are used and reused in model modification. The literature review clearly shows that these problems are quite common in applications (Breiman, 1988; A. Cohen, 1991; Hurvich & Tsai, 1990).

The present article addresses problems (1) and (2) above and proposes a general program for better use of multiple regression in applied research. We propose a sequential approach to multiple regression analysis that is implemented in an SPSS-based program. The sequential framework consists of (1) optimally splitting the data for cross-validation using the duplex algorithm, (2) selecting the final set of predictors to be retained in the regression equation, and (3) assessing the behavior of the chosen model using standard indices and procedures. The results of the illustrative example suggest that the approach leads to meaningful results in applied settings. Our experience suggests that proposals such as the present one are put into practice only if they are implemented in self-contained, user-friendly programs. We believe that FIRE fulfills these requirements.

Finally, as an anonymous reviewer pointed out, the (nondefault) option of the program that automatically removes nonsignificant predictors from the model might promote a rather dangerous practice: eliminating predictors from the model without proper reflection on the substantive reasons for doing so or on how a predictor would behave if it were included in a different subset. Our advice is to use FIRE in a careful and thoughtful way. For example, if a large number of predictors turn out to be nonsignificant in a particular data set, the user could choose not to eliminate all of them at the same time, but rather one (or a few) at a time. To do this with FIRE, the predictors to be excluded from the model should be deleted from the SPSS file before FIRE is run. In addition, the predictors should be successively eliminated using different subsamples.

Program availability

The Appendix shows only a small portion of the SPSS code (i.e., the code lines that can be modified by the user). The SPSS syntax, a short manual, and data files related to this article are available as supplemental materials from brm.psychonomic-journals.org/content/supplemental. Alternatively, these materials can be obtained free of charge by e-mail (urbano.lorenzo@urv.cat).