Outside and inside the regression “black box” from exploratory to interior data analysis

Quality and Quantity

Abstract

Good data analysis consists of three phases: (1) preliminary analysis, (2) confirmatory analysis (model testing), and (3) interior analysis (model checking). Social scientists doing quantitative research usually concentrate on only one of the three: confirmatory analysis. I argue that there is much to be learned from careful preliminary and interior analyses. I present an extensive example of data analysis for each of the three phases using the same data set in each phase. Rather than surveying all the possible tools available in each phase of data analysis, I concentrate on Exploratory Data Analysis techniques (stem-and-leaf plot, letter-value display, box plot, and power transformations) for the preliminary phase, on OLS for the confirmatory phase, and on residuals, leverage, and single-case influence measures for interior analysis.
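
The interior-analysis quantities named in the abstract (residuals, leverage, and single-case influence) can be illustrated with a minimal sketch. It is not the article's own computation: the data are synthetic, the function name `ols_diagnostics` is hypothetical, and Cook's distance is used here as one representative single-case influence measure, on the assumption that it is among those the article discusses.

```python
import numpy as np

def ols_diagnostics(X, y):
    """Fit OLS and return coefficients plus single-case diagnostics.

    X: (n, p) design matrix that already includes a column of ones.
    y: (n,) response vector.
    (Hypothetical helper written for this illustration.)
    """
    n, p = X.shape
    # Thin QR decomposition: X = QR, so the hat matrix is H = Q Q^T.
    Q, R = np.linalg.qr(X)
    beta = np.linalg.solve(R, Q.T @ y)        # OLS coefficients
    leverage = np.sum(Q**2, axis=1)           # diagonal of the hat matrix
    resid = y - X @ beta                      # ordinary residuals
    s2 = resid @ resid / (n - p)              # residual variance estimate
    # Internally studentized residuals: each residual scaled by its own
    # estimated standard deviation.
    student = resid / np.sqrt(s2 * (1.0 - leverage))
    # Cook's distance combines residual size and leverage for each case.
    cooks = student**2 * leverage / (p * (1.0 - leverage))
    return beta, resid, leverage, student, cooks

# Synthetic data (an assumption for illustration): ten points close to the
# line y = 2 + 3x, plus one case far out in x that also departs from the line.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30], dtype=float)
noise = np.array([0.1, -0.2, 0.05, 0.15, -0.1, 0.0, 0.2, -0.15, 0.1, -0.05, -40.0])
y = 2.0 + 3.0 * x + noise
X = np.column_stack([np.ones_like(x), x])

beta, resid, leverage, student, cooks = ols_diagnostics(X, y)

# A common rule of thumb flags cases whose leverage exceeds 2p/n.
n, p = X.shape
flagged = np.flatnonzero(leverage > 2 * p / n)
print("high-leverage cases:", flagged)
print("most influential case (Cook's distance):", int(np.argmax(cooks)))
```

Computing the leverages from the thin QR factor avoids forming the full n-by-n hat matrix, which keeps the sketch usable on larger data sets.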

Cite this article

Franzosi, R. Outside and inside the regression “black box” from exploratory to interior data analysis. Qual Quant 28, 21–53 (1994). https://doi.org/10.1007/BF01098725
