Abstract
Good data analysis consists of three phases: (1) preliminary analysis, (2) confirmatory analysis (model testing), and (3) interior analysis (model checking). Social scientists doing quantitative research usually concentrate on only one of the three: confirmatory analysis. I argue that there is much to be learned from careful preliminary and interior analyses. I present an extensive example of data analysis for each of the three phases, using the same data set in each phase. Rather than surveying all the possible tools available in each phase of data analysis, I concentrate on Exploratory Data Analysis techniques (stem-and-leaf plot, letter-value display, box plot, and power transformations) for the preliminary phase, on OLS for the confirmatory phase, and on residuals, leverage, and single-case influence measures for interior analysis.
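As a rough illustration of the three phases named in the abstract, the following Python sketch computes a five-number summary (preliminary), an OLS fit (confirmatory), and leverage, studentized residuals, and Cook's distance (interior). It is not the article's own code or data: the function names, the use of NumPy, and the ten-point data set are all assumptions made for illustration, and the quartiles returned by `np.percentile` only approximate Tukey's hinges.

```python
import numpy as np

def five_number_summary(x):
    """Preliminary phase: extremes, quartiles (approximating the hinges), median."""
    x = np.asarray(x, dtype=float)
    return np.percentile(x, [0, 25, 50, 75, 100])

def ols_diagnostics(X, y):
    """Confirmatory OLS fit plus interior-analysis quantities.

    X must already contain a column of ones for the intercept
    and is assumed to have full column rank.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Confirmatory phase: least-squares coefficients and raw residuals.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Leverage: diagonal of the hat matrix H = X (X'X)^{-1} X',
    # obtained from the reduced QR factor as the row sums of Q**2.
    Q, _ = np.linalg.qr(X)
    leverage = np.sum(Q**2, axis=1)
    # Internally studentized residuals and Cook's distance.
    s2 = resid @ resid / (n - p)
    student = resid / np.sqrt(s2 * (1.0 - leverage))
    cooks_d = student**2 * leverage / (p * (1.0 - leverage))
    return beta, resid, leverage, student, cooks_d

# Hypothetical data: nine cases on an exact line plus one remote,
# perturbed case (the last one).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0])
y = 2.0 + 0.5 * x
y[-1] += 5.0
X = np.column_stack([np.ones_like(x), x])

summary = five_number_summary(x)
beta, resid, leverage, student, cooks_d = ols_diagnostics(X, y)
```

In this made-up example the perturbed case pulls the fitted line toward itself, so its raw residual is not the largest in absolute value, yet it has by far the highest leverage and Cook's distance. This is the kind of pattern that a confirmatory fit alone would not reveal.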
Cite this article
Franzosi, R. Outside and inside the regression “black box” from exploratory to interior data analysis. Qual Quant 28, 21–53 (1994). https://doi.org/10.1007/BF01098725