
A study of standardization of variables in cluster analysis

Journal of Classification

Abstract

A methodological problem in applied clustering involves the decision of whether or not to standardize the input variables prior to the computation of a Euclidean distance dissimilarity measure. Existing results have been mixed, with some studies recommending standardization and others suggesting that it may not be desirable. The existence of numerous approaches to standardization complicates the decision process. The present simulation study examined the standardization problem. A variety of data structures were generated that varied the intercluster spacing and the scales for the variables. The data sets were examined in four different types of error environments: error-free data, error-perturbed distances, inclusion of outliers, and the addition of random noise dimensions. Recovery of true cluster structure as found by four clustering methods was measured at the correct partition level and at reduced levels of coverage. Results for eight standardization strategies are presented. Approaches that standardize by dividing by the range of the variable gave consistently superior recovery of the underlying cluster structure. The result held over different error conditions, separation distances, clustering methods, and coverage levels. The traditional z-score transformation was found to be less effective in several situations.
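To make the comparison concrete, the following is a minimal Python sketch of two of the strategies named in the abstract: the z-score transformation and one form of division by the range. The abstract does not give the exact formulas used in the study, so the definitions below (sample standard deviation for the z-score, and subtracting the minimum before dividing by the range) are assumptions, as are the function names and the toy data.

```python
import math
import statistics


def zscore(column):
    """Center on the mean and divide by the sample standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(x - mean) / sd for x in column]


def range_scale(column):
    """Subtract the minimum and divide by the range (assumes a non-constant variable)."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]


def standardize(rows, transform):
    """Apply a per-variable transform to a list of observations."""
    columns = [transform(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]


def euclidean(a, b):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical data: variable 1 is on a small scale, variable 2 on a large one.
data = [[1.0, 1000.0], [2.0, 1100.0], [9.0, 1050.0], [10.0, 1150.0]]

z_scaled = standardize(data, zscore)
range_scaled = standardize(data, range_scale)

# On the raw data the large-scale variable dominates the distance;
# after either transformation both variables contribute on a comparable scale.
print("raw:  ", euclidean(data[0], data[1]))
print("z:    ", euclidean(z_scaled[0], z_scaled[1]))
print("range:", euclidean(range_scaled[0], range_scaled[1]))
```

The sketch only shows how each transformation changes the distances that a clustering method would receive. It does not reproduce the simulation design or the recovery results reported in the study.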




Cite this article

Milligan, G.W., Cooper, M.C. A study of standardization of variables in cluster analysis. Journal of Classification 5, 181–204 (1988). https://doi.org/10.1007/BF01897163


  • DOI: https://doi.org/10.1007/BF01897163
