Abstract
One of the thorniest aspects of cluster analysis continues to be the weighting and selection of variables. This paper reports on the performance of nine methods on eight “leading case” simulated and real sets of data. The results demonstrate shortcomings of weighting based on the standard deviation or range as well as other more complex schemes in the literature. Weighting schemes based upon carefully chosen estimates of within-cluster and between-cluster variability are generally more effective. These estimates do not require knowledge of the cluster structure. Additional research is essential: worry-free approaches do not yet exist.
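The shortcoming of standard-deviation and range weighting noted above can be illustrated with a minimal sketch. This is not the authors' code or one of their eight data sets: the synthetic data, the helper names `scale_by_std`, `scale_by_range`, and `separation`, and the separation measure itself are assumptions made purely for illustration. The cluster labels are used only to evaluate the result, never to compute the weights. The paper's preferred within-cluster and between-cluster estimates are not reproduced here.

```python
import numpy as np


def scale_by_std(X):
    """Divide each variable by its overall sample standard deviation."""
    return X / X.std(axis=0, ddof=1)


def scale_by_range(X):
    """Divide each variable by its overall range (max minus min)."""
    return X / (X.max(axis=0) - X.min(axis=0))


# Hypothetical data: two clusters of 50 points each.
# Variable 0 separates the clusters by 6 units; variable 1 is pure noise.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 2))
X[:, 0] += 6.0 * labels


def separation(Z):
    """Gap between cluster means on variable 0, in units of variable 1's spread.

    Uses the true labels, so it is an evaluation aid only.
    """
    gap = abs(Z[labels == 1, 0].mean() - Z[labels == 0, 0].mean())
    return gap / Z[:, 1].std(ddof=1)


raw = separation(X)
after_std = separation(scale_by_std(X))
after_range = separation(scale_by_range(X))
print(f"raw: {raw:.2f}  std-scaled: {after_std:.2f}  range-scaled: {after_range:.2f}")
```

Because the overall standard deviation and range of variable 0 both include the between-cluster spread, dividing by either one shrinks the informative variable relative to the noise variable, so the clusters end up less distinct than in the raw data.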
Cite this article
Gnanadesikan, R., Kettenring, J.R. & Tsao, S.L. Weighting and selection of variables for cluster analysis. Journal of Classification 12, 113–136 (1995). https://doi.org/10.1007/BF01202271
DOI: https://doi.org/10.1007/BF01202271