Skip to main content

Advertisement

Log in

A Comparison of Heuristic Procedures for Minimum Within-Cluster Sums of Squares Partitioning

  • Theory and Methods
  • Published:
Psychometrika Aims and scope Submit manuscript

Abstract

Perhaps the most common criterion for partitioning a data set is the minimization of the within-cluster sums of squared deviation from cluster centroids. Although optimal solution procedures for within-cluster sums of squares (WCSS) partitioning are computationally feasible for small data sets, heuristic procedures are required for most practical applications in the behavioral sciences. We compared the performances of nine prominent heuristic procedures for WCSS partitioning across 324 simulated data sets representative of a broad spectrum of test conditions. Performance comparisons focused on both percentage deviation from the “best-found” WCSS values, as well as recovery of true cluster structure. A real-coded genetic algorithm and variable neighborhood search heuristic were the most effective methods; however, a straightforward two-stage heuristic algorithm, HK-means, also yielded exceptional performance. A follow-up experiment using 13 empirical data sets from the clustering literature generally supported the results of the experiment using simulated data. Our findings have important implications for behavioral science researchers, whose theoretical conclusions could be adversely affected by poor algorithmic performances.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  • Al-Sultan, K. (1995). A tabu search approach to the clustering problem. Pattern Recognition, 28, 1443–1451.

    Article  Google Scholar 

  • Anderson, E. (1935). The irises of the Gaspé peninsula. Bulletin of the American Iris Society, 59, 2–5.

    Google Scholar 

  • Arabie, P., & Hubert, L. (1992). Combinatorial data analysis. Annual Review of Psychology, 43, 169–203.

    Article  Google Scholar 

  • Arabie, P., & Hubert, L.J. (1996). An overview of combinatorial data analysis. In P. Arabie, L.J. Hubert, & G. De Soete (Eds.), Clustering and classification (pp. 5–63). River Edge: World Scientific.

    Google Scholar 

  • Babu, G.P., & Murty, M.N. (1993). A near optimal initial seed value selection in k-means algorithm using genetic algorithms. Pattern Recognition Letters, 14, 763–769.

    Article  Google Scholar 

  • Babu, G.P., & Murty, M.N. (1994). Simulated annealing for selecting optimal initial seeds in the K-means algorithm. Indian Journal of Pure and Applied Mathematics, 25, 85–94.

    Google Scholar 

  • Banfield, C.F., & Bassil, L.C. (1977). A transfer algorithm for nonhierarchical classification. Applied Statistics, 26, 206–210.

    Article  Google Scholar 

  • Banfield, J.D., & Raftery, A.E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803–821.

    Article  Google Scholar 

  • Belew, R.K., & Booker, J.B. (Eds.). (1991). Proceedings of the fourth international conference on genetic algorithms. San Mateo: Morgan-Kaufmann.

    Google Scholar 

  • Brucker, F. (1978). On the complexity of clustering problems. In M. Beckmann & H.P. Kunzi (Eds.), Optimization and operations research (pp. 45–54). Heidelberg: Springer.

    Google Scholar 

  • Brusco, M.J. (2004). Clustering binary data in the presence of masking variables. Psychological Methods, 9, 510–523.

    Article  PubMed  Google Scholar 

  • Brusco, M.J. (2006). A repetitive branch-and-bound procedure for minimum within-cluster sums of squares partitioning. Psychometrika, 71, 347–363.

    Article  Google Scholar 

  • Brusco, M.J., & Cradit, J.D. (2001). A variable selection heuristic for k-means cluster analysis. Psychometrika, 66, 249–270.

    Article  Google Scholar 

  • Brusco, M.J., Cradit, J.D., & Tashchian, A. (2003). Multicriterion clusterwise regression for joint segmentation settings: An application to customer value. Journal of Marketing Research, 40, 225–234.

    Article  Google Scholar 

  • Cerny, V. (1985). A thermodynamical approach to the traveling salesman problem. Journal of Optimization Theory and Applications, 45, 41–51.

    Article  Google Scholar 

  • Day, W.H.E. (1996). Complexity theory: An introduction for practitioners of classification. In P. Arabie, L.J. Hubert, & G. De Soete (Eds.), Clustering and classification (pp. 199–233). River Edge: World Scientific.

    Google Scholar 

  • Diehr, G. (1985). Evaluation of a branch and bound algorithm for clustering. SIAM Journal for Scientific and Statistical Computing, 6, 268–284.

    Article  Google Scholar 

  • Dimitriadou, E., Dolniĉar, S., & Weingessel, A. (2002). An examination of indexes for determining the number of clusters in binary data sets. Psychometrika, 67, 137–160.

    Article  Google Scholar 

  • du Merle, O., Hansen, P., Jaumard, B., & Mladenoviĉ, N. (2000). An interior point algorithm for minimum sum-of-squares clustering. SIAM Journal on Scientific Computing, 21, 1485–1505.

    Article  Google Scholar 

  • Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.

    Google Scholar 

  • Forgy, E.W. (1965). Cluster analyses of multivariate data: Efficiency versus interpretability of classifications. Biometrics, 21, 768.

    Google Scholar 

  • Forrest, S. (Ed.). (1993). Proceedings of the fifth international conference on genetic algorithms. San Mateo: Morgan-Kaufmann.

    Google Scholar 

  • Glover, F. (1989). Tabu search—Part I. ORSA Journal on Computing, 1, 190–206.

    Google Scholar 

  • Glover, F. (1990). Tabu search—Part II. ORSA Journal on Computing, 2, 4–32.

    Google Scholar 

  • Glover, F., & Laguna, M. (1993). Tabu search. In C. Reeves (Ed.), Modern heuristic techniques for combinatorial problems (pp. 70–141). Oxford: Blackwell.

    Google Scholar 

  • Glover, F., Taillard, E., & Werra, D. (1993). A user’s guide to tabu search. Annals of Operations Research, 41, 3–28.

    Article  Google Scholar 

  • Goldberg, D.E. (1989). Genetic algorithms in search, optimization, and machine learning. New York: Addison-Wesley.

    Google Scholar 

  • Grötschel, M., & Holland, O. (1991). Solution of large-scale symmetric traveling salesman problems. Mathematical Programming, 51, 141–202.

    Article  Google Scholar 

  • Hair, J.F., Anderson, R.E., Tatham, R.L., & Black, W.C. (1998). Multivariate data analysis (5th ed.). Saddle River: Prentice-Hall.

    Google Scholar 

  • Hand, D.J. (1981). Discrimination and classification. New York: Wiley.

    Google Scholar 

  • Hand, D.J., & Krzanowski, W.J. (2005). Optimising k-means clustering results with standard software packages. Computational Statistics and Data Analysis, 49, 969–973.

    Article  Google Scholar 

  • Hansen, P., & Mladenoviĉ, N. (2001). J-Means: A new local search heuristic for minimum sum of squares clustering. Pattern Recognition, 34, 405–413.

    Article  Google Scholar 

  • Hartigan, J.A. (1975). Clustering algorithms. New York: Wiley.

    Google Scholar 

  • Hartigan, J.A., & Wong, M.A. (1979). Algorithm AS136: A k-means clustering program. Applied Statistics, 28, 100–128.

    Article  Google Scholar 

  • Heinz, G., Peterson, L.J., Johnson, R.W., & Kerk, C.J. (2003). Exploring relationships in body dimensions. Journal of Statistics Education, 11, www.amstat.org/publications/jse/v11n2/datasets.heinz.html.

  • Holland, J.H. (1975). Adaptation in natural and artificial systems. Ann Arbor: University of Michigan Press.

    Google Scholar 

  • Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–218.

    Article  Google Scholar 

  • Hubert, L., Arabie, P., & Meulman, J. (2001). Combinatorial data analysis: Optimization by dynamic programming. Philadelphia: Society Industrial and Applied Mathematics.

    Google Scholar 

  • Jancey, R.C. (1966). Multidimensional group analysis. Australian Journal of Botany, 14, 127–130.

    Article  Google Scholar 

  • Jones, D.R., & Beltramo, M.A. (1991). Solving partitioning problems with genetic algorithms. In R.K. Belew & J.B. Booker (Eds.), Proceedings of the fourth international conference on genetic algorithms. San Mateo: Morgan Kaufmann.

    Google Scholar 

  • Kirkpatrick, S., Gelatt, C.D., & Vecchi, M.P. (1983). Optimization by simulated annealing. Science, 220, 671–680.

    Article  PubMed  Google Scholar 

  • Klein, R.W., & Dubes, R.C. (1989). Experiments in projection and clustering by simulated annealing. Pattern Recognition, 22, 213–220.

    Article  Google Scholar 

  • Koontz, W.L.G., Narendra, P.M., & Fukunaga, K. (1975). A branch and bound clustering algorithm. IEEE Transaction on Computers, C-24, 908–915.

    Article  Google Scholar 

  • Krishna, K., & Murty, N.M. (1999). Genetic K-means algorithm. IEEE Transactions on Systems, Man, & Cybernetics—Part B: Cybernetics, 29, 433–439.

    Article  Google Scholar 

  • MacQueen, J.B. (1967). Some methods for classification and analysis of multivariate observations. In L.M. Le Cam & J. Neyman (Eds.), Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (Vol. 1, pp. 281–297). Berkeley: University of California Press.

  • Maronna, R., & Jacovkis, P.M. (1974). Multivariate clustering procedures with variable metrics. Biometrics, 30, 499–505.

    Article  Google Scholar 

  • MathWorks, Inc. (2002). Using MATLAB (version 6). Natick: The MathWorks, Inc.

    Google Scholar 

  • Maulik, U., & Bandyopadhyay, S. (2000). Genetic algorithm-based clustering technique. Pattern Recognition, 33, 1455–1465.

    Article  Google Scholar 

  • Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A., & Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1092.

    Article  Google Scholar 

  • Milligan, G.W. (1980a). An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45, 325–342.

    Article  Google Scholar 

  • Milligan, G.W. (1980b). The validation of four ultrametric clustering algorithms. Pattern Recognition, 12, 41–50.

    Article  Google Scholar 

  • Milligan, G.W. (1985). An algorithm for generating artificial test clusters. Psychometrika, 50, 123–127.

    Article  Google Scholar 

  • Milligan, G.W. (1989). A validation study of a variable-weighting algorithm for cluster analysis. Journal of Classification, 6, 53–71.

    Article  Google Scholar 

  • Milligan, G.W., & Cooper, M.C. (1986). A study of the comparability of external criteria for hierarchical cluster analysis. Multivariate Behavioral Research, 21, 441–458.

    Article  Google Scholar 

  • Milligan, G.W., & Cooper, M.C. (1988). A study of variable standardization. Journal of Classification, 5, 181–204.

    Article  Google Scholar 

  • Pacheco, J., & Valencia, O. (2003). Design of hybrids for the minimum sum-of-squares clustering problem. Computational Statistics and Data Analysis, 43, 235–248.

    Google Scholar 

  • Pal, S.K., & Majumder, D.D. (1977). Fuzzy sets and decision making approaches in vowel and speaker recognition. IEEE Transactions on Systems, Man, and Cybernetics, 7, 625–629.

    Article  Google Scholar 

  • Reinelt, G. (2001). TSPLIB. http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95.

  • Scott, A.J., & Symons, M.J. (1971). Clustering methods based on likelihood ratio criteria. Biometrics, 27, 387–398.

    Article  Google Scholar 

  • Selim, S.Z., & Al-Sultan, K. (1991). A simulated annealing algorithm for the clustering problem. Pattern Recognition, 24, 1003–1008.

    Article  Google Scholar 

  • Späth, H. (1980). Cluster analysis algorithms for data reduction and classification of objects. New York: Wiley.

    Google Scholar 

  • Steinley, D. (2003). Local optima in K-means clustering: What you don’t know may hurt you. Psychological Methods, 8, 294–304.

    Article  PubMed  Google Scholar 

  • Steinley, D. (2004). Properties of the Hubert–Arabie adjusted Rand index. Psychological Methods, 9, 386–396.

    Article  PubMed  Google Scholar 

  • Steinley, D. (2006a). K-means clustering: A half-century synthesis. British Journal of Mathematical and Statistical Psychology, 59, 1–34.

    Article  PubMed  Google Scholar 

  • Steinley, D. (2006b). Profiling local optima in K-means clustering: Developing a diagnostic technique. Psychological Methods, 11, 178–192.

    Article  PubMed  Google Scholar 

  • Steinley, D., & Brusco, M. (2007). Initializing K-means batch clustering: A critical analysis of several techniques. Journal of Classification, 24, 99–121.

    Article  Google Scholar 

  • Steinley, D., & Henson, R. (2005). OCLUS: An algorithmic method for generating clusters with known overlap. Journal of Classification, 22, 221–250.

    Article  Google Scholar 

  • Sun, L.-X., Xie, Y.-L., Song, X.-H., Wang, J.-H., & Yu, R.-Q. (1994a). Cluster analysis by simulated annealing. Computers in Chemistry, 18, 103–108.

    Article  Google Scholar 

  • Sun, L.-X., Xu, F., Liang, Y.-Z., Xie, Y.-L., & Yu, R.-Q. (1994b). Cluster analysis by the K-means algorithm and simulated annealing. Chemometrics and Intelligent Laboratory Systems, 25, 51–60.

    Article  Google Scholar 

  • Sung, C.S., & Jin, H.W. (2000). A tabu-search-based heuristic for clustering. Pattern Recognition, 33, 849–858.

    Article  Google Scholar 

  • van Os, B.J., & Meulman, J.J. (2004). Improving dynamic programming strategies for partitioning. Journal of Classification, 21, 207–230.

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Michael J. Brusco.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Brusco, M.J., Steinley, D. A Comparison of Heuristic Procedures for Minimum Within-Cluster Sums of Squares Partitioning. Psychometrika 72, 583–600 (2007). https://doi.org/10.1007/s11336-007-9013-4

Download citation

  • Received:

  • Revised:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11336-007-9013-4

Keywords

Navigation