Abstract
Perhaps the most common criterion for partitioning a data set is the minimization of within-cluster sums of squared deviations from cluster centroids. Although optimal solution procedures for within-cluster sums of squares (WCSS) partitioning are computationally feasible for small data sets, heuristic procedures are required for most practical applications in the behavioral sciences. We compared the performance of nine prominent heuristic procedures for WCSS partitioning across 324 simulated data sets representative of a broad spectrum of test conditions. Performance comparisons focused on both percentage deviation from best-found WCSS values and recovery of the true cluster structure. A real-coded genetic algorithm and a variable neighborhood search heuristic were the most effective methods; however, a straightforward two-stage heuristic, HK-means, also performed exceptionally well. A follow-up experiment using 13 empirical data sets from the clustering literature generally supported the results obtained with the simulated data. Our findings have important implications for behavioral science researchers, whose theoretical conclusions could be adversely affected by poor algorithmic performance.
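The WCSS criterion the abstract describes can be stated concretely: for a candidate partition, sum the squared Euclidean deviations of each observation from its cluster's centroid, and prefer partitions that minimize this total. The following minimal Python sketch (an illustration, not the authors' code) assumes observations are rows of a NumPy array and cluster memberships are integer labels:

```python
import numpy as np

def wcss(X, labels):
    """Within-cluster sum of squared deviations from cluster centroids.

    X: (n, p) array of observations; labels: length-n array of cluster ids.
    """
    total = 0.0
    for k in np.unique(labels):
        cluster = X[labels == k]          # observations assigned to cluster k
        centroid = cluster.mean(axis=0)   # cluster centroid
        total += ((cluster - centroid) ** 2).sum()
    return total

# Tiny example: two well-separated clusters on one variable.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
good = np.array([0, 0, 1, 1])  # the natural partition
bad = np.array([0, 1, 0, 1])   # a poor partition
# wcss(X, good) → 1.0, whereas wcss(X, bad) → 100.0
```

Heuristics such as K-means and the methods compared in the article differ in how they search the space of partitions for a low-WCSS solution, not in the criterion itself.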
Cite this article
Brusco, M.J., Steinley, D. A Comparison of Heuristic Procedures for Minimum Within-Cluster Sums of Squares Partitioning. Psychometrika 72, 583–600 (2007). https://doi.org/10.1007/s11336-007-9013-4