Abstract
Perhaps the most common criterion for partitioning a data set is the minimization of within-cluster sums of squared deviations from cluster centroids. Although optimal solution procedures for within-cluster sums of squares (WCSS) partitioning are computationally feasible for small data sets, heuristic procedures are required for most practical applications in the behavioral sciences. We compared the performance of nine prominent heuristic procedures for WCSS partitioning across 324 simulated data sets representative of a broad spectrum of test conditions. Performance comparisons focused on both percentage deviation from best-found WCSS values and recovery of the true cluster structure. A real-coded genetic algorithm and a variable neighborhood search heuristic were the most effective methods; however, a straightforward two-stage heuristic, HK-means, also performed exceptionally well. A follow-up experiment using 13 empirical data sets from the clustering literature generally supported the results obtained with the simulated data. Our findings have important implications for behavioral science researchers, whose theoretical conclusions could be adversely affected by poor algorithmic performance.
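The WCSS criterion the abstract describes can be stated concretely: for a candidate partition, sum the squared Euclidean deviations of each observation from its cluster's centroid, and prefer partitions that minimize this total. The following minimal Python sketch (an illustration, not the authors' code) assumes observations are rows of a NumPy array and cluster memberships are integer labels:

```python
import numpy as np

def wcss(X, labels):
    """Within-cluster sum of squared deviations from cluster centroids.

    X: (n, p) array of observations; labels: length-n array of cluster ids.
    """
    total = 0.0
    for k in np.unique(labels):
        cluster = X[labels == k]          # observations assigned to cluster k
        centroid = cluster.mean(axis=0)   # cluster centroid
        total += ((cluster - centroid) ** 2).sum()
    return total

# Tiny example: two well-separated clusters on one variable.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
good = np.array([0, 0, 1, 1])  # the natural partition
bad = np.array([0, 1, 0, 1])   # a poor partition
# wcss(X, good) → 1.0, whereas wcss(X, bad) → 100.0
```

Heuristics such as K-means and the methods compared in the article differ in how they search the space of partitions for a low-WCSS solution, not in the criterion itself.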
Cite this article
Brusco, M.J., Steinley, D. A Comparison of Heuristic Procedures for Minimum Within-Cluster Sums of Squares Partitioning. Psychometrika 72, 583–600 (2007). https://doi.org/10.1007/s11336-007-9013-4