
The effectiveness of Lloyd-type methods for the k-means problem

Published: 09 January 2013

Abstract

We investigate variants of Lloyd's heuristic for clustering high-dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd's heuristic that quickly lead to provably near-optimal clustering solutions when applied to well-clusterable instances. This is the first performance guarantee for a variant of Lloyd's heuristic. The provision of a guarantee on output quality does not come at the expense of speed: some of our algorithms are candidates for being faster in practice than currently used variants of Lloyd's method. In addition, our other algorithms are faster on well-clusterable instances than recently proposed approximation algorithms, while maintaining similar guarantees on clustering quality. Our main algorithmic contribution is a novel probabilistic seeding process for the starting configuration of a Lloyd-type iteration.
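To make the ideas in the abstract concrete, the following is a minimal sketch of a Lloyd-type iteration combined with a distance-squared probabilistic seeding step (the general idea behind such seeding processes; the paper's actual seeding procedure and its analysis differ in detail, and the function names here are illustrative):

```python
import random

def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def d2_seed(points, k, rng):
    """Probabilistic seeding sketch: the first center is chosen uniformly
    at random; each subsequent center is chosen with probability
    proportional to its squared distance from the nearest center
    picked so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sqdist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

def lloyd(points, centers, iters=20):
    """Standard Lloyd iteration: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda j: sqdist(p, centers[j]))
            clusters[j].append(p)
        # Recompute each center as the centroid of its cluster;
        # keep the old center if a cluster becomes empty.
        centers = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else c
                   for cl, c in zip(clusters, centers)]
    return centers
```

On a well-clusterable instance (in the informal sense of well-separated clusters), this kind of seeding tends to place one initial center in each cluster, after which the Lloyd iteration converges rapidly to a near-optimal solution, which is the phenomenon the paper makes precise.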



              • Published in

                Journal of the ACM, Volume 59, Issue 6 (December 2012)
                213 pages
                ISSN: 0004-5411
                EISSN: 1557-735X
                DOI: 10.1145/2395116

                Copyright © 2013 ACM

                Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                Publisher

                Association for Computing Machinery

                New York, NY, United States

                Publication History

                • Published: 9 January 2013
                • Revised: 1 September 2012
                • Accepted: 1 September 2012
                • Received: 1 March 2010
