Abstract
We investigate variants of Lloyd's heuristic for clustering high-dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd's heuristic that quickly lead to provably near-optimal clustering solutions when applied to well-clusterable instances. This is the first performance guarantee for a variant of Lloyd's heuristic. The provision of a guarantee on output quality does not come at the expense of speed: some of our algorithms are candidates for being faster in practice than currently used variants of Lloyd's method. In addition, our other algorithms are faster on well-clusterable instances than recently proposed approximation algorithms, while maintaining similar guarantees on clustering quality. Our main algorithmic contribution is a novel probabilistic seeding process for the starting configuration of a Lloyd-type iteration.
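To make the high-level description concrete, here is a minimal sketch of a distance-squared ("D²") probabilistic seeding step followed by standard Lloyd iterations, written in Python with NumPy. It illustrates the general idea of biasing the starting centers toward well-separated points; the exact seeding procedure analyzed in the paper differs in its details, and all function names and data here are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: probabilistic distance-squared seeding + Lloyd iterations.
# Not the paper's exact seeding procedure; names and data are illustrative.
import numpy as np


def d2_seed(points, k, rng):
    """Pick k starting centers; each new center is chosen with probability
    proportional to its squared distance from the centers chosen so far."""
    n = points.shape[0]
    centers = [points[rng.integers(n)]]            # first center uniformly at random
    d2 = np.sum((points - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        probs = d2 / d2.sum()
        idx = rng.choice(n, p=probs)               # biased toward far-away points
        centers.append(points[idx])
        d2 = np.minimum(d2, np.sum((points - points[idx]) ** 2, axis=1))
    return np.array(centers)


def lloyd(points, centers, iters=20):
    """Standard Lloyd iterations: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iters):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(centers.shape[0]):
            members = points[labels == j]
            if len(members) > 0:                   # keep old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers, labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three well-separated Gaussian blobs: a toy "well-clusterable" instance.
    data = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
                      for c in ([0.0, 0.0], [5.0, 5.0], [0.0, 5.0])])
    centers = d2_seed(data, k=3, rng=rng)
    centers, labels = lloyd(data, centers)
    print(np.round(centers, 2))
```

On a well-clusterable instance like the toy data above, such a seeding step tends to place one starting center near each true cluster, after which only a few Lloyd iterations are needed; this is the intuition behind the near-optimality guarantees claimed in the abstract.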