Abstract
We investigate variants of Lloyd's heuristic for clustering high-dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd's heuristic that quickly lead to provably near-optimal clustering solutions when applied to well-clusterable instances. This is the first performance guarantee for a variant of Lloyd's heuristic. The provision of a guarantee on output quality does not come at the expense of speed: some of our algorithms are candidates for being faster in practice than currently used variants of Lloyd's method. In addition, our other algorithms are faster on well-clusterable instances than recently proposed approximation algorithms, while maintaining similar guarantees on clustering quality. Our main algorithmic contribution is a novel probabilistic seeding process for the starting configuration of a Lloyd-type iteration.
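To make the high-level description concrete, here is a minimal sketch of a distance-squared ("D²") probabilistic seeding step followed by standard Lloyd iterations, written in Python with NumPy. It illustrates the general idea of biasing the starting centers toward well-separated points; the exact seeding procedure analyzed in the paper differs in its details, and all function names and data here are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: probabilistic distance-squared seeding + Lloyd iterations.
# Not the paper's exact seeding procedure; names and data are illustrative.
import numpy as np


def d2_seed(points, k, rng):
    """Pick k starting centers; each new center is chosen with probability
    proportional to its squared distance from the centers chosen so far."""
    n = points.shape[0]
    centers = [points[rng.integers(n)]]            # first center uniformly at random
    d2 = np.sum((points - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        probs = d2 / d2.sum()
        idx = rng.choice(n, p=probs)               # biased toward far-away points
        centers.append(points[idx])
        d2 = np.minimum(d2, np.sum((points - points[idx]) ** 2, axis=1))
    return np.array(centers)


def lloyd(points, centers, iters=20):
    """Standard Lloyd iterations: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iters):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(centers.shape[0]):
            members = points[labels == j]
            if len(members) > 0:                   # keep old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers, labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three well-separated Gaussian blobs: a toy "well-clusterable" instance.
    data = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
                      for c in ([0.0, 0.0], [5.0, 5.0], [0.0, 5.0])])
    centers = d2_seed(data, k=3, rng=rng)
    centers, labels = lloyd(data, centers)
    print(np.round(centers, 2))
```

On a well-clusterable instance like the toy data above, such a seeding step tends to place one starting center near each true cluster, after which only a few Lloyd iterations are needed; this is the intuition behind the near-optimality guarantees claimed in the abstract.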