skip to main content
10.1145/513400.513402acmconferencesArticle/Chapter ViewAbstractPublication PagessocgConference Proceedingsconference-collections
Article

A local search approximation algorithm for k-means clustering

Authors Info & Claims
Published:05 June 2002Publication History

ABSTRACT

In k-means clustering we are given a set of n data points in d-dimensional space ℜd and an integer k, and the problem is to determine a set of k points in ℜd, called centers, to minimize the mean squared distance from each data point to its nearest center. No exact polynomial-time algorithms are known for this problem. Although asymptotically efficient approximation algorithms exist, these algorithms are not practical due to the extremely high constant factors involved. There are many heuristics that are used in practice, but we know of no bounds on their performance.We consider the question of whether there exists a simple and practical approximation algorithm for k-means clustering. We present a local improvement heuristic based on swapping centers in and out. We prove that this yields a (9+ε)-approximation algorithm. We show that the approximation factor is almost tight, by giving an example for which the algorithm achieves an approximation factor of (9-ε). To establish the practical value of the heuristic, we present an empirical study that shows that, when combined with Lloyd's algorithm, this heuristic performs quite well in practice.

References

  1. P. K. Agarwal and C. M. Procopiuc. Exact and approximation algorithms for clustering. In Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 658--667, San Francisco, CA, January 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. K. Alsabti, S. Ranka, and V. Singh. An efficient k-means clustering algorithm. In Proceedings of the First Workshop on High Performance Data Mining, Orlando, FL, March 1998.Google ScholarGoogle Scholar
  3. S. Arora, P. Raghavan, and S. Rao. Approximation schemes for Euclidean k-median and related problems. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 106--113, Dallas, TX, May 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. V. Arya, N. Garg, R. Khandekar, V. Pandit, A. Meyerson, and K. Munagala. Local search heuristics for k-median and facility location problems. In Proceedings of the 33rd Annual Symposium on Theory of Computing, pages 21--29, Crete, Greece, July 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. G. H. Ball and D. J. Hall. Some fundamental concepts and synthesis procedures for pattern recognition preprocessors. In International Conference on Microwaves, Circuit Theory, and Information Theory, Tokyo, Japan, September 1964.Google ScholarGoogle Scholar
  6. S. Bandyopadhyay, U. Maulik, and M. K. Pakhira. Clustering using simulated annealing with probabilistic redistribution. International J. Patt. Recog. and Artif. Intell, 15:269--285, 2001.Google ScholarGoogle ScholarCross RefCross Ref
  7. V. Capoyleas, G. Rote, and G. Woeginger. Geometric clusterings. Journal of Algorithms, 12:341--356, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. M. Charikar and S. Guha. Improved combinatorial algorithms for the facility location and k-medians problem. In Proceedings of the 4th Annual IEEE Symposium on Foundations of Computer Science, pages 378--388, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Q. Du, V. Faber, and M. Gunzburger. Centroidal Voronoi tesselations: Applications and algorithms. SIAM Review, 41:637--676, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, New York, NY, 1973.Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. A. E. ElGamal, L. A. Hemanchandra, I. Shperling, and V. K. Wei. Using simulated annealing to design good codes. IEEE Trans. Information Theory, 33:116--123, 1987. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. V. Faber. Clustering and the continuous k-means algorithm. Los Alamos Science, 22:138--144, 1994.Google ScholarGoogle Scholar
  13. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. E. Forgey. Cluster analysis of multivariate data: Efficiency vs. interpretability of classification. Biometrics, 21:768, 1965.Google ScholarGoogle Scholar
  15. M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, New York, NY, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. A. Gersho and R. M. Gray. Vector Quantization and Signal Compression. Kluwer Academic, Boston, MA, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. M. Inaba, N. Katoh, and H. Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In Proceedings of the Tenth Annual ACM Symposium on Computational Geometry, pages 332--339, Stony Brook, NY, June 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ, 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. A. K. Jain, P. W. Duin, and J. Mao. Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):4--37, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3):264--323, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. T. Kanungo, D. M. Mount, N. S. Netanyahu, C. Piatko, R. Silverman, and A. Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Trans. Patt. Anal. Mach. Intell., 24, 2002. (To appear). Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, New York, NY, 1990.Google ScholarGoogle Scholar
  23. S. Kirkpatrick, Jr. Gelatt, and M.P. Vecchi. Optimization by simulated annealing. Science, 220:671--680, 1983.Google ScholarGoogle ScholarCross RefCross Ref
  24. T. Kohonen. Self-Organization and Associative Memory. Springer-Verlag, New York, NY, 3rd edition, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. S. Kolliopoulos and S. Rao. A nearly linear-time approximation scheme for the Euclidean k-median problem. In J. Nesetril, editor, Proceedings of the Seventh Annual European Symposium on Algorithms, volume 1643 of Lecture Notes Comput. Sci., pages 362--371. Springer-Verlag, July 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. M. Korupolu, C. Plaxton, and R. Rajaraman. Analysis of a local search heuristic for facility location problems. In Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1--10, San Francisco, CA, January 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281--296, Berkeley, CA, 1967.Google ScholarGoogle Scholar
  28. J. Matoušek. On approximate geometric k-clustering. Discrete and Computational Geometry, 24:61--84, 2000.Google ScholarGoogle ScholarCross RefCross Ref
  29. R. T. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In Proceedings of the Twentieth International Conference on Very Large Databases, pages 144--155, Santiago, Chile, September 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. D. Pelleg and A. Moore. Accelerating exact k-means algorithms with geometric reasoning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 277--281, San Diego, CA, August 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. D. Pelleg and A. Moore. x-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the Seventeenth International Conference on Machine Learning, Palo Alto, CA, July 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. S. J. Phillips. Acceleration of k-means and related clustering problems. In Workshop on Algorithm Engineering and Experiments (ALENEX'02), January 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. J. Vaisey and A. Gersho. Simulated annealing and codebook design. In Proc. IEEE Int'l. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 1176--1179, 1988.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. A local search approximation algorithm for k-means clustering

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        SCG '02: Proceedings of the eighteenth annual symposium on Computational geometry
        June 2002
        330 pages
        ISBN:1581135041
        DOI:10.1145/513400

        Copyright © 2002 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 5 June 2002

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • Article

        Acceptance Rates

        SCG '02 Paper Acceptance Rate35of104submissions,34%Overall Acceptance Rate625of1,685submissions,37%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader