Article

A local search approximation algorithm for k-means clustering

Authors:
Tapas Kanungo

IBM Almaden Research Center, San Jose, CA

IBM Almaden Research Center, San Jose, CA
View Profile

,
David M. Mount

University of Maryland, College Park, MD

University of Maryland, College Park, MD
View Profile

,
Nathan S. Netanyahu

Science, Bar-Ilan University, Ramat-Gan, Israel and Center for Automation Research, University of Maryland, College Park, MD

Science, Bar-Ilan University, Ramat-Gan, Israel and Center for Automation Research, University of Maryland, College Park, MD
View Profile

,
Christine D. Piatko

The Johns Hopkins University Applied Physics Laboratory, Laurel, MD

The Johns Hopkins University Applied Physics Laboratory, Laurel, MD
View Profile

,
Ruth Silverman

University of Maryland, College Park, MD

University of Maryland, College Park, MD
View Profile

,
Angela Y. Wu

American University, Washington, DC

American University, Washington, DC
View Profile

SCG '02: Proceedings of the eighteenth annual symposium on Computational geometryJune 2002Pages 10–18https://doi.org/10.1145/513400.513402

Published:05 June 2002Publication History

SCG '02: Proceedings of the eighteenth annual symposium on Computational geometry

Pages 10–18

ABSTRACT

In k-means clustering we are given a set of n data points in d-dimensional space ℜ^d and an integer k, and the problem is to determine a set of k points in ℜ^d, called centers, to minimize the mean squared distance from each data point to its nearest center. No exact polynomial-time algorithms are known for this problem. Although asymptotically efficient approximation algorithms exist, these algorithms are not practical due to the extremely high constant factors involved. There are many heuristics that are used in practice, but we know of no bounds on their performance.We consider the question of whether there exists a simple and practical approximation algorithm for k-means clustering. We present a local improvement heuristic based on swapping centers in and out. We prove that this yields a (9+ε)-approximation algorithm. We show that the approximation factor is almost tight, by giving an example for which the algorithm achieves an approximation factor of (9-ε). To establish the practical value of the heuristic, we present an empirical study that shows that, when combined with Lloyd's algorithm, this heuristic performs quite well in practice.

References

P. K. Agarwal and C. M. Procopiuc. Exact and approximation algorithms for clustering. In Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 658--667, San Francisco, CA, January 1998. Google ScholarDigital Library
K. Alsabti, S. Ranka, and V. Singh. An efficient k-means clustering algorithm. In Proceedings of the First Workshop on High Performance Data Mining, Orlando, FL, March 1998.Google Scholar
S. Arora, P. Raghavan, and S. Rao. Approximation schemes for Euclidean k-median and related problems. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 106--113, Dallas, TX, May 1998. Google ScholarDigital Library
V. Arya, N. Garg, R. Khandekar, V. Pandit, A. Meyerson, and K. Munagala. Local search heuristics for k-median and facility location problems. In Proceedings of the 33rd Annual Symposium on Theory of Computing, pages 21--29, Crete, Greece, July 2001. Google ScholarDigital Library
G. H. Ball and D. J. Hall. Some fundamental concepts and synthesis procedures for pattern recognition preprocessors. In International Conference on Microwaves, Circuit Theory, and Information Theory, Tokyo, Japan, September 1964.Google Scholar
S. Bandyopadhyay, U. Maulik, and M. K. Pakhira. Clustering using simulated annealing with probabilistic redistribution. International J. Patt. Recog. and Artif. Intell, 15:269--285, 2001.Google ScholarCross Ref
V. Capoyleas, G. Rote, and G. Woeginger. Geometric clusterings. Journal of Algorithms, 12:341--356, 1991. Google ScholarDigital Library
M. Charikar and S. Guha. Improved combinatorial algorithms for the facility location and k-medians problem. In Proceedings of the 4th Annual IEEE Symposium on Foundations of Computer Science, pages 378--388, 1999. Google ScholarDigital Library
Q. Du, V. Faber, and M. Gunzburger. Centroidal Voronoi tesselations: Applications and algorithms. SIAM Review, 41:637--676, 1999. Google ScholarDigital Library
R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, New York, NY, 1973.Google ScholarDigital Library
A. E. ElGamal, L. A. Hemanchandra, I. Shperling, and V. K. Wei. Using simulated annealing to design good codes. IEEE Trans. Information Theory, 33:116--123, 1987. Google ScholarDigital Library
V. Faber. Clustering and the continuous k-means algorithm. Los Alamos Science, 22:138--144, 1994.Google Scholar
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996. Google ScholarDigital Library
E. Forgey. Cluster analysis of multivariate data: Efficiency vs. interpretability of classification. Biometrics, 21:768, 1965.Google Scholar
M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, New York, NY, 1979. Google ScholarDigital Library
A. Gersho and R. M. Gray. Vector Quantization and Signal Compression. Kluwer Academic, Boston, MA, 1992. Google ScholarDigital Library
M. Inaba, N. Katoh, and H. Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In Proceedings of the Tenth Annual ACM Symposium on Computational Geometry, pages 332--339, Stony Brook, NY, June 1994. Google ScholarDigital Library
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ, 1988. Google ScholarDigital Library
A. K. Jain, P. W. Duin, and J. Mao. Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):4--37, 2000. Google ScholarDigital Library
A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3):264--323, 1999. Google ScholarDigital Library
T. Kanungo, D. M. Mount, N. S. Netanyahu, C. Piatko, R. Silverman, and A. Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Trans. Patt. Anal. Mach. Intell., 24, 2002. (To appear). Google ScholarDigital Library
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, New York, NY, 1990.Google Scholar
S. Kirkpatrick, Jr. Gelatt, and M.P. Vecchi. Optimization by simulated annealing. Science, 220:671--680, 1983.Google ScholarCross Ref
T. Kohonen. Self-Organization and Associative Memory. Springer-Verlag, New York, NY, 3rd edition, 1989. Google ScholarDigital Library
S. Kolliopoulos and S. Rao. A nearly linear-time approximation scheme for the Euclidean k-median problem. In J. Nesetril, editor, Proceedings of the Seventh Annual European Symposium on Algorithms, volume 1643 of Lecture Notes Comput. Sci., pages 362--371. Springer-Verlag, July 1999. Google ScholarDigital Library
M. Korupolu, C. Plaxton, and R. Rajaraman. Analysis of a local search heuristic for facility location problems. In Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1--10, San Francisco, CA, January 1998. Google ScholarDigital Library
J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281--296, Berkeley, CA, 1967.Google Scholar
J. Matoušek. On approximate geometric k-clustering. Discrete and Computational Geometry, 24:61--84, 2000.Google ScholarCross Ref
R. T. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In Proceedings of the Twentieth International Conference on Very Large Databases, pages 144--155, Santiago, Chile, September 1994. Google ScholarDigital Library
D. Pelleg and A. Moore. Accelerating exact k-means algorithms with geometric reasoning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 277--281, San Diego, CA, August 1999. Google ScholarDigital Library
D. Pelleg and A. Moore. x-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the Seventeenth International Conference on Machine Learning, Palo Alto, CA, July 2000. Google ScholarDigital Library
S. J. Phillips. Acceleration of k-means and related clustering problems. In Workshop on Algorithm Engineering and Experiments (ALENEX'02), January 2002. Google ScholarDigital Library
J. Vaisey and A. Gersho. Simulated annealing and codebook design. In Proc. IEEE Int'l. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 1176--1179, 1988.Google ScholarCross Ref

Index Terms

A local search approximation algorithm for k-means clustering
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Theory of computation
  1. Randomness, geometry and discrete structures
    1. Computational geometry

Recommendations

A local search approximation algorithm for k-means clustering
Special issue on the 18th annual symposium on computational geometrySoCG2002

In k-means clustering we are given a set of n data points in d-dimensional space R^d and an integer k, and the problem is to determine a set of k points in R^d, called centers, to minimize the mean squared distance from each data point to its nearest ...
Read More
Ant clustering algorithm with K-harmonic means clustering

Clustering is an unsupervised learning procedure and there is no a prior knowledge of data distribution. It organizes a set of objects/data into similar groups called clusters, and the objects within one cluster are highly similar and dissimilar with ...
Read More
Clustering under approximation stability

A common approach to clustering data is to view data objects as points in a metric space, and then to optimize a natural distance-based objective such as the k-median, k-means, or min-sum score. For applications such as clustering proteins by function ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SCG '02: Proceedings of the eighteenth annual symposium on Computational geometry
June 2002
330 pages
ISBN:1581135041
DOI:10.1145/513400
Conference Chairs:
Ferran Hurtado
Universitat Politècnica de Catalunya
,
Vera Sacristán
Universitat Politècnica de Catalunya
,
Program Chairs:
Chandrajit Bajaj
University of Texas at Austin
,
Subhash Suri
University of California at Santa Barbara
Copyright © 2002 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 5 June 2002
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
approximation algorithms
clustering
computational geometry
k-means
local search
Qualifiers
- Article
Conference

Acceptance Rates
SCG '02 Paper Acceptance Rate35of104submissions,34%Overall Acceptance Rate625of1,685submissions,37%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 103
  Total Citations
  View Citations
- 327
  Total Downloads
- Downloads (Last 12 months)120
- Downloads (Last 6 weeks)9
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

A local search approximation algorithm for k-means clustering

SCG '02: Proceedings of the eighteenth annual symposium on Computational geometry

ABSTRACT

References

Cited By

Index Terms

Recommendations

A local search approximation algorithm for k-means clustering

Ant clustering algorithm with K-harmonic means clustering

Clustering under approximation stability