ABSTRACT
In k-means clustering we are given a set of n data points in d-dimensional space ℜd and an integer k, and the problem is to determine a set of k points in ℜd, called centers, to minimize the mean squared distance from each data point to its nearest center. No exact polynomial-time algorithms are known for this problem. Although asymptotically efficient approximation algorithms exist, these algorithms are not practical due to the extremely high constant factors involved. There are many heuristics that are used in practice, but we know of no bounds on their performance.We consider the question of whether there exists a simple and practical approximation algorithm for k-means clustering. We present a local improvement heuristic based on swapping centers in and out. We prove that this yields a (9+ε)-approximation algorithm. We show that the approximation factor is almost tight, by giving an example for which the algorithm achieves an approximation factor of (9-ε). To establish the practical value of the heuristic, we present an empirical study that shows that, when combined with Lloyd's algorithm, this heuristic performs quite well in practice.
- P. K. Agarwal and C. M. Procopiuc. Exact and approximation algorithms for clustering. In Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 658--667, San Francisco, CA, January 1998. Google ScholarDigital Library
- K. Alsabti, S. Ranka, and V. Singh. An efficient k-means clustering algorithm. In Proceedings of the First Workshop on High Performance Data Mining, Orlando, FL, March 1998.Google Scholar
- S. Arora, P. Raghavan, and S. Rao. Approximation schemes for Euclidean k-median and related problems. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 106--113, Dallas, TX, May 1998. Google ScholarDigital Library
- V. Arya, N. Garg, R. Khandekar, V. Pandit, A. Meyerson, and K. Munagala. Local search heuristics for k-median and facility location problems. In Proceedings of the 33rd Annual Symposium on Theory of Computing, pages 21--29, Crete, Greece, July 2001. Google ScholarDigital Library
- G. H. Ball and D. J. Hall. Some fundamental concepts and synthesis procedures for pattern recognition preprocessors. In International Conference on Microwaves, Circuit Theory, and Information Theory, Tokyo, Japan, September 1964.Google Scholar
- S. Bandyopadhyay, U. Maulik, and M. K. Pakhira. Clustering using simulated annealing with probabilistic redistribution. International J. Patt. Recog. and Artif. Intell, 15:269--285, 2001.Google ScholarCross Ref
- V. Capoyleas, G. Rote, and G. Woeginger. Geometric clusterings. Journal of Algorithms, 12:341--356, 1991. Google ScholarDigital Library
- M. Charikar and S. Guha. Improved combinatorial algorithms for the facility location and k-medians problem. In Proceedings of the 4th Annual IEEE Symposium on Foundations of Computer Science, pages 378--388, 1999. Google ScholarDigital Library
- Q. Du, V. Faber, and M. Gunzburger. Centroidal Voronoi tesselations: Applications and algorithms. SIAM Review, 41:637--676, 1999. Google ScholarDigital Library
- R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, New York, NY, 1973.Google ScholarDigital Library
- A. E. ElGamal, L. A. Hemanchandra, I. Shperling, and V. K. Wei. Using simulated annealing to design good codes. IEEE Trans. Information Theory, 33:116--123, 1987. Google ScholarDigital Library
- V. Faber. Clustering and the continuous k-means algorithm. Los Alamos Science, 22:138--144, 1994.Google Scholar
- U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996. Google ScholarDigital Library
- E. Forgey. Cluster analysis of multivariate data: Efficiency vs. interpretability of classification. Biometrics, 21:768, 1965.Google Scholar
- M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, New York, NY, 1979. Google ScholarDigital Library
- A. Gersho and R. M. Gray. Vector Quantization and Signal Compression. Kluwer Academic, Boston, MA, 1992. Google ScholarDigital Library
- M. Inaba, N. Katoh, and H. Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In Proceedings of the Tenth Annual ACM Symposium on Computational Geometry, pages 332--339, Stony Brook, NY, June 1994. Google ScholarDigital Library
- A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ, 1988. Google ScholarDigital Library
- A. K. Jain, P. W. Duin, and J. Mao. Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):4--37, 2000. Google ScholarDigital Library
- A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3):264--323, 1999. Google ScholarDigital Library
- T. Kanungo, D. M. Mount, N. S. Netanyahu, C. Piatko, R. Silverman, and A. Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Trans. Patt. Anal. Mach. Intell., 24, 2002. (To appear). Google ScholarDigital Library
- L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, New York, NY, 1990.Google Scholar
- S. Kirkpatrick, Jr. Gelatt, and M.P. Vecchi. Optimization by simulated annealing. Science, 220:671--680, 1983.Google ScholarCross Ref
- T. Kohonen. Self-Organization and Associative Memory. Springer-Verlag, New York, NY, 3rd edition, 1989. Google ScholarDigital Library
- S. Kolliopoulos and S. Rao. A nearly linear-time approximation scheme for the Euclidean k-median problem. In J. Nesetril, editor, Proceedings of the Seventh Annual European Symposium on Algorithms, volume 1643 of Lecture Notes Comput. Sci., pages 362--371. Springer-Verlag, July 1999. Google ScholarDigital Library
- M. Korupolu, C. Plaxton, and R. Rajaraman. Analysis of a local search heuristic for facility location problems. In Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1--10, San Francisco, CA, January 1998. Google ScholarDigital Library
- J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281--296, Berkeley, CA, 1967.Google Scholar
- J. Matoušek. On approximate geometric k-clustering. Discrete and Computational Geometry, 24:61--84, 2000.Google ScholarCross Ref
- R. T. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In Proceedings of the Twentieth International Conference on Very Large Databases, pages 144--155, Santiago, Chile, September 1994. Google ScholarDigital Library
- D. Pelleg and A. Moore. Accelerating exact k-means algorithms with geometric reasoning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 277--281, San Diego, CA, August 1999. Google ScholarDigital Library
- D. Pelleg and A. Moore. x-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the Seventeenth International Conference on Machine Learning, Palo Alto, CA, July 2000. Google ScholarDigital Library
- S. J. Phillips. Acceleration of k-means and related clustering problems. In Workshop on Algorithm Engineering and Experiments (ALENEX'02), January 2002. Google ScholarDigital Library
- J. Vaisey and A. Gersho. Simulated annealing and codebook design. In Proc. IEEE Int'l. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 1176--1179, 1988.Google ScholarCross Ref
Index Terms
- A local search approximation algorithm for k-means clustering
Recommendations
A local search approximation algorithm for k-means clustering
Special issue on the 18th annual symposium on computational geometrySoCG2002In k-means clustering we are given a set of n data points in d-dimensional space Rd and an integer k, and the problem is to determine a set of k points in Rd, called centers, to minimize the mean squared distance from each data point to its nearest ...
Ant clustering algorithm with K-harmonic means clustering
Clustering is an unsupervised learning procedure and there is no a prior knowledge of data distribution. It organizes a set of objects/data into similar groups called clusters, and the objects within one cluster are highly similar and dissimilar with ...
Clustering under approximation stability
A common approach to clustering data is to view data objects as points in a metric space, and then to optimize a natural distance-based objective such as the k-median, k-means, or min-sum score. For applications such as clustering proteins by function ...
Comments