Abstract
We consider the problem of partitioning a set of m points in the n-dimensional Euclidean space into k clusters (usually m and n are variable, while k is fixed), so as to minimize the sum of squared distances between each point and its cluster center. This formulation is usually the objective of the k-means clustering algorithm (Kanungo et al. (2000)). We prove that this problem in NP-hard even for k = 2, and we consider a continuous relaxation of this discrete problem: find the k-dimensional subspace V that minimizes the sum of squared distances to V of the m points. This relaxation can be solved by computing the Singular Value Decomposition (SVD) of the m × n matrix A that represents the m points; this solution can be used to get a 2-approximation algorithm for the original problem. We then argue that in fact the relaxation provides a generalized clustering which is useful in its own right.
Finally, we show that the SVD of a random submatrix—chosen according to a suitable probability distribution—of a given matrix provides an approximation to the SVD of the whole matrix, thus yielding a very fast randomized algorithm. We expect this algorithm to be the main contribution of this paper, since it can be applied to problems of very large size which typically arise in modern applications.
Article PDF
Similar content being viewed by others
References
Achlioptas, D., & McSherry, F. (2001). Fast computation of low rank approximations. In Proceedings of the 33rd Annual Symposium on Theory of Computing (pp. 337–346).
Achlioptas, D., & McSherry, F. (2003). Fast computation of low rank matrix approximations. Manuscript.
Andrews, H. C., & Patterson, C. L. (1976a). Singular value decomposition image coding. IEEE Trans. on Communications, 4, 425–432.
Andrews, H. C., & Patterson, C. L. (1976b). Singular value decompositions and digital image processing. IEEE Trans. ASSP 26–53.
Azar, Y., Fiat, A., Karlin, A., McSherry, F., & Saia, J. (2001). Spectral analysis of data. In Proc. of the 33rd ACM Symposium on Theory of Computing (pp. 619–626).
Berry, M. J., & Linoff, G. (1997). Data mining techniques. John-Wiley.
Bar-Yossef, Z. (2002). The complexity of massive dataset computations. Ph.D. thesis, University of California, Berkeley.
Bar-Yossef, Z. (2003). Sampling lower bounds via information theory. In Proceedings of the 35th Annual Symposium on Theory of Computing (pp. 335–344).
Charikar, M., & Guha, S. (1999). Improved combinatorial algorithms for the facility location and k-median problems. In Proceedings of the 40th IEEE Symposium on Foundations of Computer Science (pp. 378–388).
Charikar, M., Guha, S., Shmoys, D., & Tardos, E. (2002). A constant factor approximation algorithm for the k-median problem. Journal of Computer and System Sciences, 65:1, 129–149.
Drineas, P., & Kannan, R. (2001). Fast Monte-Carlo algorithms for approximate matrix multiplication. In Proceedings of the 42nd Annual Symposium on Foundations of Computer Science (pp. 452–459).
Frieze, A., Kannan, R., & Vempala, S. (1998). Fast Monte-Carlo algorithms for finding low rank approximations. In Proceedings of the 39th Annual Symposium on Foundations of Computer Science (pp. 370–378).
Furedi, Z., & Komlos, J. (1981). The eigenvalues of random symmetric matrices. Combinatorica, 1, 233–241.
Gersho, A., & Gray, R. M. (1991). Vector quantization and signal compression. Kluwer Academic.
Gibson, D., Kleinberg, J., & Raghavan, P. (1998). Clustering categorical data: An approach based on dynamical systems. Very Large Data Bases (VLDB) 311–322.
Goldreich, O., Goldwasser, S., & Ron, D. (1998). Property testing and its connection to learning and approximation. Journal of the ACM, 5:4, 653–750.
Golub, G., & Van Loan, C. (1989). Matrix computations. Johns Hopkins University Press.
Huang, T., & Narendra, P. (1974). Image restoration by singular value decomposition. Applied Optics, 14:9, 2213–2216.
Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Prentice Hall.
Jain, K., & Vazirani, V. (1999). Primal-dual approximation algorithms for metric facility location and k-median problems. In Proceedings of the 40th IEEE Symposium on Foundations of Computer Science (pp. 2–13).
Jambu, M., & Lebeaux, M.-O. (1983). Cluster analysis and data analysis. North Holland.
Kanungo, T., Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R., & Wu, A. Y. (2000). The analysis of a simple k-means clustering algorithm. In Symposium on Computational Geometry (pp. 100–109).
Kleinberg, J. (1998). Authoritative sources in a hyperlinked environment. In Proceedings of 9th Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 668–677).
McDiarmid, C. J. H. (1989). On the method of bounded differences. In Surveys in Combinatorics: Invited Papers at the 12th British Combinatorial Conference (pp. 148–188).
Ostrovsky, R., & Rabani, Y. (2002). Polynomial time approximation schemes for geometric k-clustering. Journal of the ACM, 49:2, 139–156.
Papadimitriou, C. H., Raghavan, P., Tamaki, H., & Vempala, S. (2000). Latent semantic indexing: A probabilistic analysis. Journal of Computer and System Sciences, 61:2, 217–235.
Parlett, B. (1997). The symmetric eigenvalue problem. Classics in Applied Mathematics, SIAM.
Author information
Authors and Affiliations
Rights and permissions
About this article
Cite this article
Drineas, P., Frieze, A., Kannan, R. et al. Clustering Large Graphs via the Singular Value Decomposition. Machine Learning 56, 9–33 (2004). https://doi.org/10.1023/B:MACH.0000033113.59016.96
Issue Date:
DOI: https://doi.org/10.1023/B:MACH.0000033113.59016.96