ABSTRACT
Clustering problems have numerous applications and are becoming more challenging as the size of the data increases. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on the practical and popular clustering problems, k-center and k-median. We develop fast clustering algorithms with constant factor approximation guarantees. From a theoretical perspective, we give the first analysis that shows several clustering algorithms are in MRC0, a theoretical MapReduce class introduced by Karloff et al. [26]. Our algorithms use sampling to decrease the data size and they run a time consuming clustering algorithm such as local search or Lloyd's algorithm on the resulting data set. Our algorithms have sufficient flexibility to be used in practice since they run in a constant number of MapReduce rounds. We complement these results by performing experiments using our algorithms. We compare the empirical performance of our algorithms to several sequential and parallel algorithms for the k-median problem. The experiments show that our algorithms' solutions are similar to or better than the other algorithms' solutions. Furthermore, on data sets that are sufficiently large, our algorithms are faster than the other parallel algorithms that we tested.
- Google: Cluster computing and mapreduce. http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html.Google Scholar
- P. K. Agarwal and N. H. Mustafa. k-means projective clustering. In PODS, pages 155--165, 2004. Google ScholarDigital Library
- D. Arthur and S. Vassilvitskii. k-means: the advantages of careful seeding. In SODA, pages 1027--1035, 2007. Google ScholarDigital Library
- V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala, and V. Pandit. Local search heuristics for k-median and facility location problems. SIAM J. Comput., 33(3):544--562, 2004. Google ScholarDigital Library
- Y. Bartal. Probabilistic approximations of metric spaces and its algorithmic applications. In FOCS, pages 184--193, 1996. Google ScholarDigital Library
- Y. Bartal. On approximating arbitrary metrices by tree metrics. In STOC, pages 161--168, 1998. Google ScholarDigital Library
- M. Berry. Mapreduce and k-means clustering. http://blog.data-miners.com/2008/02/mapreduce-and-k-means-clustering.html, 2008.Google Scholar
- G. E. Blelloch and K. Tangwongsan. Parallel approximation algorithms for facility-location problems. In SPAA, pages 315--324, 2010. Google ScholarDigital Library
- M. Charikar, C. Chekuri, T. Feder, and R. Motwani. Incremental clustering and dynamic information retrieval. SIAM J. Comput., 33(6):1417--1440, 2004. Google ScholarDigital Library
- M. Charikar, C. Chekuri, A. Goel, and S. Guha. Rounding via trees: Deterministic approximation algorithms for group steiner trees and k-median. In STOC, pages 114--123, 1998. Google ScholarDigital Library
- M. Charikar and S. Guha. Improved combinatorial algorithms for the facility location and k-median problems. In FOCS, pages 378--388, 1999. Google ScholarDigital Library
- M. Charikar, S. Guha, É. Tardos, and D. B. Shmoys. A constant-factor approximation algorithm for the k-median problem. J. Comput. Syst. Sci., 65(1):129--149, 2002. Google ScholarDigital Library
- K. Chen. On k-median clustering in high dimensions. In SODA, pages 1177--1185, 2006. Google ScholarDigital Library
- K. Chen. A constant factor approximation algorithm for k-median clustering with outliers. In SODA, pages 826--835, 2008. Google ScholarDigital Library
- F. Chierichetti, R. Kumar, and A. Tomkins. Max-cover in map-reduce. In WWW, pages 231--240, 2010. Google ScholarDigital Library
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of OSDI, pages 137--150, 2004. Google ScholarDigital Library
- M. E. Dyer and A. M. Frieze. A simple heuristic for the p-centre problem. Operations Research Letters, 3(6):285 -- 288, 1985.Google ScholarDigital Library
- J. Feldman, S. Muthukrishnan, A. Sidiropoulos, C. Stein, and Z. Svitkina. On distributing symmetric streaming computations. ACM Transactions on Algorithms, 6(4), 2010. Google ScholarDigital Library
- T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293 -- 306, 1985.Google Scholar
- S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams: Theory and practice. IEEE Trans. Knowl. Data Eng., 15(3):515--528, 2003. Google ScholarDigital Library
- A. Gupta and K. Tangwongsan. Simpler analyses of local search algorithms for facility location. CoRR, abs/0809.2554, 2008.Google Scholar
- R. Herwig, A. J. Poustka, C. Müller, C. Bull, H. Lehrach, and J. O'Brien. Large-Scale Clustering of cDNA-Fingerprinting Data. Genome Research, 9(11):1093--1105, November 1999.Google ScholarCross Ref
- D. S. Hochbaum and D. B. Shmoys. A best possible heuristic for the k-center problem. Mathematics of Operations Research, 10(2):180--184, May 1985.Google ScholarDigital Library
- K. Jain, M. Mahdian, and A. Saberi. A new greedy approach for facility location problems. In STOC, pages 731--740. ACM, 2002. Google ScholarDigital Library
- U. Kang, C. Tsourakakis, A. P. Appel, C. Faloutsos, and J. Leskovec. Hadi: Fast diameter estimation and mining in massive graphs with hadoop. Technical report, School of Computer Science, Carnegie Mellon University Pittsburgh, December 2008.Google Scholar
- H. J. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for mapreduce. In SODA, pages 938--948, 2010. Google ScholarDigital Library
- J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2010. Google ScholarDigital Library
- S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129--136, 1982.Google ScholarDigital Library
- A. Ma and I. K. Sethi. Distributed k-median clustering with application to image clustering. In PRIS, pages 215--220, 2007.Google Scholar
- G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD Conference, pages 135--146, 2010. Google ScholarDigital Library
- R. M. McCutchen and S. Khuller. Streaming algorithms for k-center clustering with outliers and with anonymity. In APPROX-RANDOM, pages 165--178, 2008. Google ScholarDigital Library
- M. Thorup. Quick k-median, k-center, and facility location for sparse graphs. SIAM J. Comput., 34(2):405--432, 2004. Google ScholarDigital Library
- T. White. Hadoop: The Definitive Guide. O'Reilly Media, 2009. Google ScholarDigital Library
Index Terms
- Fast clustering using MapReduce
Recommendations
On objective-based rough c-means clustering
GRC '12: Proceedings of the 2012 IEEE International Conference on Granular Computing (GrC-2012)Conventional clustering algorithms classify a set of objects into some clusters with clear boundaries, that is, one object must belong to one cluster. However, many objects belong to more than one cluster in real world, since the boundaries of clusters ...
Matching Affinity Clustering: Improved Hierarchical Clustering at Scale with Guarantees
AAMAS '20: Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent SystemsHierarchical clustering is a stronger extension of one of today's most influential unsupervised learning methods: clustering. The goal of this method is to create a hierarchy of clusters, thus constructing cluster evolutionary history and simultaneously ...
The BigKClustering approach for document clustering using Hadoop MapReduce
PCI '18: Proceedings of the 22nd Pan-Hellenic Conference on InformaticsClustering is an efficient data mining as well as machine-learning method when we need to get an insight of the objects of a dataset that could be grouped together. K-Means is one of the most commonly used methods of clustering, due to its high quality ...
Comments