research-article

Fast clustering using MapReduce

Authors:
Alina Ene

University of Illinois, Urbana, IL, USA

University of Illinois, Urbana, IL, USA
View Profile

,
Sungjin Im

University of Illinois, Urbana, IL, USA

University of Illinois, Urbana, IL, USA
View Profile

,
Benjamin Moseley

University of Illinois, Urbana, IL, USA

University of Illinois, Urbana, IL, USA
View Profile

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data miningAugust 2011Pages 681–689https://doi.org/10.1145/2020408.2020515

Published:21 August 2011Publication History

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 681–689

ABSTRACT

Clustering problems have numerous applications and are becoming more challenging as the size of the data increases. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on the practical and popular clustering problems, k-center and k-median. We develop fast clustering algorithms with constant factor approximation guarantees. From a theoretical perspective, we give the first analysis that shows several clustering algorithms are in MRC⁰, a theoretical MapReduce class introduced by Karloff et al. [26]. Our algorithms use sampling to decrease the data size and they run a time consuming clustering algorithm such as local search or Lloyd's algorithm on the resulting data set. Our algorithms have sufficient flexibility to be used in practice since they run in a constant number of MapReduce rounds. We complement these results by performing experiments using our algorithms. We compare the empirical performance of our algorithms to several sequential and parallel algorithms for the k-median problem. The experiments show that our algorithms' solutions are similar to or better than the other algorithms' solutions. Furthermore, on data sets that are sufficiently large, our algorithms are faster than the other parallel algorithms that we tested.

References

Google: Cluster computing and mapreduce. http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html.Google Scholar
P. K. Agarwal and N. H. Mustafa. k-means projective clustering. In PODS, pages 155--165, 2004. Google ScholarDigital Library
D. Arthur and S. Vassilvitskii. k-means: the advantages of careful seeding. In SODA, pages 1027--1035, 2007. Google ScholarDigital Library
V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala, and V. Pandit. Local search heuristics for k-median and facility location problems. SIAM J. Comput., 33(3):544--562, 2004. Google ScholarDigital Library
Y. Bartal. Probabilistic approximations of metric spaces and its algorithmic applications. In FOCS, pages 184--193, 1996. Google ScholarDigital Library
Y. Bartal. On approximating arbitrary metrices by tree metrics. In STOC, pages 161--168, 1998. Google ScholarDigital Library
M. Berry. Mapreduce and k-means clustering. http://blog.data-miners.com/2008/02/mapreduce-and-k-means-clustering.html, 2008.Google Scholar
G. E. Blelloch and K. Tangwongsan. Parallel approximation algorithms for facility-location problems. In SPAA, pages 315--324, 2010. Google ScholarDigital Library
M. Charikar, C. Chekuri, T. Feder, and R. Motwani. Incremental clustering and dynamic information retrieval. SIAM J. Comput., 33(6):1417--1440, 2004. Google ScholarDigital Library
M. Charikar, C. Chekuri, A. Goel, and S. Guha. Rounding via trees: Deterministic approximation algorithms for group steiner trees and k-median. In STOC, pages 114--123, 1998. Google ScholarDigital Library
M. Charikar and S. Guha. Improved combinatorial algorithms for the facility location and k-median problems. In FOCS, pages 378--388, 1999. Google ScholarDigital Library
M. Charikar, S. Guha, É. Tardos, and D. B. Shmoys. A constant-factor approximation algorithm for the k-median problem. J. Comput. Syst. Sci., 65(1):129--149, 2002. Google ScholarDigital Library
K. Chen. On k-median clustering in high dimensions. In SODA, pages 1177--1185, 2006. Google ScholarDigital Library
K. Chen. A constant factor approximation algorithm for k-median clustering with outliers. In SODA, pages 826--835, 2008. Google ScholarDigital Library
F. Chierichetti, R. Kumar, and A. Tomkins. Max-cover in map-reduce. In WWW, pages 231--240, 2010. Google ScholarDigital Library
J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of OSDI, pages 137--150, 2004. Google ScholarDigital Library
M. E. Dyer and A. M. Frieze. A simple heuristic for the p-centre problem. Operations Research Letters, 3(6):285 -- 288, 1985.Google ScholarDigital Library
J. Feldman, S. Muthukrishnan, A. Sidiropoulos, C. Stein, and Z. Svitkina. On distributing symmetric streaming computations. ACM Transactions on Algorithms, 6(4), 2010. Google ScholarDigital Library
T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293 -- 306, 1985.Google Scholar
S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams: Theory and practice. IEEE Trans. Knowl. Data Eng., 15(3):515--528, 2003. Google ScholarDigital Library
A. Gupta and K. Tangwongsan. Simpler analyses of local search algorithms for facility location. CoRR, abs/0809.2554, 2008.Google Scholar
R. Herwig, A. J. Poustka, C. Müller, C. Bull, H. Lehrach, and J. O'Brien. Large-Scale Clustering of cDNA-Fingerprinting Data. Genome Research, 9(11):1093--1105, November 1999.Google ScholarCross Ref
D. S. Hochbaum and D. B. Shmoys. A best possible heuristic for the k-center problem. Mathematics of Operations Research, 10(2):180--184, May 1985.Google ScholarDigital Library
K. Jain, M. Mahdian, and A. Saberi. A new greedy approach for facility location problems. In STOC, pages 731--740. ACM, 2002. Google ScholarDigital Library
U. Kang, C. Tsourakakis, A. P. Appel, C. Faloutsos, and J. Leskovec. Hadi: Fast diameter estimation and mining in massive graphs with hadoop. Technical report, School of Computer Science, Carnegie Mellon University Pittsburgh, December 2008.Google Scholar
H. J. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for mapreduce. In SODA, pages 938--948, 2010. Google ScholarDigital Library
J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2010. Google ScholarDigital Library
S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129--136, 1982.Google ScholarDigital Library
A. Ma and I. K. Sethi. Distributed k-median clustering with application to image clustering. In PRIS, pages 215--220, 2007.Google Scholar
G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD Conference, pages 135--146, 2010. Google ScholarDigital Library
R. M. McCutchen and S. Khuller. Streaming algorithms for k-center clustering with outliers and with anonymity. In APPROX-RANDOM, pages 165--178, 2008. Google ScholarDigital Library
M. Thorup. Quick k-median, k-center, and facility location for sparse graphs. SIAM J. Comput., 34(2):405--432, 2004. Google ScholarDigital Library
T. White. Hadoop: The Definitive Guide. O'Reilly Media, 2009. Google ScholarDigital Library

Index Terms

Fast clustering using MapReduce
1. Theory of computation
  1. Design and analysis of algorithms

Recommendations

On objective-based rough c-means clustering
GRC '12: Proceedings of the 2012 IEEE International Conference on Granular Computing (GrC-2012)

Conventional clustering algorithms classify a set of objects into some clusters with clear boundaries, that is, one object must belong to one cluster. However, many objects belong to more than one cluster in real world, since the boundaries of clusters ...
Read More
Matching Affinity Clustering: Improved Hierarchical Clustering at Scale with Guarantees
AAMAS '20: Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems

Hierarchical clustering is a stronger extension of one of today's most influential unsupervised learning methods: clustering. The goal of this method is to create a hierarchy of clusters, thus constructing cluster evolutionary history and simultaneously ...
Read More
The BigKClustering approach for document clustering using Hadoop MapReduce
PCI '18: Proceedings of the 22nd Pan-Hellenic Conference on Informatics

Clustering is an efficient data mining as well as machine-learning method when we need to get an insight of the objects of a dataset that could be grouped together. K-Means is one of the most commonly used methods of clustering, due to its high quality ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2011
1446 pages
ISBN:9781450308137
DOI:10.1145/2020408
General Chair:
Chid Apte
IBM Research
,
Program Chairs:
Joydeep Ghosh
UT Austin
,
Padhraic Smyth
UC Irvine
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 21 August 2011
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
MapReduce
approximation algorithms
clustering algorithms
k-center
k-median
parallel algorithms
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate1,133of8,635submissions,13%
Upcoming Conference
KDD '24

Sponsor:

sigkdd

sigkdd

The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona , Spain
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 6
  Total Citations
  View Citations
- 1,916
  Total Downloads
- Downloads (Last 12 months)46
- Downloads (Last 6 weeks)6
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Fast clustering using MapReduce

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining

ABSTRACT

References

Cited By

Index Terms

Recommendations

On objective-based rough c-means clustering

Matching Affinity Clustering: Improved Hierarchical Clustering at Scale with Guarantees

The BigKClustering approach for document clustering using Hadoop MapReduce