research-article

ClusterJoin: a similarity joins framework using map-reduce

Authors:
Akash Das Sarma

Stanford University Palo Alto, CA

Stanford University Palo Alto, CA
View Profile

,
Yeye He

Microsoft Research, Redmond, WA

Microsoft Research, Redmond, WA
View Profile

,
Surajit Chaudhuri

Microsoft Research, Redmond, WA

Microsoft Research, Redmond, WA
View Profile

Proceedings of the VLDB Endowment Volume 7 Issue 12pp 1059–1070https://doi.org/10.14778/2732977.2732981

Published:01 August 2014Publication History

Proceedings of the VLDB Endowment

Abstract

Similarity join is the problem of finding pairs of records with similarity score greater than some threshold. In this paper we study the problem of scaling up similarity join for different metric distance functions using MapReduce. We propose a ClusterJoin framework that partitions the data space based on the underlying data distribution, and distributes each record to partitions in which they may produce join results based on the distance threshold. We design a set of strong candidate filters specific to different distance functions using a novel bisector-based framework, so that each record only needs to be distributed to a small number of partitions while still guaranteeing correctness. To address data skewness, which is common for high dimensional data, we further develop a dynamic load balancing scheme using sampling, which provides strong probabilistic guarantees on the size of partitions, and greatly improves scalability. Experimental evaluation using real data sets shows that our approach is considerably more scalable compared to state-of-the-art algorithms, especially for high dimensional data with low distance thresholds.

References

Linkedgeodata: http://linkedgeodata.org.Google Scholar
Openstreetmap: www.openstreetmap.org.Google Scholar
F. Afrati, A. Das Sarma, A. Rajaraman, P. Rule, S. Salihoglu, and J. Ullman. Anchor points algorithms for hamming and edit distance. In Proceedings of ICDT, 2014.Google Scholar
F. N. Afrati, A. D. Sarma, D. Menestrina, A. G. Parameswaran, and J. D. Ullman. Fuzzy joins using mapreduce. In Proceedings of ICDE, 2012. Google ScholarDigital Library
A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In Proceedings of VLDB, 2006. Google ScholarDigital Library
R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In Proceedings of WWW, 2007. Google ScholarDigital Library
K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is "nearest neighbor" meaningful? In Proceedings of ICDT, 1999. Google ScholarDigital Library
A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In Computer Networks, 1997. Google ScholarDigital Library
V. Bryant. Metric Spaces: Iteration and Application. Cambridge University Press, 1985.Google ScholarCross Ref
R. Chaiken, B. Jenkins, P. Larson, and B. Ramsey. Scope: Easy and efficient parallel processing of massive data sets. In Proceedings of VLDB, 2008. Google ScholarDigital Library
S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In Proceedings of ICDE, 2006. Google ScholarDigital Library
J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In Communication of ACM, 2010. Google ScholarDigital Library
D. Deng, G. Li, S. Hao, J. Wang, and J. Feng. Massjoin: A mapreduce-based algorithm for string similarity joins. In Proceedings of ICDE, 2013.Google Scholar
A. Doan, A. Halevy, and Z. Ives. Principles of data integration. In Morgan Kaufmann, 2012. Google ScholarDigital Library
D. M. Endres and J. E. Schindelin. A new metric for probability distributions. In IEEE Trans. Inf. Theory, 2003. Google ScholarDigital Library
M. R. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In Proceedings of SIGIR, 2006. Google ScholarDigital Library
T. C. Hoad and J. Zobel. Methods for identifying versioned and plagiarized documents. In JASIST 54(3), 2003. Google ScholarDigital Library
A. Metwally and C. Faloutsos. V-smart-join: A scalable mapreduce framework for all-pair similarity joins of multisets and vectors. In Proceedings of VLDB, 2012. Google ScholarDigital Library
D. A. Nehme-Haily. The set-union knapsack problem. University of Texas at Austin, Thesis, 1995. Google ScholarDigital Library
A. Okabe, B. Boots, K. Sugihara, and S. N. Chiu. Spatial tessellations: Concepts and applications of voronoi diagrams. 2009.Google ScholarDigital Library
A. Okcan and M. Riedewald. Processing theta-joins using mapreduce. In Proceedings of SIGMOD, 2011. Google ScholarDigital Library
V. Satuluri and S. Parthasarathy. Bayesian locality sensitive hashing for fast similarity search. In Proceedings of VLDB, 2012. Google ScholarDigital Library
Y. N. Silva and J. M. Reed. Exploiting mapreduce-based similarity joins. In Proceedings of SIGMOD, 2012. Google ScholarDigital Library
R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using mapreduce. In Proceedings of SIGMOD, 2010. Google ScholarDigital Library
Y. Wang, A. Metwally, and S. Parthasarathy. Scalable all-pairs similarity search in metric spaces. In Proceedings of KDD, 2013. Google ScholarDigital Library
W. Winkler. The state of record linkage and current research problems. In Technical Report, US Bureau of Census, 1999.Google Scholar
C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In Proceedings of WWW, 2008. Google ScholarDigital Library

Recommendations

A novel similarity/dissimilarity measure for intuitionistic fuzzy sets and its application in pattern recognition

Among the most interesting measures in intuitionistic fuzzy sets (IFSs) theory, the similarity measure is an essential tool to compare and determine degree of similarity between IFSs. Although there exist many similarity measures for IFSs, most of them ...
Read More
Single-valued neutrosophic similarity measures based on cotangent function and their application in the fault diagnosis of steam turbine

Similarity measure is an important tool in pattern recognition and fault diagnosis. This paper proposes two cotangent similarity measures for single-valued neutrosophic sets (SVNSs) based on cotangent function. Then, the weighted cotangent similarity ...
Read More
Elementary theory of similarity and its application in biology and geography: Multiple-site measures of similarity and dissimilarity

A system of axioms of multiple-site similarity measures is considered that is defined on a series of finite descriptive sets. The concept of equivalence of multiple-site similarity measures is proposed. Equivalence classes of multiple-site similarity ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 7, Issue 12
August 2014
296 pages
ISSN:2150-8097
Editors:
H. V. Jagadish
University of Michigan
,
Aoying Zhou
East Normal University, China
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 August 2014
Published in pvldb Volume 7, Issue 12
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 33
  Total Citations
  View Citations
- 346
  Total Downloads
- Downloads (Last 12 months)11
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

ClusterJoin: a similarity joins framework using map-reduce

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

A novel similarity/dissimilarity measure for intuitionistic fuzzy sets and its application in pattern recognition

Single-valued neutrosophic similarity measures based on cotangent function and their application in the fault diagnosis of steam turbine

Elementary theory of similarity and its application in biology and geography: Multiple-site measures of similarity and dissimilarity

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

ClusterJoin: a similarity joins framework using map-reduce

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

A novel similarity/dissimilarity measure for intuitionistic fuzzy sets and its application in pattern recognition

Single-valued neutrosophic similarity measures based on cotangent function and their application in the fault diagnosis of steam turbine

Elementary theory of similarity and its application in biology and geography: Multiple-site measures of similarity and dissimilarity

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media