Abstract
Similarity join is the problem of finding pairs of records with similarity score greater than some threshold. In this paper we study the problem of scaling up similarity join for different metric distance functions using MapReduce. We propose a ClusterJoin framework that partitions the data space based on the underlying data distribution, and distributes each record to partitions in which they may produce join results based on the distance threshold. We design a set of strong candidate filters specific to different distance functions using a novel bisector-based framework, so that each record only needs to be distributed to a small number of partitions while still guaranteeing correctness. To address data skewness, which is common for high dimensional data, we further develop a dynamic load balancing scheme using sampling, which provides strong probabilistic guarantees on the size of partitions, and greatly improves scalability. Experimental evaluation using real data sets shows that our approach is considerably more scalable compared to state-of-the-art algorithms, especially for high dimensional data with low distance thresholds.
- Linkedgeodata: http://linkedgeodata.org.Google Scholar
- Openstreetmap: www.openstreetmap.org.Google Scholar
- F. Afrati, A. Das Sarma, A. Rajaraman, P. Rule, S. Salihoglu, and J. Ullman. Anchor points algorithms for hamming and edit distance. In Proceedings of ICDT, 2014.Google Scholar
- F. N. Afrati, A. D. Sarma, D. Menestrina, A. G. Parameswaran, and J. D. Ullman. Fuzzy joins using mapreduce. In Proceedings of ICDE, 2012. Google ScholarDigital Library
- A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In Proceedings of VLDB, 2006. Google ScholarDigital Library
- R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In Proceedings of WWW, 2007. Google ScholarDigital Library
- K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is "nearest neighbor" meaningful? In Proceedings of ICDT, 1999. Google ScholarDigital Library
- A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In Computer Networks, 1997. Google ScholarDigital Library
- V. Bryant. Metric Spaces: Iteration and Application. Cambridge University Press, 1985.Google ScholarCross Ref
- R. Chaiken, B. Jenkins, P. Larson, and B. Ramsey. Scope: Easy and efficient parallel processing of massive data sets. In Proceedings of VLDB, 2008. Google ScholarDigital Library
- S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In Proceedings of ICDE, 2006. Google ScholarDigital Library
- J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In Communication of ACM, 2010. Google ScholarDigital Library
- D. Deng, G. Li, S. Hao, J. Wang, and J. Feng. Massjoin: A mapreduce-based algorithm for string similarity joins. In Proceedings of ICDE, 2013.Google Scholar
- A. Doan, A. Halevy, and Z. Ives. Principles of data integration. In Morgan Kaufmann, 2012. Google ScholarDigital Library
- D. M. Endres and J. E. Schindelin. A new metric for probability distributions. In IEEE Trans. Inf. Theory, 2003. Google ScholarDigital Library
- M. R. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In Proceedings of SIGIR, 2006. Google ScholarDigital Library
- T. C. Hoad and J. Zobel. Methods for identifying versioned and plagiarized documents. In JASIST 54(3), 2003. Google ScholarDigital Library
- A. Metwally and C. Faloutsos. V-smart-join: A scalable mapreduce framework for all-pair similarity joins of multisets and vectors. In Proceedings of VLDB, 2012. Google ScholarDigital Library
- D. A. Nehme-Haily. The set-union knapsack problem. University of Texas at Austin, Thesis, 1995. Google ScholarDigital Library
- A. Okabe, B. Boots, K. Sugihara, and S. N. Chiu. Spatial tessellations: Concepts and applications of voronoi diagrams. 2009.Google ScholarDigital Library
- A. Okcan and M. Riedewald. Processing theta-joins using mapreduce. In Proceedings of SIGMOD, 2011. Google ScholarDigital Library
- V. Satuluri and S. Parthasarathy. Bayesian locality sensitive hashing for fast similarity search. In Proceedings of VLDB, 2012. Google ScholarDigital Library
- Y. N. Silva and J. M. Reed. Exploiting mapreduce-based similarity joins. In Proceedings of SIGMOD, 2012. Google ScholarDigital Library
- R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using mapreduce. In Proceedings of SIGMOD, 2010. Google ScholarDigital Library
- Y. Wang, A. Metwally, and S. Parthasarathy. Scalable all-pairs similarity search in metric spaces. In Proceedings of KDD, 2013. Google ScholarDigital Library
- W. Winkler. The state of record linkage and current research problems. In Technical Report, US Bureau of Census, 1999.Google Scholar
- C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In Proceedings of WWW, 2008. Google ScholarDigital Library
Recommendations
A novel similarity/dissimilarity measure for intuitionistic fuzzy sets and its application in pattern recognition
Among the most interesting measures in intuitionistic fuzzy sets (IFSs) theory, the similarity measure is an essential tool to compare and determine degree of similarity between IFSs. Although there exist many similarity measures for IFSs, most of them ...
Single-valued neutrosophic similarity measures based on cotangent function and their application in the fault diagnosis of steam turbine
Similarity measure is an important tool in pattern recognition and fault diagnosis. This paper proposes two cotangent similarity measures for single-valued neutrosophic sets (SVNSs) based on cotangent function. Then, the weighted cotangent similarity ...
Elementary theory of similarity and its application in biology and geography: Multiple-site measures of similarity and dissimilarity
A system of axioms of multiple-site similarity measures is considered that is defined on a series of finite descriptive sets. The concept of equivalence of multiple-site similarity measures is proposed. Equivalence classes of multiple-site similarity ...
Comments