skip to main content
research-article

ClusterJoin: a similarity joins framework using map-reduce

Published:01 August 2014Publication History
Skip Abstract Section

Abstract

Similarity join is the problem of finding pairs of records with similarity score greater than some threshold. In this paper we study the problem of scaling up similarity join for different metric distance functions using MapReduce. We propose a ClusterJoin framework that partitions the data space based on the underlying data distribution, and distributes each record to partitions in which they may produce join results based on the distance threshold. We design a set of strong candidate filters specific to different distance functions using a novel bisector-based framework, so that each record only needs to be distributed to a small number of partitions while still guaranteeing correctness. To address data skewness, which is common for high dimensional data, we further develop a dynamic load balancing scheme using sampling, which provides strong probabilistic guarantees on the size of partitions, and greatly improves scalability. Experimental evaluation using real data sets shows that our approach is considerably more scalable compared to state-of-the-art algorithms, especially for high dimensional data with low distance thresholds.

References

  1. Linkedgeodata: http://linkedgeodata.org.Google ScholarGoogle Scholar
  2. Openstreetmap: www.openstreetmap.org.Google ScholarGoogle Scholar
  3. F. Afrati, A. Das Sarma, A. Rajaraman, P. Rule, S. Salihoglu, and J. Ullman. Anchor points algorithms for hamming and edit distance. In Proceedings of ICDT, 2014.Google ScholarGoogle Scholar
  4. F. N. Afrati, A. D. Sarma, D. Menestrina, A. G. Parameswaran, and J. D. Ullman. Fuzzy joins using mapreduce. In Proceedings of ICDE, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In Proceedings of VLDB, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In Proceedings of WWW, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is "nearest neighbor" meaningful? In Proceedings of ICDT, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In Computer Networks, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. V. Bryant. Metric Spaces: Iteration and Application. Cambridge University Press, 1985.Google ScholarGoogle ScholarCross RefCross Ref
  10. R. Chaiken, B. Jenkins, P. Larson, and B. Ramsey. Scope: Easy and efficient parallel processing of massive data sets. In Proceedings of VLDB, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In Proceedings of ICDE, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In Communication of ACM, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. D. Deng, G. Li, S. Hao, J. Wang, and J. Feng. Massjoin: A mapreduce-based algorithm for string similarity joins. In Proceedings of ICDE, 2013.Google ScholarGoogle Scholar
  14. A. Doan, A. Halevy, and Z. Ives. Principles of data integration. In Morgan Kaufmann, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. D. M. Endres and J. E. Schindelin. A new metric for probability distributions. In IEEE Trans. Inf. Theory, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. M. R. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In Proceedings of SIGIR, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. T. C. Hoad and J. Zobel. Methods for identifying versioned and plagiarized documents. In JASIST 54(3), 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. A. Metwally and C. Faloutsos. V-smart-join: A scalable mapreduce framework for all-pair similarity joins of multisets and vectors. In Proceedings of VLDB, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. D. A. Nehme-Haily. The set-union knapsack problem. University of Texas at Austin, Thesis, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. A. Okabe, B. Boots, K. Sugihara, and S. N. Chiu. Spatial tessellations: Concepts and applications of voronoi diagrams. 2009.Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. A. Okcan and M. Riedewald. Processing theta-joins using mapreduce. In Proceedings of SIGMOD, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. V. Satuluri and S. Parthasarathy. Bayesian locality sensitive hashing for fast similarity search. In Proceedings of VLDB, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Y. N. Silva and J. M. Reed. Exploiting mapreduce-based similarity joins. In Proceedings of SIGMOD, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using mapreduce. In Proceedings of SIGMOD, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Y. Wang, A. Metwally, and S. Parthasarathy. Scalable all-pairs similarity search in metric spaces. In Proceedings of KDD, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. W. Winkler. The state of record linkage and current research problems. In Technical Report, US Bureau of Census, 1999.Google ScholarGoogle Scholar
  27. C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In Proceedings of WWW, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 7, Issue 12
    August 2014
    296 pages
    ISSN:2150-8097
    Issue’s Table of Contents

    Publisher

    VLDB Endowment

    Publication History

    • Published: 1 August 2014
    Published in pvldb Volume 7, Issue 12

    Qualifiers

    • research-article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader