skip to main content
research-article

V-SMART-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors

Published:01 April 2012Publication History
Skip Abstract Section

Abstract

This work proposes V-SMART-Join, a scalable MapReduce-based framework for discovering all pairs of similar entities. The V-SMART-Join framework is applicable to sets, multisets, and vectors. V-SMART-Join is motivated by the observed skew in the underlying distributions of Internet traffic, and is a family of 2-stage algorithms, where the first stage computes and joins the partial results, and the second stage computes the similarity exactly for all candidate pairs. The V-SMART-Join algorithms are very efficient and scalable in the number of entities, as well as their cardinalities. They were up to 30 times faster than the state of the art algorithm, VCL, when compared on a real dataset of a small size. We also established the scalability of the proposed algorithms by running them on a dataset of a realistic size, on which VCL never succeeded to finish. Experiments were run using real datasets of IPs and cookies, where each IP is represented as a multiset of cookies, and the goal is to discover similar IPs to identify Internet proxies.

References

  1. Apache Hadoop. http://hadoop.apache.org.Google ScholarGoogle Scholar
  2. A. Arasu, V. Ganti, and R. Kaushik. Efficient Exact Set-Similarity Joins. In Proceedings of the 32nd VLDB International Conference on Very Large Data Bases, pages 918--929, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. R. Baraglia, G. De Francisci Morales, and C. Lucchese. Document Similarity Self-Join with MapReduce. In Proceedings of the 10th IEEE ICDM International Conference on Data Mining, pages 731--736, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. R. Bayardo, Y. Ma, and R. Srikant. Scaling Up All Pairs Similarity Search. In Proceedings of the 16th WWW International Conference on World Wide Web, pages 131--140, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. A. Broder. On the Resemblance and Containment of Documents. In Proceedings of the IEEE SEQUENCES Compression and Complexity of Sequences, pages 21--29, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic Clustering of the Web. In Proceedings of the 6th WWW International Conference on World Wide Web, pages 391--404, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. S.-H. Cha. Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4):300--307, 2007.Google ScholarGoogle Scholar
  8. S.-H. Cha and S. Srihari. On Measuring the Distance between Histograms. Pattern Recognition, 35(6):1355--1370, 2002.Google ScholarGoogle ScholarCross RefCross Ref
  9. M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In Proceedings of the 34th ACM STOC Symposium on Theory Of Computing, pages 380--388, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. S. Chaudhuri, V. Ganti, and R. Kaushik. A Primitive Operator for Similarity Joins in Data Cleaning. In Proceedings of the 22nd IEEE International Conference on Data Engineering, page 5, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. J. Dean and S. Ghemawat. Mapreduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th USENIX OSDI Symposium on Operating System Design and Implementation, pages 137--150, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. eHarmony Dating Site. http://www.eharmony.com.Google ScholarGoogle Scholar
  13. T. Elsayed, J. Lin, and D. Oard. Pairwise Document Similarity in Large Collections with MapReduce. In Proceedings of the 46th HLT Meeting of the ACL on Human Language Technologies: Short Papers, pages 265--268, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. In Proceedings of the 19th ACM SOSP Symposium on Operating Systems Principles, pages 29--43, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. A. Gibbs and F. Su. On Choosing and Bounding Probability Metrics. The International Statistical Review, 70(3):419--435, 2002.Google ScholarGoogle Scholar
  16. K. Grauman and T. Darrell. Approximate Correspondences in High Dimensions. In Proceedings of the 16th WWW International Conference on World Wide Web, pages 505--512, 2006.Google ScholarGoogle Scholar
  17. M. Henzinger. Finding near-duplicate web pages: A large-scale evaluation of algorithms. In Proceedings of the 29th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 284--291, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of the 19th ACM STOC Symposium on Theory Of Computing, pages 604--613, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. C. Li, J. Lu, and Y. Lu. Efficient Merging and Filtering Algorithms for Approximate String Searches. In Proceedings of the ICDE 42nd IEEE International Conference on Data Engineering, pages 257--266, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Y.-R. Lin, H. Sundaram, Y. Chi, J. Tatemura, and B. Tseng. Blog Community Discovery and Evolution Based on Mutual Awareness Expansion. In Proceedings of the 6th IEEE/WIC/ACM WI International Conference on Web Intelligence, pages 48--56, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. J. Linn and C. Dyer. Data-Intensive Text Processing with MapReduce. Synthesis Lectures on Human Language Technologies Series. Morgan & Claypool Publishers, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. A. Metwally, D. Agrawal, and A. El Abbadi. DETECTIVES: DETEcting Coalition hiT Inflation attacks in adVertising nEtworks Streams. In Proceedings of the 16th WWW International Conference on World Wide Web, pages 241--250, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. A. Metwally, F. Emekçi, D. Agrawal, and A. El Abbadi. SLEUTH: Single-pubLisher attack dEtection Using correlaTion Hunting. Proceedings of the VLDB Endowment, 1(2):1217--1228, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. A. Metwally and M. Paduano. Estimating the Number of Users Behind IP Addresses for Combating Abusive Traffic. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 249--257, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. In Proceedings of the 28th ACM SIGMOD International Conference on Management of Data, pages 1099--1110, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the Data: Parallel Analysis with Sawzall. Scientific Programming, 13(4):277--298, October 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. J. Ruan and W. Zhang. An Efficient Spectral Algorithm for Network Community Discovery and Its Applications to Biological and Social Networks. In Proceedings of the 7th IEEE ICDM International Conference on Data Mining, pages 643--648, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Y. Rubner, C. Tomasi, and L. Guibas. A Metric for Distributions with Applications to Image Databases. In Proceedings of the 6th IEEE ICCV International Conference on Computer Vision, pages 59--66, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. S. Sarawagi and A. Kirpal. Efficient Set Joins on Similarity Predicates. In Proceedings of the 24th ACM SIGMOD International Conference on Management of Data, pages 743--754, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. V. Satuluri and S. Parthasarathy. Scalable Graph Clustering Using Stochastic Flows: Applications to Community Discovery. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 737--746, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. R. Stanley. Enumerative Combinatorics, volume 1. Cambridge University Press, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. A. Thusoo, J. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive - A Warehousing Solution Over a Map-Reduce Framework. Proceedings of the VLDB Endowment, 2(2):1626--1629, August 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. R. Vernica, M. Carey, and C. Li. Efficient Parallel Set-Similarity Joins Using MapReduce. In Proceedings of the 30th ACM SIGMOD International Conference on Management of Data, pages 495--506, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. C. Xiao, W. Wang, X. Lin, and J. Yu. Efficient Similarity Joins for Near Duplicate Detection. In Proceedings of the 18th WWW International Conference on World Wide Web, pages 131--140, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. Gunda, and J. Currey. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. In Proceedings of the 8th USENIX OSDI Conference on Operating Systems Design and Implementation, pages 1--14, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. H. Zhang, C. Giles, H. Foley, and J. Yen. Probabilistic Community Discovery Using Hierarchical Latent Gaussian Mixture Model. In Proceedings of the 22nd AAAI National Conference on Artificial Intelligence, pages 663--668, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 5, Issue 8
    April 2012
    96 pages

    Publisher

    VLDB Endowment

    Publication History

    • Published: 1 April 2012
    Published in pvldb Volume 5, Issue 8

    Qualifiers

    • research-article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader