ABSTRACT
All pairs similarity search is used in many web search and data mining applications. Previous work has used comparison filtering, inverted indexing, and parallel accumulation of partial intermediate results to expedite its execution. However, shuffling intermediate results can incur significant communication overhead as data scales up. This paper studies a scalable two-step approach called Partition-based Similarity Search (PSS) which incorporates several optimization techniques. First, PSS uses a static partitioning algorithm that places dissimilar vectors into different groups and balance the comparison workload with a circular assignment. Second, PSS executes comparison tasks in parallel, each using a hybrid data structure that combines the advantages of forward and inverted indexing. Our evaluation results show that the proposed approach leads to an early elimination of unnecessary I/O and data communication while sustaining parallel efficiency. As a result, it improves performance by an order of magnitude when dealing with large datasets.
- G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, june 2005. Google ScholarDigital Library
- Arvind Arasu, Venkatesh Ganti, and Raghav Kaushik. Efficient exact set-similarity joins. In Proceedings of the 32nd international conference on Very large data bases, VLDB '06, 2006. Google ScholarDigital Library
- Language Technologies Institute at Carnegie Mellon University. The clueweb09 dataset, http://boston.lti.cs.cmu.edu/data/clueweb09.Google Scholar
- Amit Awekar and Nagiza F. Samatova. Fast matching for all pairs similarity search. In International Conference on Web Intelligence and Intelligent Agent Technology Workshop, 2009. Google ScholarDigital Library
- Amit Awekar and Nagiza F. Samatova. Parallel all pairs similarity search. In International Conference on Information Knowledge Engineering, 2010.Google Scholar
- Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999. Google ScholarDigital Library
- Ranieri Baraglia, Gianmarco De Francisci Morales, and Claudio Lucchese. Document similarity self-join with mapreduce. In ICDM, 2010. Google ScholarDigital Library
- Roberto J. Bayardo, Yiming Ma, and Ramakrishnan Srikant. Scaling up all pairs similarity search. In Proceedings of WWW, 2007. Google ScholarDigital Library
- Abdur Chowdhury, Ophir Frieder, David A. Grossman, and M. Catherine McCabe. Collection statistics for fast duplicate document detection. ACM Trans. Inf. Syst., 20(2):171--191, 2002. Google ScholarDigital Library
- Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters. In Proc. of OSDI 2004. Google ScholarDigital Library
- Jimmy Lin Ferhan Ture1, Tamer Elsayed2. No free lunch: Brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity. 2011.Google Scholar
- Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999. Google ScholarDigital Library
- Hannaneh Hajishirzi, Wen tau Yih, and Aleksander Kolcz. Adaptive near-duplicate detection via similarity learning. In Proc. of ACM SIGIR, pages 419--426, 2010. Google ScholarDigital Library
- Nitin Jindal and Bing Liu. Opinion spam and analysis. In Proceedings of the international conference on Web search and web data mining, WSDM '08, pages 219--230, 2008. Google ScholarDigital Library
- Aleksander Kolcz, Abdur Chowdhury, and Joshua Alspector. Improved robustness of signature-based near-replica detection via lexicon randomization. In Proceedings of KDD, pages 605--610, 2004. Google ScholarDigital Library
- Jimmy Lin. Brute force and indexed approaches to pairwise document similarity comparisons with mapreduce. In SIGIR, 2009. Google ScholarDigital Library
- Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Detectives: detecting coalition hit inflation attacks in advertising networks streams. In Proceedings of the 16th international conference on World Wide Web, WWW '07, pages 241--250, 2007. Google ScholarDigital Library
- Gianmarco De Francisci Morales, Claudio Lucchese, and Ranieri Baraglia. Scaling out all pairs similarity search with mapreduce. In 8th Workshop on LargeScale Distributed Systems for Information Retrieval (2010), 2010.Google Scholar
- Mehran Sahami and Timothy D. Heilman. A web-based kernel function for measuring the similarity of short text snippets. In Proceedings of the 15th international conference on World Wide Web, WWW '06, pages 377--386, 2006. Google ScholarDigital Library
- Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., 1986. Google ScholarDigital Library
- Narayanan Shivakumar and Hector Garcia-Molina. Building a scalable and accurate copy detection mechanism. In Proceedings of the first ACM international conference on Digital libraries, DL '96, pages 160--168, 1996. Google ScholarDigital Library
- Rares Vernica, Michael J. Carey, and Chen Li. Efficient parallel set-similarity joins using mapreduce. In Proceedings of the 2010 international conference on Management of data, SIGMOD '10, 2010. Google ScholarDigital Library
- Chuan Xiao, Wei Wang, Xuemin Lin, and Jeffrey Xu Yu. Efficient similarity joins for near duplicate detection. In Proceeding of the 17th international conference on World Wide Web, WWW '08. ACM, 2008. Google ScholarDigital Library
- Shanzhong Zhu, Alexandra Potapova, Maha Alabduljalil, Xin Liu, and Tao Yang. Clustering and load balancing optimization for redundant content removal. In Proc. of WWW 2012 (21th international conference on World Wide Web), Industry Track, 2012. Google ScholarDigital Library
- Justin Zobel and Alistair Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2), July 2006. Google ScholarDigital Library
Index Terms
- Optimizing parallel algorithms for all pairs similarity search
Recommendations
Load balancing for partition-based similarity search
SIGIR '14: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrievalAll pairs similarity search, used in many data mining and information retrieval applications, is a time consuming process. Although a partition-based approach accelerates this process by simplifying parallelism management and avoiding unnecessary I/O ...
A parallel BSP algorithm for irregular dynamic programming
APPT'07: Proceedings of the 7th international conference on Advanced parallel processing technologiesDynamic programming is a widely applied algorithm design technique in many areas such as computational biology and scientific computing. Typical applications using this technique are compute-intensive and suffer from long runtimes on sequential ...
Tree-Based Parallel Load-Balancing Methods for Solution-Adaptive Finite Element Graphs on Distributed Memory Multicomputers
To solve the load imbalance problem of a solution-adaptive finite element application program on a distributed memory multicomputer, nodes of a refined finite element graph can be remapped to processors or load of a refined finite element graph can be ...
Comments