skip to main content
10.1145/1060745.1060840acmconferencesArticle/Chapter ViewAbstractPublication PageswwwConference Proceedingsconference-collections
Article

LSH forest: self-tuning indexes for similarity search

Published:10 May 2005Publication History

ABSTRACT

We consider the problem of indexing high-dimensional data for answering (approximate) similarity-search queries. Similarity indexes prove to be important in a wide variety of settings: Web search engines desire fast, parallel, main-memory-based indexes for similarity search on text data; database systems desire disk-based similarity indexes for high-dimensional data, including text and images; peer-to-peer systems desire distributed similarity indexes with low communication cost. We propose an indexing scheme called LSH Forest which is applicable in all the above contexts. Our index uses the well-known technique of locality-sensitive hashing (LSH), but improves upon previous designs by (a) eliminating the different data-dependent parameters for which LSH must be constantly hand-tuned, and (b) improving on LSH's performance guarantees for skewed data distributions while retaining the same storage and query overhead. We show how to construct this index in main memory, on disk, in parallel systems, and in peer-to-peer systems. We evaluate the design with experiments on multiple text corpora and demonstrate both the self-tuning nature and the superior performance of LSH Forest.

References

  1. K. Aberer. Scalable data access in p2p systems using unbalanced search trees. In Proc. WDAS, 2002.Google ScholarGoogle Scholar
  2. R. Bayer and K. Unterauer. Prefix b-trees. ACM Transactions on Database Systems, 2(1), 1977. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. S. Berchtold, C. Bohm, and H.-P. Kriegel. The pyramid-technique: Towards breaking the curse of dimensionality. In Proc. of SIGMOD, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. S. Berchtold, D. Keim, and H.-P. Kriegel. The x-tree: An index structure for high-dimensional data. In Proc. of VLDB, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. K. Bharat and A. Broder. Mirror, mirror, on the web: A study of host pairs with replicated content. In Proc. of WWW, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In Proc. of SIGMOD, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. A. Broder. On the resemblance and containment of documents. In Proc. of Compression and Complexity of Sequences, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the web. In Proc. of WWW, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, and J. Kleinberg. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proc. of WWW, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. J. Cho, N. Shivakumar, and H. Garcia-Molina. Finding replicated web collections. In Proc. of SIGMOD, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. Ullman, and C. Yang. Finding interesting associations without support pruning. In Proc. of ICDE, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. B. Cooper, N. Sample, M. Franklin, and M. Shadmon. A fast index for semistructured data. In Proc. of VLDB, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. B. Davison. Topical locality in the web. In Proc. of SIGIR, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. J. Dean and M. Henzinger. Finding related pages in the world wide web. In Proc. of WWW, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. C. Faloutsos and K.-I. Lin. Fastmap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In Proc. of SIGMOD, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. Ullman. Computing iceberg queries efficiently. In Proc. of VLDB, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. P. Ferragina and R. Grossi. The string b-tree: A new data structure for string search in external memory and its applications. Journal of the ACM, 46(2), 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. P. Ganesan, M. Bawa, and H. Garcia-Molina. Online balancing of range-partitioned data with applications to peer-to-peer systems. In Proc. VLDB, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In Proc. of Hypertext and Hypermedia, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. of VLDB, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. A. Gutman. R-trees: A dynamic index structure for spatial searching. In Proc. of SIGMOD, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. T. Haveliwala, A. Gionis, D. Klein, and P. Indyk. Evaluating strategies for similarity search on the web. In Proc. of WWW, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. P. Indyk and R. Motwani. Approximate nearest neighbor - towards removing the curse of dimensionality. In Proc. of STOC, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. N. Katayama and S. Satoh. The r*-tree: An efficient and robust access method for points and rectangles. In Proc. of SIGMOD, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proc. of SODA, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. L. Lee. Measures of distributional similarity. In Proc. of ACL, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. D. Lomet. The evolution of effective b-tree: Page organization and techniques: A personal account. SIGMOD Record, 30(3), 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. E. McCreight. Pagination of b*-trees with variable-length records. Communications of the ACM, 20(9), 1977. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. D. Morrison. PATRICIA - practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM, 15(4):514--534, 1968. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. J. Robinson. The k-d-b tree: A search structure for large multidimensional indexes. In Proc. of SIGMOD, 1981. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. N. Shivakumar and H. Garcia-Molina. Building a scalable and accurate copy detection mechanism. In Proc. of ACM DL, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. R. Weber, H. Schek, and S. Blott. A quantitative analysis and performance study for similarity search methods in high dimensional spaces. In Proc. VLDB, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. D. White and R. Jain. Similarity indexing with ss-tree. In Proc. of ICDE, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. LSH forest: self-tuning indexes for similarity search

              Recommendations

              Comments

              Login options

              Check if you have access through your login credentials or your institution to get full access on this article.

              Sign in
              • Published in

                cover image ACM Conferences
                WWW '05: Proceedings of the 14th international conference on World Wide Web
                May 2005
                781 pages
                ISBN:1595930469
                DOI:10.1145/1060745

                Copyright © 2005 ACM

                Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                Publisher

                Association for Computing Machinery

                New York, NY, United States

                Publication History

                • Published: 10 May 2005

                Permissions

                Request permissions about this article.

                Request Permissions

                Check for updates

                Qualifiers

                • Article

                Acceptance Rates

                Overall Acceptance Rate1,899of8,196submissions,23%

              PDF Format

              View or Download as a PDF file.

              PDF

              eReader

              View online with eReader.

              eReader