Article

LSH forest: self-tuning indexes for similarity search

Authors:
Mayank Bawa

Stanford University, Stanford, CA

Stanford University, Stanford, CA
View Profile

,
Tyson Condie

U. C. Berkeley, Berkeley, CA

U. C. Berkeley, Berkeley, CA
View Profile

,
Prasanna Ganesan

Stanford University, Stanford, CA

Stanford University, Stanford, CA
View Profile

WWW '05: Proceedings of the 14th international conference on World Wide WebMay 2005Pages 651–660https://doi.org/10.1145/1060745.1060840

Published:10 May 2005Publication History

WWW '05: Proceedings of the 14th international conference on World Wide Web

Pages 651–660

ABSTRACT

We consider the problem of indexing high-dimensional data for answering (approximate) similarity-search queries. Similarity indexes prove to be important in a wide variety of settings: Web search engines desire fast, parallel, main-memory-based indexes for similarity search on text data; database systems desire disk-based similarity indexes for high-dimensional data, including text and images; peer-to-peer systems desire distributed similarity indexes with low communication cost. We propose an indexing scheme called LSH Forest which is applicable in all the above contexts. Our index uses the well-known technique of locality-sensitive hashing (LSH), but improves upon previous designs by (a) eliminating the different data-dependent parameters for which LSH must be constantly hand-tuned, and (b) improving on LSH's performance guarantees for skewed data distributions while retaining the same storage and query overhead. We show how to construct this index in main memory, on disk, in parallel systems, and in peer-to-peer systems. We evaluate the design with experiments on multiple text corpora and demonstrate both the self-tuning nature and the superior performance of LSH Forest.

References

K. Aberer. Scalable data access in p2p systems using unbalanced search trees. In Proc. WDAS, 2002.Google Scholar
R. Bayer and K. Unterauer. Prefix b-trees. ACM Transactions on Database Systems, 2(1), 1977. Google ScholarDigital Library
S. Berchtold, C. Bohm, and H.-P. Kriegel. The pyramid-technique: Towards breaking the curse of dimensionality. In Proc. of SIGMOD, 1998. Google ScholarDigital Library
S. Berchtold, D. Keim, and H.-P. Kriegel. The x-tree: An index structure for high-dimensional data. In Proc. of VLDB, 1996. Google ScholarDigital Library
K. Bharat and A. Broder. Mirror, mirror, on the web: A study of host pairs with replicated content. In Proc. of WWW, 1999. Google ScholarDigital Library
S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In Proc. of SIGMOD, 1995. Google ScholarDigital Library
A. Broder. On the resemblance and containment of documents. In Proc. of Compression and Complexity of Sequences, 1998. Google ScholarDigital Library
A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the web. In Proc. of WWW, 1997. Google ScholarDigital Library
S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, and J. Kleinberg. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proc. of WWW, 1998. Google ScholarDigital Library
J. Cho, N. Shivakumar, and H. Garcia-Molina. Finding replicated web collections. In Proc. of SIGMOD, 2000. Google ScholarDigital Library
E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. Ullman, and C. Yang. Finding interesting associations without support pruning. In Proc. of ICDE, 2000. Google ScholarDigital Library
B. Cooper, N. Sample, M. Franklin, and M. Shadmon. A fast index for semistructured data. In Proc. of VLDB, 2001. Google ScholarDigital Library
B. Davison. Topical locality in the web. In Proc. of SIGIR, 2000. Google ScholarDigital Library
J. Dean and M. Henzinger. Finding related pages in the world wide web. In Proc. of WWW, 1999. Google ScholarDigital Library
C. Faloutsos and K.-I. Lin. Fastmap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In Proc. of SIGMOD, 1995. Google ScholarDigital Library
M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. Ullman. Computing iceberg queries efficiently. In Proc. of VLDB, 1998. Google ScholarDigital Library
P. Ferragina and R. Grossi. The string b-tree: A new data structure for string search in external memory and its applications. Journal of the ACM, 46(2), 1999. Google ScholarDigital Library
P. Ganesan, M. Bawa, and H. Garcia-Molina. Online balancing of range-partitioned data with applications to peer-to-peer systems. In Proc. VLDB, 2004. Google ScholarDigital Library
D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In Proc. of Hypertext and Hypermedia, 1998. Google ScholarDigital Library
A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. of VLDB, 1999. Google ScholarDigital Library
A. Gutman. R-trees: A dynamic index structure for spatial searching. In Proc. of SIGMOD, 1997. Google ScholarDigital Library
T. Haveliwala, A. Gionis, D. Klein, and P. Indyk. Evaluating strategies for similarity search on the web. In Proc. of WWW, 2002. Google ScholarDigital Library
P. Indyk and R. Motwani. Approximate nearest neighbor - towards removing the curse of dimensionality. In Proc. of STOC, 1998. Google ScholarDigital Library
N. Katayama and S. Satoh. The r*-tree: An efficient and robust access method for points and rectangles. In Proc. of SIGMOD, 1997. Google ScholarDigital Library
J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proc. of SODA, 1998. Google ScholarDigital Library
L. Lee. Measures of distributional similarity. In Proc. of ACL, 1999. Google ScholarDigital Library
D. Lomet. The evolution of effective b-tree: Page organization and techniques: A personal account. SIGMOD Record, 30(3), 2001. Google ScholarDigital Library
E. McCreight. Pagination of b*-trees with variable-length records. Communications of the ACM, 20(9), 1977. Google ScholarDigital Library
D. Morrison. PATRICIA - practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM, 15(4):514--534, 1968. Google ScholarDigital Library
J. Robinson. The k-d-b tree: A search structure for large multidimensional indexes. In Proc. of SIGMOD, 1981. Google ScholarDigital Library
N. Shivakumar and H. Garcia-Molina. Building a scalable and accurate copy detection mechanism. In Proc. of ACM DL, 1996. Google ScholarDigital Library
R. Weber, H. Schek, and S. Blott. A quantitative analysis and performance study for similarity search methods in high dimensional spaces. In Proc. VLDB, 1998. Google ScholarDigital Library
D. White and R. Jain. Similarity indexing with ss-tree. In Proc. of ICDE, 1996. Google ScholarDigital Library

Index Terms

LSH forest: self-tuning indexes for similarity search
1. Information systems
  1. Information retrieval
  2. Information storage systems
    1. Storage architectures
      1. Distributed storage
      2. Storage network architectures

Recommendations

Multi-probe LSH: efficient indexing for high-dimensional similarity search
VLDB '07: Proceedings of the 33rd international conference on Very large data bases

Similarity indices for high-dimensional data are very desirable for building content-based search systems for feature-rich data such as audio, images, videos, and other sensor data. Recently, locality sensitive hashing (LSH) and its variations have been ...
Read More
Dynamic Multi-probe LSH: An I/O Efficient Index Structure for Approximate Nearest Neighbor Search
DEXA 2013: Proceedings of the 24th International Conference on Database and Expert Systems Applications - Volume 8055

Locality-Sensitive Hashing LSH is widely used to solve approximate nearest neighbor search problems in high-dimensional spaces. The basic idea is to map the "nearby" objects into a same hash bucket with high probability. A significant drawback is that ...
Read More
A query-adaptive partial distributed hash table for peer-to-peer systems
EDBT'04: Proceedings of the 2004 international conference on Current Trends in Database Technology

The two main approaches to find data in peer-to-peer (P2P) systems are unstructured networks using flooding and structured networks using a distributed index A distributed index is usually built over all keys that are stored in the network whether they ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
WWW '05: Proceedings of the 14th international conference on World Wide Web
May 2005
781 pages
ISBN:1595930469
DOI:10.1145/1060745
Conference Chairs:
Allan Ellis
Southern Cross University
,
Tatsuya Hagino
Keio University
Copyright © 2005 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 10 May 2005
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
peer-to-peer (P2P)
similarity indexes
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate1,899of8,196submissions,23%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 260
  Total Citations
  View Citations
- 1,650
  Total Downloads
- Downloads (Last 12 months)78
- Downloads (Last 6 weeks)14
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

LSH forest: self-tuning indexes for similarity search

WWW '05: Proceedings of the 14th international conference on World Wide Web

ABSTRACT

References

Cited By

Index Terms

Recommendations

Multi-probe LSH: efficient indexing for high-dimensional similarity search

Dynamic Multi-probe LSH: An I/O Efficient Index Structure for Approximate Nearest Neighbor Search

A query-adaptive partial distributed hash table for peer-to-peer systems

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

LSH forest: self-tuning indexes for similarity search

WWW '05: Proceedings of the 14th international conference on World Wide Web

ABSTRACT

References

Cited By

Index Terms

Recommendations

Multi-probe LSH: efficient indexing for high-dimensional similarity search

Dynamic Multi-probe LSH: An I/O Efficient Index Structure for Approximate Nearest Neighbor Search

A query-adaptive partial distributed hash table for peer-to-peer systems

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media