ABSTRACT
We consider the problem of indexing high-dimensional data for answering (approximate) similarity-search queries. Similarity indexes prove to be important in a wide variety of settings: Web search engines desire fast, parallel, main-memory-based indexes for similarity search on text data; database systems desire disk-based similarity indexes for high-dimensional data, including text and images; peer-to-peer systems desire distributed similarity indexes with low communication cost. We propose an indexing scheme called LSH Forest which is applicable in all the above contexts. Our index uses the well-known technique of locality-sensitive hashing (LSH), but improves upon previous designs by (a) eliminating the different data-dependent parameters for which LSH must be constantly hand-tuned, and (b) improving on LSH's performance guarantees for skewed data distributions while retaining the same storage and query overhead. We show how to construct this index in main memory, on disk, in parallel systems, and in peer-to-peer systems. We evaluate the design with experiments on multiple text corpora and demonstrate both the self-tuning nature and the superior performance of LSH Forest.
- K. Aberer. Scalable data access in p2p systems using unbalanced search trees. In Proc. WDAS, 2002.Google Scholar
- R. Bayer and K. Unterauer. Prefix b-trees. ACM Transactions on Database Systems, 2(1), 1977. Google ScholarDigital Library
- S. Berchtold, C. Bohm, and H.-P. Kriegel. The pyramid-technique: Towards breaking the curse of dimensionality. In Proc. of SIGMOD, 1998. Google ScholarDigital Library
- S. Berchtold, D. Keim, and H.-P. Kriegel. The x-tree: An index structure for high-dimensional data. In Proc. of VLDB, 1996. Google ScholarDigital Library
- K. Bharat and A. Broder. Mirror, mirror, on the web: A study of host pairs with replicated content. In Proc. of WWW, 1999. Google ScholarDigital Library
- S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In Proc. of SIGMOD, 1995. Google ScholarDigital Library
- A. Broder. On the resemblance and containment of documents. In Proc. of Compression and Complexity of Sequences, 1998. Google ScholarDigital Library
- A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the web. In Proc. of WWW, 1997. Google ScholarDigital Library
- S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, and J. Kleinberg. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proc. of WWW, 1998. Google ScholarDigital Library
- J. Cho, N. Shivakumar, and H. Garcia-Molina. Finding replicated web collections. In Proc. of SIGMOD, 2000. Google ScholarDigital Library
- E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. Ullman, and C. Yang. Finding interesting associations without support pruning. In Proc. of ICDE, 2000. Google ScholarDigital Library
- B. Cooper, N. Sample, M. Franklin, and M. Shadmon. A fast index for semistructured data. In Proc. of VLDB, 2001. Google ScholarDigital Library
- B. Davison. Topical locality in the web. In Proc. of SIGIR, 2000. Google ScholarDigital Library
- J. Dean and M. Henzinger. Finding related pages in the world wide web. In Proc. of WWW, 1999. Google ScholarDigital Library
- C. Faloutsos and K.-I. Lin. Fastmap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In Proc. of SIGMOD, 1995. Google ScholarDigital Library
- M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. Ullman. Computing iceberg queries efficiently. In Proc. of VLDB, 1998. Google ScholarDigital Library
- P. Ferragina and R. Grossi. The string b-tree: A new data structure for string search in external memory and its applications. Journal of the ACM, 46(2), 1999. Google ScholarDigital Library
- P. Ganesan, M. Bawa, and H. Garcia-Molina. Online balancing of range-partitioned data with applications to peer-to-peer systems. In Proc. VLDB, 2004. Google ScholarDigital Library
- D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In Proc. of Hypertext and Hypermedia, 1998. Google ScholarDigital Library
- A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. of VLDB, 1999. Google ScholarDigital Library
- A. Gutman. R-trees: A dynamic index structure for spatial searching. In Proc. of SIGMOD, 1997. Google ScholarDigital Library
- T. Haveliwala, A. Gionis, D. Klein, and P. Indyk. Evaluating strategies for similarity search on the web. In Proc. of WWW, 2002. Google ScholarDigital Library
- P. Indyk and R. Motwani. Approximate nearest neighbor - towards removing the curse of dimensionality. In Proc. of STOC, 1998. Google ScholarDigital Library
- N. Katayama and S. Satoh. The r*-tree: An efficient and robust access method for points and rectangles. In Proc. of SIGMOD, 1997. Google ScholarDigital Library
- J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proc. of SODA, 1998. Google ScholarDigital Library
- L. Lee. Measures of distributional similarity. In Proc. of ACL, 1999. Google ScholarDigital Library
- D. Lomet. The evolution of effective b-tree: Page organization and techniques: A personal account. SIGMOD Record, 30(3), 2001. Google ScholarDigital Library
- E. McCreight. Pagination of b*-trees with variable-length records. Communications of the ACM, 20(9), 1977. Google ScholarDigital Library
- D. Morrison. PATRICIA - practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM, 15(4):514--534, 1968. Google ScholarDigital Library
- J. Robinson. The k-d-b tree: A search structure for large multidimensional indexes. In Proc. of SIGMOD, 1981. Google ScholarDigital Library
- N. Shivakumar and H. Garcia-Molina. Building a scalable and accurate copy detection mechanism. In Proc. of ACM DL, 1996. Google ScholarDigital Library
- R. Weber, H. Schek, and S. Blott. A quantitative analysis and performance study for similarity search methods in high dimensional spaces. In Proc. VLDB, 1998. Google ScholarDigital Library
- D. White and R. Jain. Similarity indexing with ss-tree. In Proc. of ICDE, 1996. Google ScholarDigital Library
Index Terms
- LSH forest: self-tuning indexes for similarity search
Recommendations
Multi-probe LSH: efficient indexing for high-dimensional similarity search
VLDB '07: Proceedings of the 33rd international conference on Very large data basesSimilarity indices for high-dimensional data are very desirable for building content-based search systems for feature-rich data such as audio, images, videos, and other sensor data. Recently, locality sensitive hashing (LSH) and its variations have been ...
Dynamic Multi-probe LSH: An I/O Efficient Index Structure for Approximate Nearest Neighbor Search
DEXA 2013: Proceedings of the 24th International Conference on Database and Expert Systems Applications - Volume 8055Locality-Sensitive Hashing LSH is widely used to solve approximate nearest neighbor search problems in high-dimensional spaces. The basic idea is to map the "nearby" objects into a same hash bucket with high probability. A significant drawback is that ...
A query-adaptive partial distributed hash table for peer-to-peer systems
EDBT'04: Proceedings of the 2004 international conference on Current Trends in Database TechnologyThe two main approaches to find data in peer-to-peer (P2P) systems are unstructured networks using flooding and structured networks using a distributed index A distributed index is usually built over all keys that are stored in the network whether they ...
Comments