research-article

Efficient and accurate nearest neighbor and closest pair search in high-dimensional space

Authors:
Yufei Tao

Chinese University of Hong Kong, Sha Tin, Hong Kong

Chinese University of Hong Kong, Sha Tin, Hong Kong
View Profile

,
Ke Yi

Hong Kong University of Science and Technology, Hong Kong

Hong Kong University of Science and Technology, Hong Kong
View Profile

,
Cheng Sheng

Chinese University of Hong Kong, Sha Tin, Hong Kong

Chinese University of Hong Kong, Sha Tin, Hong Kong
View Profile

,
Panos Kalnis

King Abdullah University of Science and Technology, Thuwal, Saudi Arabia

King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
View Profile

Authors Info & Claims

ACM Transactions on Database Systems Volume 35 Issue 3Article No.: 20pp 1–46https://doi.org/10.1145/1806907.1806912

Published:30 July 2010Publication History

ACM Transactions on Database Systems

Abstract

Nearest Neighbor (NN) search in high-dimensional space is an important problem in many applications. From the database perspective, a good solution needs to have two properties: (i) it can be easily incorporated in a relational database, and (ii) its query cost should increase sublinearly with the dataset size, regardless of the data and query distributions. Locality-Sensitive Hashing (LSH) is a well-known methodology fulfilling both requirements, but its current implementations either incur expensive space and query cost, or abandon its theoretical guarantee on the quality of query results.

Motivated by this, we improve LSH by proposing an access method called the Locality-Sensitive B-tree (LSB-tree) to enable fast, accurate, high-dimensional NN search in relational databases. The combination of several LSB-trees forms a LSB-forest that has strong quality guarantees, but improves dramatically the efficiency of the previous LSH implementation having the same guarantees. In practice, the LSB-tree itself is also an effective index which consumes linear space, supports efficient updates, and provides accurate query results. In our experiments, the LSB-tree was faster than: (i) iDistance (a famous technique for exact NN search) by two orders of magnitude, and (ii) MedRank (a recent approximate method with nontrivial quality guarantees) by one order of magnitude, and meanwhile returned much better results.

As a second step, we extend our LSB technique to solve another classic problem, called Closest Pair (CP) search, in high-dimensional space. The long-term challenge for this problem has been to achieve subquadratic running time at very high dimensionalities, which fails most of the existing solutions. We show that, using a LSB-forest, CP search can be accomplished in (worst-case) time significantly lower than the quadratic complexity, yet still ensuring very good quality. In practice, accurate answers can be found using just two LSB-trees, thus giving a substantial reduction in the space and running time. In our experiments, our technique was faster: (i) than distance browsing (a well-known method for solving the problem exactly) by several orders of magnitude, and (ii) than D-shift (an approximate approach with theoretical guarantees in low-dimensional space) by one order of magnitude, and at the same time, outputs better results.

References

Andoni, A. and Indyk, P. 2006. Near-Optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science (FOCS'06). 459--468. Google ScholarDigital Library
Arya, S., Mount, D. M., Netanyahu, N. S., Silverman, R., and Wu, A. Y. 1998. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. J. ACM 45, 6, 891--923. Google ScholarDigital Library
Athitsos, V., Potamias, M., Papapetrou, P., and Kollios, G. 2008. Nearest neighbor retrieval using distance-based hashing. In Proceedings of the International Conference on Data Engineering (ICDE'08). 327--336. Google ScholarDigital Library
Bawa, M., Condie, T., and Ganesan, P. 2005. Lsh forest: Self-Tuning indexes for similarity search. In Proceedings of the International World Wide Web Conference (WWW'05). 651--660. Google ScholarDigital Library
Beckmann, N., Kriegel, H., Schneider, R., and Seeger, B. 1990. The R&ast;-tree: An efficient and robust access method for points and rectangles. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 322--331. Google ScholarDigital Library
Bennett, K. P., Fayyad, U., and Geiger, D. 1999. Density-Based indexing for approximate nearest-neighbor queries. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 233--243. Google ScholarDigital Library
Berchtold, S., Bohm, C., Jagadish, H. V., Kriegel, H.-P., and Sander, J. 2000. Independent quantization: An index compression technique for high-dimensional data spaces. In Proceedings of the International Conference on Data Engineering (ICDE'00). 577--588. Google ScholarDigital Library
Berchtold, S., Keim, D. A., Kriegel, H.-P., and Seidl, T. 2000. Indexing the solution space: A new technique for nearest neighbor search in high-dimensional space. IEEE Trans. Knowl. Data Engin. 12, 1, 45--57. Google ScholarDigital Library
Beyer, K. S., Goldstein, J., Ramakrishnan, R., and Shaft, U. 1999. When is “nearest neighbor” meaningful&quest; In Proceedings of the International Conference on Database Theory (ICDT'99). 217--235. Google ScholarDigital Library
Bohm, C. 2000. A cost model for query processing in high dimensional data spaces. ACM Trans. Datab. Syst. 25, 2, 129--178. Google ScholarDigital Library
Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. 2000. Lof: Identifying density-based local outliers. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 93--104. Google ScholarDigital Library
Charikar, M. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC'02). Google ScholarDigital Library
Chaudhuri, S. and Gravano, L. 1999. Evaluating top-k selection queries. In Proceedings of the International Conference on Very Large Databases (VLDB'99). 397--410. Google ScholarDigital Library
Chen, C.-M. and Ling, Y. 2002. A sampling-based estimator for top-k query. In Proceedings of the Interational Conference on Data Engineering (ICDE'02). 617--627. Google ScholarDigital Library
Ciaccia, P. and Patella, M. 2000. Pac nearest neighbor queries: Approximate and controlled search in high-dimensional and metric spaces. In Proceedings of the International Conference on Data Engineering (ICDE'00). 244--255. Google ScholarDigital Library
Corral, A., Manolopoulos, Y., Theodoridis, Y., and Vassilakopoulos, M. 2000. Closest pair queries in spatial databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 189--200. Google ScholarDigital Library
Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. 2004. Locality-Sensitive hashing scheme based on p-stable distributions. In Proceedings of the Annual ACM Symposium on Computational Geometry (SoCG'04). 253--262. Google ScholarDigital Library
Fagin, R., Kumar, R., and Sivakumar, D. 2003. Efficient similarity search and classification via rank aggregation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 301--312. Google ScholarDigital Library
Fagin, R., Lotem, A., and Naor, M. 2001. Optimal aggregation algorithms for middleware. In Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'01). 102--113. Google ScholarDigital Library
Ferhatosmanoglu, H., Tuncel, E., Agrawal, D., and Abbadi, A. E. 2001. Approximate nearest neighbor searching in multimedia databases. In Proceedings of the International Conference on Data Engineering (ICDE'01). 503--511. Google ScholarDigital Library
Ferragina, P. and Grossi, R. 1999. The String B-tree: A new data structure for string search in external memory and its applications. J. ACM 46, 2, 236--280. Google ScholarDigital Library
Gaede, V. and Gunther, O. 1998. Multidimensional access methods. ACM Comput. Surv. 30, 2, 170--231. Google ScholarDigital Library
Gionis, A., Indyk, P., and Motwani, R. 1999. Similarity search in high dimensions via hashing. In Proceedings of the International Conference on Very Large Databases (VLDB'99). 518--529. Google ScholarDigital Library
Goldstein, J. and Ramakrishnan, R. 2000. Contrast plots and p-sphere trees: Space vs. time in nearest neighbour searches. In Proceedings of the International Conference on Very Large Databases (VLDB'00). 429--440. Google ScholarDigital Library
Guttman, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 47--57. Google ScholarDigital Library
Har-Peled, S. 2001. A replacement for voronoi diagrams of near linear size. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science (FOCS'01). 94--103. Google ScholarDigital Library
Hjaltason, G. R. and Samet, H. 1998. Incremental distance join algorithms for spatial databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 237--248. Google ScholarDigital Library
Hjaltason, G. R. and Samet, H. 1999. Distance browsing in spatial databases. ACM Trans. Datab. Syst. 24, 2, 265--318. Google ScholarDigital Library
Houle, M. E. and Sakuma, J. 2005. Fast approximate similarity search in extremely high- dimensional data sets. In Proceedings of the International Conference on Data Engineering (ICDE'05). 619--630. Google ScholarDigital Library
Indyk, P., Lewenstein, M., Lipsky, O., and Porat, E. 2004. Closest pair problems in very high dimensions. In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP'04). 782--792.Google Scholar
Indyk, P. and Motwani, R. 1998. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC'98). 604--613. Google ScholarDigital Library
Jagadish, H. V., Ooi, B. C., Tan, K.-L., Yu, C., and Zhang, R. 2005. idistance: An adaptive b+-tree based indexing method for nearest neighbor search. ACM Trans. Datab. Syst. 30, 2, 364--397. Google ScholarDigital Library
Kleinberg, J. M. 1997. Two algorithms for nearest-neighbor search in high dimensions. In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC'97). 599--608. Google ScholarDigital Library
Korn, F., Pagel, B.-U., and Faloutsos, C. 2001. On the 'dimensionality curse' and the 'self-similarity blessing'. IEEE Trans. Knowl. Data Engin. 13, 1, 96--111. Google ScholarDigital Library
Koudas, N., Ooi, B. C., Shen, H. T., and Tung, A. K. H. 2004. Ldc: Enabling search by partial distance in a hyper-dimensional space. In Proceedings of the International Conference on Data Engineering (ICDE'04). 6--17. Google ScholarDigital Library
Krauthgamer, R. and Lee, J. R. 2004. Navigating nets: Simple algorithms for proximity search. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'04). 798--807. Google ScholarDigital Library
Lenhof, H.-P. and Smid, M. 1992. Enumerating the k closest pairs optimally. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science (FOCS'92). 380--386. Google ScholarDigital Library
Li, C., Chang, E. Y., Garcia-Molina, H., and Wiederhold, G. 2002. Clustering for approximate similarity search in high-dimensional spaces. IEEE Trans. Knowl. Data Engin. 14, 4, 792--808. Google ScholarDigital Library
Lin, K.-I., Jagadish, H. V., and Faloutsos, C. 1994. The tv-tree: An index structure for high-dimensional data. VLDB J. 3, 4, 517--542. Google ScholarDigital Library
Lopez, M. A. and Liao, S. 2000. Finding k-closest-pairs efficiently for high dimensional data. In Proceedings of the Canadian Conference on Computational Geometry (CCCG'00). 197--204.Google Scholar
Lv, Q., Josephson, W., Wang, Z., Charikar, M., and Li, K. 2007. Multi-Probe lsh: Efficient indexing for high-dimensional similarity search. In Proceedings of the International Conference on Very Large Databases (VLDB'07). 950--961. Google ScholarDigital Library
Panigrahy, R. 2006. Entropy based nearest neighbor search in high dimensions. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'06). 1186--1195. Google ScholarDigital Library
Roussopoulos, N., Kelley, S., and Vincent, F. 1995. Nearest neighbor queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 71--79. Google ScholarDigital Library
Shamos, M. I. and Hoey, D. 1975. Closest-point problems. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science (FOCS'75). 151--162. Google ScholarDigital Library
Tao, Y., Yi, K., Sheng, C., and Kalnis, P. 2009. Quality and efficiency in high dimensional nearest neighbor search. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 563--576. Google ScholarDigital Library
Weber, R., Schek, H.-J., and Blott, S. 1998. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the International Conference on Very Large Databases (VLDB'98). 194--205. Google ScholarDigital Library
Wong, R. C.-W., Tao, Y., Fu, A. W.-C., and Xiao, X. 2007. On efficient spatial matching. In Proceedings of the International Conference on Very Large Databases (VLDB'07). 579--590. Google ScholarDigital Library

Index Terms

Efficient and accurate nearest neighbor and closest pair search in high-dimensional space
1. Information systems
  1. Data management systems
    1. Data structures
      1. Data access methods
  2. Information retrieval

Recommendations

Quality and efficiency in high dimensional nearest neighbor search
SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

Nearest neighbor (NN) search in high dimensional space is an important problem in many applications. Ideally, a practical solution (i) should be implementable in a relational database, and (ii) its query cost should grow sub-linearly with the dataset ...
Read More
Confirmation Sampling for Exact Nearest Neighbor Search
Similarity Search and Applications
Abstract
Locality-sensitive hashing (LSH), introduced by Indyk and Motwani in STOC ’98, has been an extremely influential framework for nearest neighbor search in high-dimensional data sets. While theoretical work has focused on the approximate nearest ...
Read More
A Fast Approximate Nearest Neighbor Search Algorithm in the Hamming Space

A fast approximate nearest neighbor search algorithm for the (binary) Hamming space is proposed. The proposed Error Weighted Hashing (EWH) algorithm is up to 20 times faster than the popular locality sensitive hashing (LSH) algorithm and works well even ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Database Systems Volume 35, Issue 3
July 2010
311 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/1806907
Issue’s Table of Contents

Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 30 July 2010
- Accepted: 1 February 2010
- Received: 1 September 2009
Published in tods Volume 35, Issue 3

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Locality-sensitive hashing
closest pair search
nearest neighbor search
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 80
  Total Citations
  View Citations
- 1,454
  Total Downloads
- Downloads (Last 12 months)37
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Efficient and accurate nearest neighbor and closest pair search in high-dimensional space

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

Quality and efficiency in high dimensional nearest neighbor search

Confirmation Sampling for Exact Nearest Neighbor Search

A Fast Approximate Nearest Neighbor Search Algorithm in the Hamming Space

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Efficient and accurate nearest neighbor and closest pair search in high-dimensional space

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

Quality and efficiency in high dimensional nearest neighbor search

Confirmation Sampling for Exact Nearest Neighbor Search

A Fast Approximate Nearest Neighbor Search Algorithm in the Hamming Space

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media