skip to main content
research-article

Efficient and accurate nearest neighbor and closest pair search in high-dimensional space

Authors Info & Claims
Published:30 July 2010Publication History
Skip Abstract Section

Abstract

Nearest Neighbor (NN) search in high-dimensional space is an important problem in many applications. From the database perspective, a good solution needs to have two properties: (i) it can be easily incorporated in a relational database, and (ii) its query cost should increase sublinearly with the dataset size, regardless of the data and query distributions. Locality-Sensitive Hashing (LSH) is a well-known methodology fulfilling both requirements, but its current implementations either incur expensive space and query cost, or abandon its theoretical guarantee on the quality of query results.

Motivated by this, we improve LSH by proposing an access method called the Locality-Sensitive B-tree (LSB-tree) to enable fast, accurate, high-dimensional NN search in relational databases. The combination of several LSB-trees forms a LSB-forest that has strong quality guarantees, but improves dramatically the efficiency of the previous LSH implementation having the same guarantees. In practice, the LSB-tree itself is also an effective index which consumes linear space, supports efficient updates, and provides accurate query results. In our experiments, the LSB-tree was faster than: (i) iDistance (a famous technique for exact NN search) by two orders of magnitude, and (ii) MedRank (a recent approximate method with nontrivial quality guarantees) by one order of magnitude, and meanwhile returned much better results.

As a second step, we extend our LSB technique to solve another classic problem, called Closest Pair (CP) search, in high-dimensional space. The long-term challenge for this problem has been to achieve subquadratic running time at very high dimensionalities, which fails most of the existing solutions. We show that, using a LSB-forest, CP search can be accomplished in (worst-case) time significantly lower than the quadratic complexity, yet still ensuring very good quality. In practice, accurate answers can be found using just two LSB-trees, thus giving a substantial reduction in the space and running time. In our experiments, our technique was faster: (i) than distance browsing (a well-known method for solving the problem exactly) by several orders of magnitude, and (ii) than D-shift (an approximate approach with theoretical guarantees in low-dimensional space) by one order of magnitude, and at the same time, outputs better results.

References

  1. Andoni, A. and Indyk, P. 2006. Near-Optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science (FOCS'06). 459--468. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Arya, S., Mount, D. M., Netanyahu, N. S., Silverman, R., and Wu, A. Y. 1998. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. J. ACM 45, 6, 891--923. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Athitsos, V., Potamias, M., Papapetrou, P., and Kollios, G. 2008. Nearest neighbor retrieval using distance-based hashing. In Proceedings of the International Conference on Data Engineering (ICDE'08). 327--336. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Bawa, M., Condie, T., and Ganesan, P. 2005. Lsh forest: Self-Tuning indexes for similarity search. In Proceedings of the International World Wide Web Conference (WWW'05). 651--660. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Beckmann, N., Kriegel, H., Schneider, R., and Seeger, B. 1990. The R*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 322--331. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Bennett, K. P., Fayyad, U., and Geiger, D. 1999. Density-Based indexing for approximate nearest-neighbor queries. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 233--243. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Berchtold, S., Bohm, C., Jagadish, H. V., Kriegel, H.-P., and Sander, J. 2000. Independent quantization: An index compression technique for high-dimensional data spaces. In Proceedings of the International Conference on Data Engineering (ICDE'00). 577--588. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Berchtold, S., Keim, D. A., Kriegel, H.-P., and Seidl, T. 2000. Indexing the solution space: A new technique for nearest neighbor search in high-dimensional space. IEEE Trans. Knowl. Data Engin. 12, 1, 45--57. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Beyer, K. S., Goldstein, J., Ramakrishnan, R., and Shaft, U. 1999. When is “nearest neighbor” meaningful? In Proceedings of the International Conference on Database Theory (ICDT'99). 217--235. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Bohm, C. 2000. A cost model for query processing in high dimensional data spaces. ACM Trans. Datab. Syst. 25, 2, 129--178. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. 2000. Lof: Identifying density-based local outliers. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 93--104. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Charikar, M. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC'02). Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Chaudhuri, S. and Gravano, L. 1999. Evaluating top-k selection queries. In Proceedings of the International Conference on Very Large Databases (VLDB'99). 397--410. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Chen, C.-M. and Ling, Y. 2002. A sampling-based estimator for top-k query. In Proceedings of the Interational Conference on Data Engineering (ICDE'02). 617--627. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Ciaccia, P. and Patella, M. 2000. Pac nearest neighbor queries: Approximate and controlled search in high-dimensional and metric spaces. In Proceedings of the International Conference on Data Engineering (ICDE'00). 244--255. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Corral, A., Manolopoulos, Y., Theodoridis, Y., and Vassilakopoulos, M. 2000. Closest pair queries in spatial databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 189--200. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. 2004. Locality-Sensitive hashing scheme based on p-stable distributions. In Proceedings of the Annual ACM Symposium on Computational Geometry (SoCG'04). 253--262. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Fagin, R., Kumar, R., and Sivakumar, D. 2003. Efficient similarity search and classification via rank aggregation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 301--312. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Fagin, R., Lotem, A., and Naor, M. 2001. Optimal aggregation algorithms for middleware. In Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'01). 102--113. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Ferhatosmanoglu, H., Tuncel, E., Agrawal, D., and Abbadi, A. E. 2001. Approximate nearest neighbor searching in multimedia databases. In Proceedings of the International Conference on Data Engineering (ICDE'01). 503--511. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Ferragina, P. and Grossi, R. 1999. The String B-tree: A new data structure for string search in external memory and its applications. J. ACM 46, 2, 236--280. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Gaede, V. and Gunther, O. 1998. Multidimensional access methods. ACM Comput. Surv. 30, 2, 170--231. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Gionis, A., Indyk, P., and Motwani, R. 1999. Similarity search in high dimensions via hashing. In Proceedings of the International Conference on Very Large Databases (VLDB'99). 518--529. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Goldstein, J. and Ramakrishnan, R. 2000. Contrast plots and p-sphere trees: Space vs. time in nearest neighbour searches. In Proceedings of the International Conference on Very Large Databases (VLDB'00). 429--440. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Guttman, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 47--57. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Har-Peled, S. 2001. A replacement for voronoi diagrams of near linear size. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science (FOCS'01). 94--103. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Hjaltason, G. R. and Samet, H. 1998. Incremental distance join algorithms for spatial databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 237--248. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Hjaltason, G. R. and Samet, H. 1999. Distance browsing in spatial databases. ACM Trans. Datab. Syst. 24, 2, 265--318. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Houle, M. E. and Sakuma, J. 2005. Fast approximate similarity search in extremely high- dimensional data sets. In Proceedings of the International Conference on Data Engineering (ICDE'05). 619--630. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Indyk, P., Lewenstein, M., Lipsky, O., and Porat, E. 2004. Closest pair problems in very high dimensions. In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP'04). 782--792.Google ScholarGoogle Scholar
  31. Indyk, P. and Motwani, R. 1998. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC'98). 604--613. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Jagadish, H. V., Ooi, B. C., Tan, K.-L., Yu, C., and Zhang, R. 2005. idistance: An adaptive b+-tree based indexing method for nearest neighbor search. ACM Trans. Datab. Syst. 30, 2, 364--397. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Kleinberg, J. M. 1997. Two algorithms for nearest-neighbor search in high dimensions. In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC'97). 599--608. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Korn, F., Pagel, B.-U., and Faloutsos, C. 2001. On the 'dimensionality curse' and the 'self-similarity blessing'. IEEE Trans. Knowl. Data Engin. 13, 1, 96--111. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Koudas, N., Ooi, B. C., Shen, H. T., and Tung, A. K. H. 2004. Ldc: Enabling search by partial distance in a hyper-dimensional space. In Proceedings of the International Conference on Data Engineering (ICDE'04). 6--17. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Krauthgamer, R. and Lee, J. R. 2004. Navigating nets: Simple algorithms for proximity search. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'04). 798--807. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Lenhof, H.-P. and Smid, M. 1992. Enumerating the k closest pairs optimally. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science (FOCS'92). 380--386. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Li, C., Chang, E. Y., Garcia-Molina, H., and Wiederhold, G. 2002. Clustering for approximate similarity search in high-dimensional spaces. IEEE Trans. Knowl. Data Engin. 14, 4, 792--808. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Lin, K.-I., Jagadish, H. V., and Faloutsos, C. 1994. The tv-tree: An index structure for high-dimensional data. VLDB J. 3, 4, 517--542. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Lopez, M. A. and Liao, S. 2000. Finding k-closest-pairs efficiently for high dimensional data. In Proceedings of the Canadian Conference on Computational Geometry (CCCG'00). 197--204.Google ScholarGoogle Scholar
  41. Lv, Q., Josephson, W., Wang, Z., Charikar, M., and Li, K. 2007. Multi-Probe lsh: Efficient indexing for high-dimensional similarity search. In Proceedings of the International Conference on Very Large Databases (VLDB'07). 950--961. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Panigrahy, R. 2006. Entropy based nearest neighbor search in high dimensions. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'06). 1186--1195. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Roussopoulos, N., Kelley, S., and Vincent, F. 1995. Nearest neighbor queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 71--79. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Shamos, M. I. and Hoey, D. 1975. Closest-point problems. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science (FOCS'75). 151--162. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Tao, Y., Yi, K., Sheng, C., and Kalnis, P. 2009. Quality and efficiency in high dimensional nearest neighbor search. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 563--576. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Weber, R., Schek, H.-J., and Blott, S. 1998. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the International Conference on Very Large Databases (VLDB'98). 194--205. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Wong, R. C.-W., Tao, Y., Fu, A. W.-C., and Xiao, X. 2007. On efficient spatial matching. In Proceedings of the International Conference on Very Large Databases (VLDB'07). 579--590. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Efficient and accurate nearest neighbor and closest pair search in high-dimensional space

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM Transactions on Database Systems
        ACM Transactions on Database Systems  Volume 35, Issue 3
        July 2010
        311 pages
        ISSN:0362-5915
        EISSN:1557-4644
        DOI:10.1145/1806907
        Issue’s Table of Contents

        Copyright © 2010 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 30 July 2010
        • Accepted: 1 February 2010
        • Received: 1 September 2009
        Published in tods Volume 35, Issue 3

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article
        • Research
        • Refereed

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader