skip to main content
research-article
Free Access

Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions

Published:01 January 2008Publication History
Skip Abstract Section

Abstract

In this article, we give an overview of efficient algorithms for the approximate and exact nearest neighbor problem. The goal is to preprocess a dataset of objects (e.g., images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. The problem is of significant interest in a wide variety of areas.

Skip Supplemental Material Section

Supplemental Material

References

  1. 1. Ailon, N. and Chazelle, B. 2006. Approximate nearest neighbors and the Fast Johnson-Lindenstrauss Transform. In Proceedings of the Symposium on Theory of Computing. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Andoni, A. and Indyk, P. 2004. E2lsh: Exact Euclidean localitysensitive hashing. http://web.mit.edu/andoni/www/LSH/.Google ScholarGoogle Scholar
  3. Andoni, A. and Indyk, P. 2006. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proceedings of the Symposium on Foundations of Computer Science. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Andoni, A. and Indyk, P. 2006. Efficient algorithms for substring near neighbor problem. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. 1203--1212. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Arya, S., Mount, D. M., Netanyahu, N. S., Silverman, R., and Wu, A. 1994. An optimal algorithm for approximate nearest neighbor searching. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. 573--582. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Bentley, J. L. 1975. Multidimensional binary search trees used for associative searching. Comm. ACM 18, 509--517. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Broder, A., Charikar, M., Frieze, A., and Mitzenmacher, M. 1998. Min-wise independent permutations. J. Comput. Sys. Sci. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Broder, A., Glassman, S., Manasse, M., and Zweig, G. 1997. Syntactic clustering of the web. In Proceedings of the 6th International World Wide Web Conference. 391--404. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Broder, A. 1997. On the resemblance and containment of documents. In Proceedings of Compression and Complexity of Sequences. 21--29. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Buhler, J. 2001. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinform. 17, 419--428.Google ScholarGoogle ScholarCross RefCross Ref
  11. Buhler, J. and Tompa, M. 2001. Finding motifs using random projections. In Proceedings of the Annual International Conference on Computational Molecular Biology (RECOMB1). Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Califano, A. and Rigoutsos, I. 1993. Flash: A fast look-up algorithm for string homology. In Proceedings of the IEE Conference on Computer Vision and Pattern Recognition (CVPR).Google ScholarGoogle Scholar
  13. Chakrabarti, A. and Regev, O. 2004. An optimal randomised cell probe lower bounds for approximate nearest neighbor searching. In Proceedings of the Symposium on Foundations of Computer Science. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Charikar, M. 2002. Similarity estimation techniques from rounding. In Proceedings of the Symposium on Theory of Computing. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Charikar, M., Chekuri, C., Goel, A., Guha, S., and Plotkin, S. 1998. Approximating a finite metric by a small number of tree metrics. In Proceedings of the Symposium on Foundations of Computer Science. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. 2001. Introduct. Algorithms. 2nd Ed. MIT Press.Google ScholarGoogle Scholar
  17. Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. 2004. Localitysensitive hashing scheme based on p-stable distributions. In Proceedings of the ACM Symposium on Computational Geometry. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Dutta, D., Guha, R., Jurs, C., and Chen, T. 2006. Scalable partitioning and exploration of chemical spaces using geometric hashing. J. Chem. Inf. Model. 46.Google ScholarGoogle ScholarCross RefCross Ref
  19. Gionis, A., Indyk, P., and Motwani, R. 1999. Similarity search in high dimensions via hashing. In Proceedings of the International Conference on Very Large Databases. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Goemans, M. and Williamson, D. 1995. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM 42. 1115--1145. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Greene, D., Parnas, M., and Yao, F. 1994. Multi-index hashing for information retrieval. In Proceedings of the Symposium on Foundations of Computer Science. 722--731.Google ScholarGoogle Scholar
  22. Har-Peled, S. 2001. A replacement for voronoi diagrams of near linear size. In Proceedings of the Symposium on Foundations of Computer Science. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Haveliwala, T., Gionis, A., and Indyk, P. 2000. Scalable techniques for clustering the web. WebDB Workshop.Google ScholarGoogle Scholar
  24. Indyk, P. 2003. Nearest neighbors in high-dimensional spaces. In Handbook of Discrete and Computational Geometry. CRC Press.Google ScholarGoogle Scholar
  25. Indyk, P. and Motwani, R. 1998. Approximate nearest neighbor: Towards removing the curse of dimensionality. In Proceedings of the Symposium on Theory of Computing. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Karp, R. M., Waarts, O., and Zweig, G. 1995. The bit vector intersection problem. In Proceedings of the Symposium on Foundations of Computer Science. pages 621--630. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Kleinberg, J. 1997. Two algorithms for nearest-neighbor search in high dimensions. In Proceedings of the Symposium on Theory of Computing. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Krauthgamer, R. and Lee, J. R. 2004. Navigating nets: Simple algorithms for proximity search. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Kushilevitz, E., Ostrovsky, R., and Rabani, Y. 1998. Efficient search for approximate nearest neighbor in high dimensional spaces. In Proceedings of the Symposium on Theory of Computing. 614--623. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Linial, N., London, E., and Rabinovich, Y. 1994. The geometry of graphs and some of its algorithmic applications. In Proceedings of the Symposium on Foundations of Computer Science. 577--591.Google ScholarGoogle Scholar
  31. Motwani, R., Naor, A., and Panigrahy, R. 2006. Lower bounds on locality sensitive hashing. In Proceedings of the ACM Symposium on Computational Geometry. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Panigrahy, R. 2006. Entropy-based nearest neighbor algorithm in high dimensions. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Paturi, R., Rajasekaran, S., and Reif, J.The light bulb problem. Inform. Comput. 117, 2, 187--192. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Ravichandran, D., Pantel, P., and Hovy, E. 2005. Randomized algorithms and nlp: Using locality sensitive hash functions for high speed noun clustering. In Proceedings of the Annual Meeting of the Association of Computational Linguistics. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Samet, H. 2006. Foundations of Multidimensional and Metric Data Structures. Elsevier, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Shakhnarovich, G., Darrell, T., and Indyk, P. Eds. Nearest Neighbor Methods in Learning and Vision. Neural Processing Information Series, MIT Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Terasawa, T. and Tanaka, Y. 2007. Spherical lsh for approximate nearest neighbor search on unit hypersphere. In Proceedings of the Workshop on Algorithms and Data Structures. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions

            Recommendations

            Reviews

            Amos O Olagunju

            Computerized document, image, and video retrieval systems require efficient storage and algorithms for fast access to objects. Designing efficient algorithms for constructing data structures that support fast retrieval of objects in a high-dimensional Euclidean space raises interesting questions. How should an effective data structure be constructed for a set of object points to support the retrieval of the object point closest to a query__?__ How should features of objects be represented as points in high-dimensional spaces to generate distance metrics that measure the similarities of objects__?__ How should cataloging or similarity probing for query objects be implemented__?__ Diverse applications of assorted data prearranged as shades of points in high dimensions are available in the literature [1]. Andoni and Indyk present near-optimal locality-sensitive hashing (LSH) algorithms for solving approximate and exact nearest neighbor problems in high dimensions. The LSH scheme hashes object points using several hash functions to minimize the probabilities of collisions for objects far apart. The object query points are hashed to locate near neighbors, and the corresponding elements in stored buckets are retrieved. A data structure is created for a set of points in a high-dimensional space, in such a way that the algorithm reports?with a certainty probability?either some or all near neighbors of a queried point. The authors ingeniously design a family of LSH functions that produce a remarkable exponential query runtime. They recognize that it is clearly impossible to provide an assumed query runtime bound for a data structure capable of reporting all or many data points in close proximity of a query point. They insightfully critique the most viable data structures and LSH algorithms for solving the well-known approximate and exact nearest neighbor problems. Online Computing Reviews Service

            Access critical reviews of Computing literature here

            Become a reviewer for Computing Reviews.

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in

            Full Access

            • Published in

              cover image Communications of the ACM
              Communications of the ACM  Volume 51, Issue 1
              50th anniversary issue: 1958 - 2008
              January 2008
              106 pages
              ISSN:0001-0782
              EISSN:1557-7317
              DOI:10.1145/1327452
              Issue’s Table of Contents

              Copyright © 2008 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 1 January 2008

              Permissions

              Request permissions about this article.

              Request Permissions

              Check for updates

              Qualifiers

              • research-article
              • Popular
              • Refereed

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader