Abstract
Nearest Neighbor (NN) search in high-dimensional space is an important problem in many applications. From the database perspective, a good solution needs to have two properties: (i) it can be easily incorporated in a relational database, and (ii) its query cost should increase sublinearly with the dataset size, regardless of the data and query distributions. Locality-Sensitive Hashing (LSH) is a well-known methodology fulfilling both requirements, but its current implementations either incur expensive space and query cost, or abandon its theoretical guarantee on the quality of query results.
Motivated by this, we improve LSH by proposing an access method called the Locality-Sensitive B-tree (LSB-tree) to enable fast, accurate, high-dimensional NN search in relational databases. The combination of several LSB-trees forms a LSB-forest that has strong quality guarantees, but improves dramatically the efficiency of the previous LSH implementation having the same guarantees. In practice, the LSB-tree itself is also an effective index which consumes linear space, supports efficient updates, and provides accurate query results. In our experiments, the LSB-tree was faster than: (i) iDistance (a famous technique for exact NN search) by two orders of magnitude, and (ii) MedRank (a recent approximate method with nontrivial quality guarantees) by one order of magnitude, and meanwhile returned much better results.
As a second step, we extend our LSB technique to solve another classic problem, called Closest Pair (CP) search, in high-dimensional space. The long-term challenge for this problem has been to achieve subquadratic running time at very high dimensionalities, which fails most of the existing solutions. We show that, using a LSB-forest, CP search can be accomplished in (worst-case) time significantly lower than the quadratic complexity, yet still ensuring very good quality. In practice, accurate answers can be found using just two LSB-trees, thus giving a substantial reduction in the space and running time. In our experiments, our technique was faster: (i) than distance browsing (a well-known method for solving the problem exactly) by several orders of magnitude, and (ii) than D-shift (an approximate approach with theoretical guarantees in low-dimensional space) by one order of magnitude, and at the same time, outputs better results.
- Andoni, A. and Indyk, P. 2006. Near-Optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science (FOCS'06). 459--468. Google ScholarDigital Library
- Arya, S., Mount, D. M., Netanyahu, N. S., Silverman, R., and Wu, A. Y. 1998. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. J. ACM 45, 6, 891--923. Google ScholarDigital Library
- Athitsos, V., Potamias, M., Papapetrou, P., and Kollios, G. 2008. Nearest neighbor retrieval using distance-based hashing. In Proceedings of the International Conference on Data Engineering (ICDE'08). 327--336. Google ScholarDigital Library
- Bawa, M., Condie, T., and Ganesan, P. 2005. Lsh forest: Self-Tuning indexes for similarity search. In Proceedings of the International World Wide Web Conference (WWW'05). 651--660. Google ScholarDigital Library
- Beckmann, N., Kriegel, H., Schneider, R., and Seeger, B. 1990. The R*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 322--331. Google ScholarDigital Library
- Bennett, K. P., Fayyad, U., and Geiger, D. 1999. Density-Based indexing for approximate nearest-neighbor queries. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 233--243. Google ScholarDigital Library
- Berchtold, S., Bohm, C., Jagadish, H. V., Kriegel, H.-P., and Sander, J. 2000. Independent quantization: An index compression technique for high-dimensional data spaces. In Proceedings of the International Conference on Data Engineering (ICDE'00). 577--588. Google ScholarDigital Library
- Berchtold, S., Keim, D. A., Kriegel, H.-P., and Seidl, T. 2000. Indexing the solution space: A new technique for nearest neighbor search in high-dimensional space. IEEE Trans. Knowl. Data Engin. 12, 1, 45--57. Google ScholarDigital Library
- Beyer, K. S., Goldstein, J., Ramakrishnan, R., and Shaft, U. 1999. When is “nearest neighbor” meaningful? In Proceedings of the International Conference on Database Theory (ICDT'99). 217--235. Google ScholarDigital Library
- Bohm, C. 2000. A cost model for query processing in high dimensional data spaces. ACM Trans. Datab. Syst. 25, 2, 129--178. Google ScholarDigital Library
- Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. 2000. Lof: Identifying density-based local outliers. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 93--104. Google ScholarDigital Library
- Charikar, M. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC'02). Google ScholarDigital Library
- Chaudhuri, S. and Gravano, L. 1999. Evaluating top-k selection queries. In Proceedings of the International Conference on Very Large Databases (VLDB'99). 397--410. Google ScholarDigital Library
- Chen, C.-M. and Ling, Y. 2002. A sampling-based estimator for top-k query. In Proceedings of the Interational Conference on Data Engineering (ICDE'02). 617--627. Google ScholarDigital Library
- Ciaccia, P. and Patella, M. 2000. Pac nearest neighbor queries: Approximate and controlled search in high-dimensional and metric spaces. In Proceedings of the International Conference on Data Engineering (ICDE'00). 244--255. Google ScholarDigital Library
- Corral, A., Manolopoulos, Y., Theodoridis, Y., and Vassilakopoulos, M. 2000. Closest pair queries in spatial databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 189--200. Google ScholarDigital Library
- Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. 2004. Locality-Sensitive hashing scheme based on p-stable distributions. In Proceedings of the Annual ACM Symposium on Computational Geometry (SoCG'04). 253--262. Google ScholarDigital Library
- Fagin, R., Kumar, R., and Sivakumar, D. 2003. Efficient similarity search and classification via rank aggregation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 301--312. Google ScholarDigital Library
- Fagin, R., Lotem, A., and Naor, M. 2001. Optimal aggregation algorithms for middleware. In Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'01). 102--113. Google ScholarDigital Library
- Ferhatosmanoglu, H., Tuncel, E., Agrawal, D., and Abbadi, A. E. 2001. Approximate nearest neighbor searching in multimedia databases. In Proceedings of the International Conference on Data Engineering (ICDE'01). 503--511. Google ScholarDigital Library
- Ferragina, P. and Grossi, R. 1999. The String B-tree: A new data structure for string search in external memory and its applications. J. ACM 46, 2, 236--280. Google ScholarDigital Library
- Gaede, V. and Gunther, O. 1998. Multidimensional access methods. ACM Comput. Surv. 30, 2, 170--231. Google ScholarDigital Library
- Gionis, A., Indyk, P., and Motwani, R. 1999. Similarity search in high dimensions via hashing. In Proceedings of the International Conference on Very Large Databases (VLDB'99). 518--529. Google ScholarDigital Library
- Goldstein, J. and Ramakrishnan, R. 2000. Contrast plots and p-sphere trees: Space vs. time in nearest neighbour searches. In Proceedings of the International Conference on Very Large Databases (VLDB'00). 429--440. Google ScholarDigital Library
- Guttman, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 47--57. Google ScholarDigital Library
- Har-Peled, S. 2001. A replacement for voronoi diagrams of near linear size. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science (FOCS'01). 94--103. Google ScholarDigital Library
- Hjaltason, G. R. and Samet, H. 1998. Incremental distance join algorithms for spatial databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 237--248. Google ScholarDigital Library
- Hjaltason, G. R. and Samet, H. 1999. Distance browsing in spatial databases. ACM Trans. Datab. Syst. 24, 2, 265--318. Google ScholarDigital Library
- Houle, M. E. and Sakuma, J. 2005. Fast approximate similarity search in extremely high- dimensional data sets. In Proceedings of the International Conference on Data Engineering (ICDE'05). 619--630. Google ScholarDigital Library
- Indyk, P., Lewenstein, M., Lipsky, O., and Porat, E. 2004. Closest pair problems in very high dimensions. In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP'04). 782--792.Google Scholar
- Indyk, P. and Motwani, R. 1998. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC'98). 604--613. Google ScholarDigital Library
- Jagadish, H. V., Ooi, B. C., Tan, K.-L., Yu, C., and Zhang, R. 2005. idistance: An adaptive b+-tree based indexing method for nearest neighbor search. ACM Trans. Datab. Syst. 30, 2, 364--397. Google ScholarDigital Library
- Kleinberg, J. M. 1997. Two algorithms for nearest-neighbor search in high dimensions. In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC'97). 599--608. Google ScholarDigital Library
- Korn, F., Pagel, B.-U., and Faloutsos, C. 2001. On the 'dimensionality curse' and the 'self-similarity blessing'. IEEE Trans. Knowl. Data Engin. 13, 1, 96--111. Google ScholarDigital Library
- Koudas, N., Ooi, B. C., Shen, H. T., and Tung, A. K. H. 2004. Ldc: Enabling search by partial distance in a hyper-dimensional space. In Proceedings of the International Conference on Data Engineering (ICDE'04). 6--17. Google ScholarDigital Library
- Krauthgamer, R. and Lee, J. R. 2004. Navigating nets: Simple algorithms for proximity search. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'04). 798--807. Google ScholarDigital Library
- Lenhof, H.-P. and Smid, M. 1992. Enumerating the k closest pairs optimally. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science (FOCS'92). 380--386. Google ScholarDigital Library
- Li, C., Chang, E. Y., Garcia-Molina, H., and Wiederhold, G. 2002. Clustering for approximate similarity search in high-dimensional spaces. IEEE Trans. Knowl. Data Engin. 14, 4, 792--808. Google ScholarDigital Library
- Lin, K.-I., Jagadish, H. V., and Faloutsos, C. 1994. The tv-tree: An index structure for high-dimensional data. VLDB J. 3, 4, 517--542. Google ScholarDigital Library
- Lopez, M. A. and Liao, S. 2000. Finding k-closest-pairs efficiently for high dimensional data. In Proceedings of the Canadian Conference on Computational Geometry (CCCG'00). 197--204.Google Scholar
- Lv, Q., Josephson, W., Wang, Z., Charikar, M., and Li, K. 2007. Multi-Probe lsh: Efficient indexing for high-dimensional similarity search. In Proceedings of the International Conference on Very Large Databases (VLDB'07). 950--961. Google ScholarDigital Library
- Panigrahy, R. 2006. Entropy based nearest neighbor search in high dimensions. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'06). 1186--1195. Google ScholarDigital Library
- Roussopoulos, N., Kelley, S., and Vincent, F. 1995. Nearest neighbor queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 71--79. Google ScholarDigital Library
- Shamos, M. I. and Hoey, D. 1975. Closest-point problems. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science (FOCS'75). 151--162. Google ScholarDigital Library
- Tao, Y., Yi, K., Sheng, C., and Kalnis, P. 2009. Quality and efficiency in high dimensional nearest neighbor search. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 563--576. Google ScholarDigital Library
- Weber, R., Schek, H.-J., and Blott, S. 1998. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the International Conference on Very Large Databases (VLDB'98). 194--205. Google ScholarDigital Library
- Wong, R. C.-W., Tao, Y., Fu, A. W.-C., and Xiao, X. 2007. On efficient spatial matching. In Proceedings of the International Conference on Very Large Databases (VLDB'07). 579--590. Google ScholarDigital Library
Index Terms
- Efficient and accurate nearest neighbor and closest pair search in high-dimensional space
Recommendations
Quality and efficiency in high dimensional nearest neighbor search
SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of dataNearest neighbor (NN) search in high dimensional space is an important problem in many applications. Ideally, a practical solution (i) should be implementable in a relational database, and (ii) its query cost should grow sub-linearly with the dataset ...
Confirmation Sampling for Exact Nearest Neighbor Search
Similarity Search and ApplicationsAbstractLocality-sensitive hashing (LSH), introduced by Indyk and Motwani in STOC ’98, has been an extremely influential framework for nearest neighbor search in high-dimensional data sets. While theoretical work has focused on the approximate nearest ...
A Fast Approximate Nearest Neighbor Search Algorithm in the Hamming Space
A fast approximate nearest neighbor search algorithm for the (binary) Hamming space is proposed. The proposed Error Weighted Hashing (EWH) algorithm is up to 20 times faster than the popular locality sensitive hashing (LSH) algorithm and works well even ...
Comments