Abstract
In this article, we present an efficient B+-tree based indexing method, called iDistance, for K-nearest neighbor (KNN) search in a high-dimensional metric space. iDistance partitions the data based on a space- or data-partitioning strategy, and selects a reference point for each partition. The data points in each partition are transformed into a single dimensional value based on their similarity with respect to the reference point. This allows the points to be indexed using a B+-tree structure and KNN search to be performed using one-dimensional range search. The choice of partition and reference points adapts the index structure to the data distribution.We conducted extensive experiments to evaluate the iDistance technique, and report results demonstrating its effectiveness. We also present a cost model for iDistance KNN search, which can be exploited in query optimization.
- Aggarwal, C., Procopiuc, C., Wolf, J., Yu, P., and Park, J. 1999. Fast algorithm for projected clustering. In Proceedings of the ACM SIGMOD International Conference on Management of Data.]] Google ScholarDigital Library
- Arya, S., Mount, D., Netanyahu, N., Silverman, R., and Wu, A. 1994. An optimal algorithm for approximate nearest neighbor searching. In Proceedings of the Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, 573--582.]] Google ScholarDigital Library
- Arya, S., Mount, D., Netanyahu, N., Silverman, R., and Wu, A. 1998. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. J. ACM 45, 6, 891--923.]] Google ScholarDigital Library
- Berchtold, S., Böhm, C., Jagadish, H., Kriegel, H., and Sander, J. 2000. Independent quantization: An index compression technique for high-dimensional data spaces. In Proceedings of the International Conference on Data Engineering. 577--588.]] Google ScholarDigital Library
- Berchtold, S., Böhm, C., and Kriegel, H.-P. 1998a. The pyramid-technique: Towards breaking the curse of dimensionality. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 142--153.]] Google ScholarDigital Library
- Berchtold, S., Ertl, B., Keim, D., Kriegel, H.-P., and Seidl, T. 1998b. Fast nearest neighbor search in high-dimensional space. In Proceedings of the International Conference on Data Engineering. 209--218.]] Google ScholarDigital Library
- Berchtold, S., Keim, D., and Kriegel, H. 1996. The X-tree: An index structure for high-dimensional data. In Proceedings of the International Conference on Very Large Data Bases. 28--37.]] Google ScholarDigital Library
- Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. 1999. When is nearest neighbors meaningful? In Proceedings of the International Conference on Database Theory.]] Google ScholarDigital Library
- Böhm, C., Berchtold, S., and Keim, D. 2001. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Comput. Surv. 33, 3, 322--373.]] Google ScholarDigital Library
- Bozkaya, T. and Ozsoyoglu, M. 1997. Distance-based indexing for high-dimensional metric spaces. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 357--368.]] Google ScholarDigital Library
- Chakrabarti, K. and Mehrotra, S. 1999. The hybrid tree: An index structure for high dimensional feature spaces. In Proceedings of the International Conference on Data Engineering. 322--331.]] Google ScholarDigital Library
- Chakrabarti, K. and Mehrotra, S. 2000. Local dimensionality reduction: a new approach to indexing high dimensional spaces. In Proceedings of the International Conference on Very Large Databases. 89--100.]] Google ScholarDigital Library
- Ciaccia, P., Patella, M., and Zezula, P. 1997. M-trees: An efficient access method for similarity search in metric space. In Proceedings of the International Conference on Very Large Data Bases. 426--435.]] Google ScholarDigital Library
- Cui, B., Ooi, B. C., Su, J. W., and Tan, K. L. 2003. Contorting high dimensional data for efficient main memory processing. In Proceedings of the ACM SIGMOD Conference. 479--490.]] Google ScholarDigital Library
- Cui, B., Ooi, B. C., Su, J. W., and Tan, K. L. 2004. Indexing high-dimensional data for efficient in-memory similarity search. In IEEE Trans. Knowl. Data Eng. to appear.]] Google ScholarDigital Library
- Faloutsos, C. and Lin, K.-I. 1995. Fastmap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 163--174.]] Google ScholarDigital Library
- Filho, R. F. S., Traina, A., and Faloutsos, C. 2001. Similarity search without tears: The omni family of all-purpose access methods. In Proceedings of the International Conference on Data Engineering. 623--630.]] Google ScholarDigital Library
- Goldstein, J. and Ramakrishnan, R. 2000. Contrast plots and p-sphere trees: space vs. time in nearest neighbor searches. In Proceedings of the International Conference on Very Large Databases. 429--440.]] Google ScholarDigital Library
- Guha, S., Rastogi, R., and Shim, K. 1998. Cure: an efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data.]] Google ScholarDigital Library
- Guttman, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 47--57.]] Google ScholarDigital Library
- Jagadish, H., Ooi, B. C., Tan, K.-L., Yu, C., and Zhang, R. 2004. iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. Tech. Rep. www.comp.nus.edu.sg/~ooibc, National University of Singapore.]]Google Scholar
- Jolliffe, I. T. 1986. Principle Component Analysis. Springer-Verlag.]]Google Scholar
- Katamaya, N. and Satoh, S. 1997. The SR-tree: An index structure for high-dimensional nearest neighbor queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data.]] Google ScholarDigital Library
- Koudas, N., Ooi, B. C., Tan, K.-L., and Zhang, R. 2004. Approximate NN queries on streams with guaranteed error/performance bounds. In Proceedings of the International Conference on Very Large Data Bases. 804--815.]] Google ScholarDigital Library
- Kruskal, J. B. 1956. On the shortest spanning subtree of a graph and the travelling salesman problem. In Proceedings of the American Mathematical Society 7, 48--50.]]Google ScholarCross Ref
- Lin, K., Jagadish, H., and Faloutsos, C. 1995. The TV-tree: An index structure for high-dimensional data. VLDB Journal 3, 4, 517--542.]] Google ScholarDigital Library
- MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Fifth Berkeley Symposium on Mathematical statistics and probability. University of California Press, 281--297.]]Google Scholar
- Ooi, B. C., Tan, K. L., Yu, C., and Bressan, S. 2000. Indexing the edge: a simple and yet efficient approach to high-dimensional indexing. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. 166--174.]] Google ScholarDigital Library
- Pagel, B.-U., Korn, F., and Faloutsos, C. 2000. Deflating the dimensionality curse using multiple fractal dimensions. In Proceedings of the International Conference on Data Engineering.]] Google ScholarDigital Library
- Ramakrishnan, R. and Gehrke, J. 2000. Database Management Systems. McGraw-Hill.]] Google ScholarDigital Library
- Sakurai, Y., Yoshikawa, M., and Uemura, S. 2000. The a-tree: An index structure for high-dimensional spaces using relative approximation. In Proceedings of the International Conference on Very Large Data Bases. 516--526.]] Google ScholarDigital Library
- Tao, Y., Faloutsos, C., and Papadias, D. 2003. The power-method: A comprehensive estimation technique for multi-dimensional queries. In Proceedings of the Conference on Information and Knowledge Management.]] Google ScholarDigital Library
- Traina, A., Seeger, B., and Faloutsos, C. 2000. Slim-trees: high performance metric trees minimizing overlap between nodes. In Advances in Database Technology---EDBT 2000, International Conference on Extending Database Technology, Konstanz, Germany, March 27--31, 2000, Proceedings. Lecture Notes in Computer Science, vol. 1777. Springer-Verlag, 51--65.]] Google ScholarDigital Library
- Weber, R., Schek, H., and Blott, S. 1998. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the International Conference on Very Large Data Bases. 194--205.]] Google ScholarDigital Library
- White, D. and Jain, R. 1996. Similarity indexing with the SS-tree. In Proceedings of the International Conference on Data Engineering. 516--523.]] Google ScholarDigital Library
- Yu, C., Ooi, B. C., Tan, K. L., and Jagadish, H. 2001. Indexing the distance: an efficient method to knn processing. In Proceedings of the International Conference on Very Large Data Bases. 421--430.]] Google ScholarDigital Library
- Zhang, T., Ramakrishnan, R., and Livny, M. 1996. Birch: an efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data.]] Google ScholarDigital Library
Index Terms
- iDistance: An adaptive B+-tree based indexing method for nearest neighbor search
Recommendations
A comprehensive study of idistance partitioning strategies for kNN queries and high-dimensional data indexing
BNCOD'13: Proceedings of the 29th British National conference on Big DataEfficient database indexing and information retrieval tasks such as k-nearest neighbor (kNN) search still remain difficult challenges in large-scale and high-dimensional data. In this work, we perform the first comprehensive analysis of different ...
Improving the Performance of High-Dimensional kNN Retrieval through Localized Dataspace Segmentation and Hybrid Indexing
ADBIS 2013: Proceedings of the 17th East European Conference on Advances in Databases and Information Systems - Volume 8133Efficient data indexing and nearest neighbor retrieval are challenging tasks in high-dimensional spaces. This work builds upon our previous analyses of iDistance partitioning strategies to develop the backbone of a new indexing method using a heuristic-...
Approximate Direct and Reverse Nearest Neighbor Queries, and the k-nearest Neighbor Graph
SISAP '09: Proceedings of the 2009 Second International Workshop on Similarity Search and ApplicationsRetrieving the \emph{k-nearest neighbors} of a query object is a basic primitive in similarity searching. A related, far less explored primitive is to obtain the dataset elements which would have the query object within their own \emph{k}-nearest ...
Comments