skip to main content
article

iDistance: An adaptive B+-tree based indexing method for nearest neighbor search

Published:01 June 2005Publication History
Skip Abstract Section

Abstract

In this article, we present an efficient B+-tree based indexing method, called iDistance, for K-nearest neighbor (KNN) search in a high-dimensional metric space. iDistance partitions the data based on a space- or data-partitioning strategy, and selects a reference point for each partition. The data points in each partition are transformed into a single dimensional value based on their similarity with respect to the reference point. This allows the points to be indexed using a B+-tree structure and KNN search to be performed using one-dimensional range search. The choice of partition and reference points adapts the index structure to the data distribution.We conducted extensive experiments to evaluate the iDistance technique, and report results demonstrating its effectiveness. We also present a cost model for iDistance KNN search, which can be exploited in query optimization.

References

  1. Aggarwal, C., Procopiuc, C., Wolf, J., Yu, P., and Park, J. 1999. Fast algorithm for projected clustering. In Proceedings of the ACM SIGMOD International Conference on Management of Data.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Arya, S., Mount, D., Netanyahu, N., Silverman, R., and Wu, A. 1994. An optimal algorithm for approximate nearest neighbor searching. In Proceedings of the Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, 573--582.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Arya, S., Mount, D., Netanyahu, N., Silverman, R., and Wu, A. 1998. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. J. ACM 45, 6, 891--923.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Berchtold, S., Böhm, C., Jagadish, H., Kriegel, H., and Sander, J. 2000. Independent quantization: An index compression technique for high-dimensional data spaces. In Proceedings of the International Conference on Data Engineering. 577--588.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Berchtold, S., Böhm, C., and Kriegel, H.-P. 1998a. The pyramid-technique: Towards breaking the curse of dimensionality. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 142--153.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Berchtold, S., Ertl, B., Keim, D., Kriegel, H.-P., and Seidl, T. 1998b. Fast nearest neighbor search in high-dimensional space. In Proceedings of the International Conference on Data Engineering. 209--218.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Berchtold, S., Keim, D., and Kriegel, H. 1996. The X-tree: An index structure for high-dimensional data. In Proceedings of the International Conference on Very Large Data Bases. 28--37.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. 1999. When is nearest neighbors meaningful? In Proceedings of the International Conference on Database Theory.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Böhm, C., Berchtold, S., and Keim, D. 2001. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Comput. Surv. 33, 3, 322--373.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Bozkaya, T. and Ozsoyoglu, M. 1997. Distance-based indexing for high-dimensional metric spaces. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 357--368.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Chakrabarti, K. and Mehrotra, S. 1999. The hybrid tree: An index structure for high dimensional feature spaces. In Proceedings of the International Conference on Data Engineering. 322--331.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Chakrabarti, K. and Mehrotra, S. 2000. Local dimensionality reduction: a new approach to indexing high dimensional spaces. In Proceedings of the International Conference on Very Large Databases. 89--100.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Ciaccia, P., Patella, M., and Zezula, P. 1997. M-trees: An efficient access method for similarity search in metric space. In Proceedings of the International Conference on Very Large Data Bases. 426--435.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Cui, B., Ooi, B. C., Su, J. W., and Tan, K. L. 2003. Contorting high dimensional data for efficient main memory processing. In Proceedings of the ACM SIGMOD Conference. 479--490.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Cui, B., Ooi, B. C., Su, J. W., and Tan, K. L. 2004. Indexing high-dimensional data for efficient in-memory similarity search. In IEEE Trans. Knowl. Data Eng. to appear.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Faloutsos, C. and Lin, K.-I. 1995. Fastmap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 163--174.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Filho, R. F. S., Traina, A., and Faloutsos, C. 2001. Similarity search without tears: The omni family of all-purpose access methods. In Proceedings of the International Conference on Data Engineering. 623--630.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Goldstein, J. and Ramakrishnan, R. 2000. Contrast plots and p-sphere trees: space vs. time in nearest neighbor searches. In Proceedings of the International Conference on Very Large Databases. 429--440.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Guha, S., Rastogi, R., and Shim, K. 1998. Cure: an efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Guttman, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 47--57.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Jagadish, H., Ooi, B. C., Tan, K.-L., Yu, C., and Zhang, R. 2004. iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. Tech. Rep. www.comp.nus.edu.sg/~ooibc, National University of Singapore.]]Google ScholarGoogle Scholar
  22. Jolliffe, I. T. 1986. Principle Component Analysis. Springer-Verlag.]]Google ScholarGoogle Scholar
  23. Katamaya, N. and Satoh, S. 1997. The SR-tree: An index structure for high-dimensional nearest neighbor queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Koudas, N., Ooi, B. C., Tan, K.-L., and Zhang, R. 2004. Approximate NN queries on streams with guaranteed error/performance bounds. In Proceedings of the International Conference on Very Large Data Bases. 804--815.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Kruskal, J. B. 1956. On the shortest spanning subtree of a graph and the travelling salesman problem. In Proceedings of the American Mathematical Society 7, 48--50.]]Google ScholarGoogle ScholarCross RefCross Ref
  26. Lin, K., Jagadish, H., and Faloutsos, C. 1995. The TV-tree: An index structure for high-dimensional data. VLDB Journal 3, 4, 517--542.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Fifth Berkeley Symposium on Mathematical statistics and probability. University of California Press, 281--297.]]Google ScholarGoogle Scholar
  28. Ooi, B. C., Tan, K. L., Yu, C., and Bressan, S. 2000. Indexing the edge: a simple and yet efficient approach to high-dimensional indexing. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. 166--174.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Pagel, B.-U., Korn, F., and Faloutsos, C. 2000. Deflating the dimensionality curse using multiple fractal dimensions. In Proceedings of the International Conference on Data Engineering.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Ramakrishnan, R. and Gehrke, J. 2000. Database Management Systems. McGraw-Hill.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Sakurai, Y., Yoshikawa, M., and Uemura, S. 2000. The a-tree: An index structure for high-dimensional spaces using relative approximation. In Proceedings of the International Conference on Very Large Data Bases. 516--526.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Tao, Y., Faloutsos, C., and Papadias, D. 2003. The power-method: A comprehensive estimation technique for multi-dimensional queries. In Proceedings of the Conference on Information and Knowledge Management.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Traina, A., Seeger, B., and Faloutsos, C. 2000. Slim-trees: high performance metric trees minimizing overlap between nodes. In Advances in Database Technology---EDBT 2000, International Conference on Extending Database Technology, Konstanz, Germany, March 27--31, 2000, Proceedings. Lecture Notes in Computer Science, vol. 1777. Springer-Verlag, 51--65.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Weber, R., Schek, H., and Blott, S. 1998. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the International Conference on Very Large Data Bases. 194--205.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. White, D. and Jain, R. 1996. Similarity indexing with the SS-tree. In Proceedings of the International Conference on Data Engineering. 516--523.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Yu, C., Ooi, B. C., Tan, K. L., and Jagadish, H. 2001. Indexing the distance: an efficient method to knn processing. In Proceedings of the International Conference on Very Large Data Bases. 421--430.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Zhang, T., Ramakrishnan, R., and Livny, M. 1996. Birch: an efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data.]] Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. iDistance: An adaptive B+-tree based indexing method for nearest neighbor search

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image ACM Transactions on Database Systems
      ACM Transactions on Database Systems  Volume 30, Issue 2
      June 2005
      328 pages
      ISSN:0362-5915
      EISSN:1557-4644
      DOI:10.1145/1071610
      Issue’s Table of Contents

      Copyright © 2005 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 1 June 2005
      Published in tods Volume 30, Issue 2

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • article

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader