skip to main content
article
Free Access

Indexing large metric spaces for similarity search queries

Published:01 September 1999Publication History
Skip Abstract Section

Abstract

One of the common queries in many database applications is finding approximate matches to a given query item from a collection of data items. For example, given an image database, one may want to retrieve all images that are similar to a given query image. Distance-based index structures are proposed for applications where the distance computations between objects of the data domain are expensive (such as high-dimensional data) and the distance function is metric. In this paper we consider using distance-based index structures for similarity queries on large metric spaces. We elaborate on the approach that uses reference points (vantage points) to partition the data space into spherical shell-like regions in a hierarchical manner. We introduce the multivantage point tree structure (mvp-tree) that uses more than one vantage point to partiton the space into spherical cuts at each level. In answering similarity-based queries, the mvp-tree also utilizes the precomputed (at construction time) distances between the data points and the vantage points.

We summarize the experiments comparing mvp-trees to vp-trees that have a similar partitioning strategy, but use only one vantage point at each level and do not make use of the precomputed distances. Empirical studies show that the mvp-tree outperforms the vp-tree by 20% to 80% for varying query ranges and different distance distributions. Next, we generalize the idea of using multiple vantage points and discuss the results of experiments we have made to see how varying the number of vantage points in a node affects affects performance and how much is gained in performance by making use of precomputed distances. The results show that, after all, it may be best to use a large number of vantage points in an internal node in order to end up with a single directory node and keep as many of the precomputed distances as possible to provide more efficient filtering during search operations. Finally, we provide some experimental results that compare mvp-trees with M-trees, which is a dynamic distance-based index structure for metric domains.

References

  1. AGRAWAL, R., FALOUTSOS, C., AND SWAMI, A. 1993. Efficient similarity search in sequence databases. In Proceedings of the Conference on FODO Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. BURKHARD,W.A.AND KELLER, R. 1973. Some approaches to best-match file searching. Commun. ACM 16, 4 (Apr.), 230-236. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. BECKMANN, N., KRIEGEL, H.-P., SCHNEIDER, R., AND SEEGER, B. 1990. The R * -tree: An efficient and robust access method for points and rectangles. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data (SIGMOD '90, Atlantic City, NJ, May 23-25, 1990), H. Garcia-Molina, Ed. ACM Press, New York, NY, 322-331. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. BERCHTOLD, S., B~HM, C., KEIM,D.A.,AND KRIEGEL, H.-P. 1997. A cost model for nearest neighbor search in high-dimensional data space. In Proceedings of the 16th ACM SIGACT-SIGMOD- SIGART Symposium on Principles of Database Systems (PODS '97, Tucson, AZ, May 12-14, 1997), A. Mendelzon and Z. M. ~zsoyoglu, Eds. ACM Press, New York, NY, 78-86. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. BERCHTOLD, S., KEIM,D.A.,AND KRIEGEL, H.-P. 1996. The X-tree: An index structure for high-dimensional data. In Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB '96, Mumbai, India, Sept.) 28-39. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. BOZKAYA,T.AND OZSOYOGLU, M. 1997. Distance-based indexing for high-dimensional metric spaces. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '97, Tucson, AZ, May 13-15), J. M. Peckman, S. Ram, and M. Franklin, Eds. ACM Press, New York, NY, 357-368. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. BRIN, S. 1995. Near neighbor search in large metric space. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95, Zurich, Sept.) 574-584. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. CHIUEH, T. 1994. Content-based image indexing. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94, Santiago, Chile, Sept.) VLDB Endowment, Berkeley, CA, 582-593. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. CIACCIA,P.AND PATELLA, M. 1998a. Bulk loading the M-tree. In Proceedings of the Australasian Conference on Databases (ADC, Perth, Australia)Google ScholarGoogle Scholar
  10. CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1997. M-tree: An efficient access method for similarity search in metric spaces. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB '97, Athens, Greece, Aug.) 426-435. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1998b. Processing complex similarity queries with distance-based access methods. In Proceedings of the 6th International Conference on EDBT (EDBT, Mar.) Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1998. A cost model for similarity queries in metric spaces. In Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Princi-ples of Database Systems (PODS '98, Seattle, WA, June 1-3, 1998), A. Mendelson and J. Paredaens, Eds. ACM Press, New York, NY, 59-68. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. FALOUTSOS, C., BARBER, R., FLICKNER, M., HAFNER, J., NIBLACK, W., PETKOVIC, D., AND EQUITZ, W. 1994a. Efficient and effective querying by image content. J. Intell. Inf. Syst. 3, 3/4 (July 1994), 231-262. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. FALOUTSOS, C., RANGANATHAN, M., AND MANOLOPOULOS, Y. 1994b. Fast subsequence matching in time-series databases. In Proceedings of the 1994 ACM SIGMOD International Confer-ence on Management of Data (SIGMOD '94, Minneapolis, MN, May 24-27), R. T. Snodgrass and M. Winslett, Eds. ACM Press, New York, NY, 419-429. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. GUTTMAN, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD Conference on Management of Data ACM Press, New York, NY, 47-57. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. LIN, K-I., JAGADISH, H., AND FALOUTSOS, C. 1994. The TV-tree:An index structure for high dimensional data. VLDB J. 3 (Oct.), 517-542. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. OTTERMAN, M. 1992. Approximate matching with high dimensionality R-trees. Department of Computer Science, University of Maryland, College Park, MD.Google ScholarGoogle Scholar
  18. ROUSSOPOULOS, N., KELLEY, S., AND VINCENT, F. 1995. Nearest neighbor queries. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD '95, San Jose, CA, May 23-25), M. Carey and D. Schneider, Eds. ACM Press, New York, NY, 71-79. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. SAMET, H. 1990. The Design and Analysis of Spatial Data Structures. Addison-Wesley Series in Computer Science. Addison-Wesley Longman Publ. Co., Inc., Reading, MA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. SELLIS, T., ROUSSOPOULOS, N., AND FALOUTSOS, C. 1987. The R1-tree: A dynamic index for multi-dimensional objects. In Proceedings of the 13th International Conference on Very Large Data Bases (Brighton, UK, Sept.) 71-79. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. SHASHA,D.AND WANG, T.-L. 1990. New techniques for best-match retrieval. ACM Trans. Inf. Syst. 8, 2 (Apr. 1990), 140-158. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. UHLMANN, J. K. 1991. Satisfying general proximity/similarity queries with metric trees. Inf. Process. Lett. 40, 4 (Nov.), 175-179.Google ScholarGoogle ScholarCross RefCross Ref
  23. BAEZA-YATES, R., CUNTO, W., MANBER, U., AND WU, S. 1994. Proximity matching using fixed-queries trees. In Proceedings of the Fifth Symposium on Combinatorial Pattern Matching (LNCS 807, June) Springer-Verlag, New York, 198-212. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. YIANNILOS, P. 1993. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms ACM Press, New York, NY, 311-321. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Indexing large metric spaces for similarity search queries

            Recommendations

            Reviews

            Yannis P. Manolopoulos

            A method generalizing the existing vp-trees for indexing in metric spaces is presented. The superiority of the new method over vp-trees, and over another dynamic indexing method—M-trees—is examined experimentally. The problems of the straightforward generalization of vp-trees, that is, multiway vp-trees, are recognized. The authors clearly present the impact of both of their main contributions: the use of multiple vantage points and the storage of precomputed distances of data points to vantage points in their path. The transition from two vantage points to multiple vantage points is given gracefully. All important parameters related to performance are described and tuned experimentally. The paper ends with an extensive experimental comparison of all the methods. The theoretical basis of the paper is not as significant. The paper does not contain analytical models, which could verify the impact of the distribution of the distances (such models have been developed for the M-tree). The authors do not adequately explain why the use of multiple vantage points, instead of two vantage points, does not always pay off. As described in the paper, the best mvp-tree index turns out to be a two-level directory. When I/O cost is examined, we would not expect this to be optimal. Similarly, a complete comparison with the M-tree should examine the I/O cost, since the M-tree was designed to minimize this cost (and the number of distance calculations). Finally, the problem of dynamic updating should be considered, since balanced trees, such as M-trees, can handle it adequately. However, several of the above issues are identified as future work. The mvp-tree is currently the best indexing method for data in metric spaces. The authors present this generalization of the vp-tree indexing method and verify its superiority with extensive experimental results.

            Access critical reviews of Computing literature here

            Become a reviewer for Computing Reviews.

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in

            Full Access

            • Published in

              cover image ACM Transactions on Database Systems
              ACM Transactions on Database Systems  Volume 24, Issue 3
              Sept. 1999
              133 pages
              ISSN:0362-5915
              EISSN:1557-4644
              DOI:10.1145/328939
              Issue’s Table of Contents

              Copyright © 1999 ACM

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 1 September 1999
              Published in tods Volume 24, Issue 3

              Permissions

              Request permissions about this article.

              Request Permissions

              Check for updates

              Qualifiers

              • article

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader