article

Free Access

Indexing large metric spaces for similarity search queries

Authors:
Tolga Bozkaya

Oracle Corp., Redwood Shores, CA

Oracle Corp., Redwood Shores, CA
View Profile

,
Meral Ozsoyoglu

Case Western Reserve Univ., Cleveland, OH

Case Western Reserve Univ., Cleveland, OH
View Profile

Authors Info & Claims

ACM Transactions on Database Systems Volume 24 Issue 3pp 361–404https://doi.org/10.1145/328939.328959

Published:01 September 1999Publication History

ACM Transactions on Database Systems

Abstract

One of the common queries in many database applications is finding approximate matches to a given query item from a collection of data items. For example, given an image database, one may want to retrieve all images that are similar to a given query image. Distance-based index structures are proposed for applications where the distance computations between objects of the data domain are expensive (such as high-dimensional data) and the distance function is metric. In this paper we consider using distance-based index structures for similarity queries on large metric spaces. We elaborate on the approach that uses reference points (vantage points) to partition the data space into spherical shell-like regions in a hierarchical manner. We introduce the multivantage point tree structure (mvp-tree) that uses more than one vantage point to partiton the space into spherical cuts at each level. In answering similarity-based queries, the mvp-tree also utilizes the precomputed (at construction time) distances between the data points and the vantage points.

We summarize the experiments comparing mvp-trees to vp-trees that have a similar partitioning strategy, but use only one vantage point at each level and do not make use of the precomputed distances. Empirical studies show that the mvp-tree outperforms the vp-tree by 20% to 80% for varying query ranges and different distance distributions. Next, we generalize the idea of using multiple vantage points and discuss the results of experiments we have made to see how varying the number of vantage points in a node affects affects performance and how much is gained in performance by making use of precomputed distances. The results show that, after all, it may be best to use a large number of vantage points in an internal node in order to end up with a single directory node and keep as many of the precomputed distances as possible to provide more efficient filtering during search operations. Finally, we provide some experimental results that compare mvp-trees with M-trees, which is a dynamic distance-based index structure for metric domains.

References

AGRAWAL, R., FALOUTSOS, C., AND SWAMI, A. 1993. Efficient similarity search in sequence databases. In Proceedings of the Conference on FODO Google ScholarDigital Library
BURKHARD,W.A.AND KELLER, R. 1973. Some approaches to best-match file searching. Commun. ACM 16, 4 (Apr.), 230-236. Google ScholarDigital Library
BECKMANN, N., KRIEGEL, H.-P., SCHNEIDER, R., AND SEEGER, B. 1990. The R * -tree: An efficient and robust access method for points and rectangles. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data (SIGMOD '90, Atlantic City, NJ, May 23-25, 1990), H. Garcia-Molina, Ed. ACM Press, New York, NY, 322-331. Google ScholarDigital Library
BERCHTOLD, S., B~HM, C., KEIM,D.A.,AND KRIEGEL, H.-P. 1997. A cost model for nearest neighbor search in high-dimensional data space. In Proceedings of the 16th ACM SIGACT-SIGMOD- SIGART Symposium on Principles of Database Systems (PODS '97, Tucson, AZ, May 12-14, 1997), A. Mendelzon and Z. M. ~zsoyoglu, Eds. ACM Press, New York, NY, 78-86. Google ScholarDigital Library
BERCHTOLD, S., KEIM,D.A.,AND KRIEGEL, H.-P. 1996. The X-tree: An index structure for high-dimensional data. In Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB '96, Mumbai, India, Sept.) 28-39. Google ScholarDigital Library
BOZKAYA,T.AND OZSOYOGLU, M. 1997. Distance-based indexing for high-dimensional metric spaces. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '97, Tucson, AZ, May 13-15), J. M. Peckman, S. Ram, and M. Franklin, Eds. ACM Press, New York, NY, 357-368. Google ScholarDigital Library
BRIN, S. 1995. Near neighbor search in large metric space. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95, Zurich, Sept.) 574-584. Google ScholarDigital Library
CHIUEH, T. 1994. Content-based image indexing. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94, Santiago, Chile, Sept.) VLDB Endowment, Berkeley, CA, 582-593. Google ScholarDigital Library
CIACCIA,P.AND PATELLA, M. 1998a. Bulk loading the M-tree. In Proceedings of the Australasian Conference on Databases (ADC, Perth, Australia)Google Scholar
CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1997. M-tree: An efficient access method for similarity search in metric spaces. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB '97, Athens, Greece, Aug.) 426-435. Google ScholarDigital Library
CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1998b. Processing complex similarity queries with distance-based access methods. In Proceedings of the 6th International Conference on EDBT (EDBT, Mar.) Google ScholarDigital Library
CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1998. A cost model for similarity queries in metric spaces. In Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Princi-ples of Database Systems (PODS '98, Seattle, WA, June 1-3, 1998), A. Mendelson and J. Paredaens, Eds. ACM Press, New York, NY, 59-68. Google ScholarDigital Library
FALOUTSOS, C., BARBER, R., FLICKNER, M., HAFNER, J., NIBLACK, W., PETKOVIC, D., AND EQUITZ, W. 1994a. Efficient and effective querying by image content. J. Intell. Inf. Syst. 3, 3/4 (July 1994), 231-262. Google ScholarDigital Library
FALOUTSOS, C., RANGANATHAN, M., AND MANOLOPOULOS, Y. 1994b. Fast subsequence matching in time-series databases. In Proceedings of the 1994 ACM SIGMOD International Confer-ence on Management of Data (SIGMOD '94, Minneapolis, MN, May 24-27), R. T. Snodgrass and M. Winslett, Eds. ACM Press, New York, NY, 419-429. Google ScholarDigital Library
GUTTMAN, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD Conference on Management of Data ACM Press, New York, NY, 47-57. Google ScholarDigital Library
LIN, K-I., JAGADISH, H., AND FALOUTSOS, C. 1994. The TV-tree:An index structure for high dimensional data. VLDB J. 3 (Oct.), 517-542. Google ScholarDigital Library
OTTERMAN, M. 1992. Approximate matching with high dimensionality R-trees. Department of Computer Science, University of Maryland, College Park, MD.Google Scholar
ROUSSOPOULOS, N., KELLEY, S., AND VINCENT, F. 1995. Nearest neighbor queries. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD '95, San Jose, CA, May 23-25), M. Carey and D. Schneider, Eds. ACM Press, New York, NY, 71-79. Google ScholarDigital Library
SAMET, H. 1990. The Design and Analysis of Spatial Data Structures. Addison-Wesley Series in Computer Science. Addison-Wesley Longman Publ. Co., Inc., Reading, MA. Google ScholarDigital Library
SELLIS, T., ROUSSOPOULOS, N., AND FALOUTSOS, C. 1987. The R1-tree: A dynamic index for multi-dimensional objects. In Proceedings of the 13th International Conference on Very Large Data Bases (Brighton, UK, Sept.) 71-79. Google ScholarDigital Library
SHASHA,D.AND WANG, T.-L. 1990. New techniques for best-match retrieval. ACM Trans. Inf. Syst. 8, 2 (Apr. 1990), 140-158. Google ScholarDigital Library
UHLMANN, J. K. 1991. Satisfying general proximity/similarity queries with metric trees. Inf. Process. Lett. 40, 4 (Nov.), 175-179.Google ScholarCross Ref
BAEZA-YATES, R., CUNTO, W., MANBER, U., AND WU, S. 1994. Proximity matching using fixed-queries trees. In Proceedings of the Fifth Symposium on Combinatorial Pattern Matching (LNCS 807, June) Springer-Verlag, New York, 198-212. Google ScholarDigital Library
YIANNILOS, P. 1993. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms ACM Press, New York, NY, 311-321. Google ScholarDigital Library

Index Terms

Indexing large metric spaces for similarity search queries
1. Information systems
  1. Information retrieval
  2. Information storage systems
    1. Record storage systems
      1. Directory structures
        B-trees
2. Mathematics of computing
  1. Discrete mathematics
    1. Graph theory
      1. Trees

Recommendations

Distance-based indexing for high-dimensional metric spaces
SIGMOD '97: Proceedings of the 1997 ACM SIGMOD international conference on Management of data

In many database applications, one of the common queries is to find approximate matches to a given query item from a collection of data items. For example, given an image database, one may want to retrieve all images that are similar to a given query ...
Read More
On Index-Free Similarity Search in Metric Spaces
DEXA '09: Proceedings of the 20th International Conference on Database and Expert Systems Applications

Metric access methods (MAMs) serve as a tool for speeding similarity queries. However, all MAMs developed so far are index-based; they need to build an index on a given database. The indexing itself is either static (the whole database is indexed at ...
Read More
Distance-based indexing for high-dimensional metric spaces

In many database applications, one of the common queries is to find approximate matches to a given query item from a collection of data items. For example, given an image database, one may want to retrieve all images that are similar to a given query ...
Read More

Reviews

Reviewer: Yannis P. Manolopoulos

A method generalizing the existing vp-trees for indexing in metric spaces is presented. The superiority of the new method over vp-trees, and over another dynamic indexing method—M-trees—is examined experimentally. The problems of the straightforward generalization of vp-trees, that is, multiway vp-trees, are recognized. The authors clearly present the impact of both of their main contributions: the use of multiple vantage points and the storage of precomputed distances of data points to vantage points in their path. The transition from two vantage points to multiple vantage points is given gracefully. All important parameters related to performance are described and tuned experimentally. The paper ends with an extensive experimental comparison of all the methods. The theoretical basis of the paper is not as significant. The paper does not contain analytical models, which could verify the impact of the distribution of the distances (such models have been developed for the M-tree). The authors do not adequately explain why the use of multiple vantage points, instead of two vantage points, does not always pay off. As described in the paper, the best mvp-tree index turns out to be a two-level directory. When I/O cost is examined, we would not expect this to be optimal. Similarly, a complete comparison with the M-tree should examine the I/O cost, since the M-tree was designed to minimize this cost (and the number of distance calculations). Finally, the problem of dynamic updating should be considered, since balanced trees, such as M-trees, can handle it adequately. However, several of the above issues are identified as future work. The mvp-tree is currently the best indexing method for data in metric spaces. The authors present this generalization of the vp-tree indexing method and verify its superiority with extensive experimental results.

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Database Systems Volume 24, Issue 3
Sept. 1999
133 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/328939
Issue’s Table of Contents

Copyright © 1999 ACM
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 September 1999
Published in tods Volume 24, Issue 3

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 187
  Total Citations
  View Citations
- 1,962
  Total Downloads
- Downloads (Last 12 months)107
- Downloads (Last 6 weeks)14
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Indexing large metric spaces for similarity search queries

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

Distance-based indexing for high-dimensional metric spaces

On Index-Free Similarity Search in Metric Spaces

Distance-based indexing for high-dimensional metric spaces

Reviews

Access critical reviews of Computing literature here

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Indexing large metric spaces for similarity search queries

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

Distance-based indexing for high-dimensional metric spaces

On Index-Free Similarity Search in Metric Spaces

Distance-based indexing for high-dimensional metric spaces

Reviews

Access critical reviews of Computing literature here

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media