Abstract
One of the common queries in many database applications is finding approximate matches to a given query item from a collection of data items. For example, given an image database, one may want to retrieve all images that are similar to a given query image. Distance-based index structures are proposed for applications where the distance computations between objects of the data domain are expensive (such as high-dimensional data) and the distance function is metric. In this paper we consider using distance-based index structures for similarity queries on large metric spaces. We elaborate on the approach that uses reference points (vantage points) to partition the data space into spherical shell-like regions in a hierarchical manner. We introduce the multivantage point tree structure (mvp-tree) that uses more than one vantage point to partiton the space into spherical cuts at each level. In answering similarity-based queries, the mvp-tree also utilizes the precomputed (at construction time) distances between the data points and the vantage points.
We summarize the experiments comparing mvp-trees to vp-trees that have a similar partitioning strategy, but use only one vantage point at each level and do not make use of the precomputed distances. Empirical studies show that the mvp-tree outperforms the vp-tree by 20% to 80% for varying query ranges and different distance distributions. Next, we generalize the idea of using multiple vantage points and discuss the results of experiments we have made to see how varying the number of vantage points in a node affects affects performance and how much is gained in performance by making use of precomputed distances. The results show that, after all, it may be best to use a large number of vantage points in an internal node in order to end up with a single directory node and keep as many of the precomputed distances as possible to provide more efficient filtering during search operations. Finally, we provide some experimental results that compare mvp-trees with M-trees, which is a dynamic distance-based index structure for metric domains.
- AGRAWAL, R., FALOUTSOS, C., AND SWAMI, A. 1993. Efficient similarity search in sequence databases. In Proceedings of the Conference on FODO Google ScholarDigital Library
- BURKHARD,W.A.AND KELLER, R. 1973. Some approaches to best-match file searching. Commun. ACM 16, 4 (Apr.), 230-236. Google ScholarDigital Library
- BECKMANN, N., KRIEGEL, H.-P., SCHNEIDER, R., AND SEEGER, B. 1990. The R * -tree: An efficient and robust access method for points and rectangles. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data (SIGMOD '90, Atlantic City, NJ, May 23-25, 1990), H. Garcia-Molina, Ed. ACM Press, New York, NY, 322-331. Google ScholarDigital Library
- BERCHTOLD, S., B~HM, C., KEIM,D.A.,AND KRIEGEL, H.-P. 1997. A cost model for nearest neighbor search in high-dimensional data space. In Proceedings of the 16th ACM SIGACT-SIGMOD- SIGART Symposium on Principles of Database Systems (PODS '97, Tucson, AZ, May 12-14, 1997), A. Mendelzon and Z. M. ~zsoyoglu, Eds. ACM Press, New York, NY, 78-86. Google ScholarDigital Library
- BERCHTOLD, S., KEIM,D.A.,AND KRIEGEL, H.-P. 1996. The X-tree: An index structure for high-dimensional data. In Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB '96, Mumbai, India, Sept.) 28-39. Google ScholarDigital Library
- BOZKAYA,T.AND OZSOYOGLU, M. 1997. Distance-based indexing for high-dimensional metric spaces. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '97, Tucson, AZ, May 13-15), J. M. Peckman, S. Ram, and M. Franklin, Eds. ACM Press, New York, NY, 357-368. Google ScholarDigital Library
- BRIN, S. 1995. Near neighbor search in large metric space. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95, Zurich, Sept.) 574-584. Google ScholarDigital Library
- CHIUEH, T. 1994. Content-based image indexing. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94, Santiago, Chile, Sept.) VLDB Endowment, Berkeley, CA, 582-593. Google ScholarDigital Library
- CIACCIA,P.AND PATELLA, M. 1998a. Bulk loading the M-tree. In Proceedings of the Australasian Conference on Databases (ADC, Perth, Australia)Google Scholar
- CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1997. M-tree: An efficient access method for similarity search in metric spaces. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB '97, Athens, Greece, Aug.) 426-435. Google ScholarDigital Library
- CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1998b. Processing complex similarity queries with distance-based access methods. In Proceedings of the 6th International Conference on EDBT (EDBT, Mar.) Google ScholarDigital Library
- CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1998. A cost model for similarity queries in metric spaces. In Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Princi-ples of Database Systems (PODS '98, Seattle, WA, June 1-3, 1998), A. Mendelson and J. Paredaens, Eds. ACM Press, New York, NY, 59-68. Google ScholarDigital Library
- FALOUTSOS, C., BARBER, R., FLICKNER, M., HAFNER, J., NIBLACK, W., PETKOVIC, D., AND EQUITZ, W. 1994a. Efficient and effective querying by image content. J. Intell. Inf. Syst. 3, 3/4 (July 1994), 231-262. Google ScholarDigital Library
- FALOUTSOS, C., RANGANATHAN, M., AND MANOLOPOULOS, Y. 1994b. Fast subsequence matching in time-series databases. In Proceedings of the 1994 ACM SIGMOD International Confer-ence on Management of Data (SIGMOD '94, Minneapolis, MN, May 24-27), R. T. Snodgrass and M. Winslett, Eds. ACM Press, New York, NY, 419-429. Google ScholarDigital Library
- GUTTMAN, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD Conference on Management of Data ACM Press, New York, NY, 47-57. Google ScholarDigital Library
- LIN, K-I., JAGADISH, H., AND FALOUTSOS, C. 1994. The TV-tree:An index structure for high dimensional data. VLDB J. 3 (Oct.), 517-542. Google ScholarDigital Library
- OTTERMAN, M. 1992. Approximate matching with high dimensionality R-trees. Department of Computer Science, University of Maryland, College Park, MD.Google Scholar
- ROUSSOPOULOS, N., KELLEY, S., AND VINCENT, F. 1995. Nearest neighbor queries. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD '95, San Jose, CA, May 23-25), M. Carey and D. Schneider, Eds. ACM Press, New York, NY, 71-79. Google ScholarDigital Library
- SAMET, H. 1990. The Design and Analysis of Spatial Data Structures. Addison-Wesley Series in Computer Science. Addison-Wesley Longman Publ. Co., Inc., Reading, MA. Google ScholarDigital Library
- SELLIS, T., ROUSSOPOULOS, N., AND FALOUTSOS, C. 1987. The R1-tree: A dynamic index for multi-dimensional objects. In Proceedings of the 13th International Conference on Very Large Data Bases (Brighton, UK, Sept.) 71-79. Google ScholarDigital Library
- SHASHA,D.AND WANG, T.-L. 1990. New techniques for best-match retrieval. ACM Trans. Inf. Syst. 8, 2 (Apr. 1990), 140-158. Google ScholarDigital Library
- UHLMANN, J. K. 1991. Satisfying general proximity/similarity queries with metric trees. Inf. Process. Lett. 40, 4 (Nov.), 175-179.Google ScholarCross Ref
- BAEZA-YATES, R., CUNTO, W., MANBER, U., AND WU, S. 1994. Proximity matching using fixed-queries trees. In Proceedings of the Fifth Symposium on Combinatorial Pattern Matching (LNCS 807, June) Springer-Verlag, New York, 198-212. Google ScholarDigital Library
- YIANNILOS, P. 1993. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms ACM Press, New York, NY, 311-321. Google ScholarDigital Library
Index Terms
- Indexing large metric spaces for similarity search queries
Recommendations
Distance-based indexing for high-dimensional metric spaces
SIGMOD '97: Proceedings of the 1997 ACM SIGMOD international conference on Management of dataIn many database applications, one of the common queries is to find approximate matches to a given query item from a collection of data items. For example, given an image database, one may want to retrieve all images that are similar to a given query ...
On Index-Free Similarity Search in Metric Spaces
DEXA '09: Proceedings of the 20th International Conference on Database and Expert Systems ApplicationsMetric access methods (MAMs) serve as a tool for speeding similarity queries. However, all MAMs developed so far are index-based; they need to build an index on a given database. The indexing itself is either static (the whole database is indexed at ...
Distance-based indexing for high-dimensional metric spaces
In many database applications, one of the common queries is to find approximate matches to a given query item from a collection of data items. For example, given an image database, one may want to retrieve all images that are similar to a given query ...
Comments