Abstract
The problem of searching the elements of a set that are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from pattern recognition to textual and multimedia information retrieval. We are interested in the rather general case where the similarity criterion defines a metric space, instead of the more restricted case of a vector space. Many solutions have been proposed in different areas, in many cases without cross-knowledge. Because of this, the same ideas have been reconceived several times, and very different presentations have been given for the same approaches. We present some basic results that explain the intrinsic difficulty of the search problem. This includes a quantitative definition of the elusive concept of "intrinsic dimensionality." We also present a unified view of all the known proposals to organize metric spaces, so as to be able to understand them under a common framework. Most approaches turn out to be variations on a few different concepts. We organize those works in a taxonomy that allows us to devise new algorithms from combinations of concepts not noticed before because of the lack of communication between different communities. We present experiments validating our results and comparing the existing approaches. We finish with recommendations for practitioners and open questions for future development.
- APERS, P., BLANKEN, H., AND HOUTSMA, M. 1997. Multimedia Databases in Perspective. Springer- Verlag, New York.]] Google ScholarDigital Library
- ARYA, S., MOUNT, D., NETANYAHU, N., SILVERMAN, R., AND WU, A. 1994. An optimal algorithm for approximate nearest neighbor searching in fixed dimension. In Proceedings of the Fifth ACM-SIAM Symposium on Discrete Algorithms (SODA'94), 573-583.]] Google ScholarDigital Library
- AURENHAMMER, F. 1991. Voronoi diagrams-A survey of a fundamental geometric data structure. ACM Comput. Surv. 23,3.]] Google ScholarDigital Library
- BAEZA-YATES, R. 1997. Searching: An algorithmic tour. In Encyclopedia of Computer Science and Technology, vol. 37, A. Kent and J. Williams, Eds., Marcel Dekker, New York, 331-359.]]Google Scholar
- BAEZA-YATES,R.AND NAVARRO, G. 1998. Fast approximate string matching in a dictionary. In Proceedings of the Fifth South American Symposium on String Processing and Information Retrieval (SPIRE'98), IEEE Computer Science Press, Los Alamitos, Calif., 14-22.]]Google ScholarCross Ref
- BAEZA-YATES,R.AND RIBEIRO-NETO, B. 1999. Modern Information Retrieval. Addison-Wesley, Reading, Mass.]] Google ScholarDigital Library
- BAEZA-YATES, R., CUNTO, W., MANBER,U.,AND WU,S. 1994. Proximity matching using fixed-queries trees. In Proceedings of the Fifth Combinatorial Pattern Matching (CPM'94), Lecture Notes in Computer Science, vol. 807, 198-212.]] Google ScholarDigital Library
- BENTLEY, J. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9, 509-517.]] Google ScholarDigital Library
- BENTLEY, J. 1979. Multidimensional binary search trees in database applications. IEEE Trans. Softw. Eng. 5, 4, 333-340.]]Google ScholarDigital Library
- BENTLEY, J., WEIDE,B.,AND YAO, A. 1980. Optimal expected-time algorithms for closest point problems. ACM Trans. Math. Softw. 6, 4, 563-580.]] Google ScholarDigital Library
- BERCHTOLD, S., KEIM,D.,AND KRIEGEL, H. 1996. The X-tree: An index structure for high-dimensional data. In Proceedings of the 22nd Conference on Very Large Databases (VLDB'96), 28-39.]] Google ScholarDigital Library
- BHANU, B., PENG,J.,AND QING, S. 1998. Learning feature relevance and similarity metrics in image databases. In Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries (Santa Barbara, Calif.), IEEE Computer Society, Los Alamitos, Calif., 14-18.]] Google ScholarDigital Library
- BIMBO,A.D.AND VICARIO, E. 1998. Using weighted spatial relationships in retrieval by visual contents. In Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries (Santa Barbara, Calif.), IEEE Computer Society, Los Alamitos, Calif., 35-39.]] Google ScholarDigital Library
- B~HM, C., BERCHTOLD,S.,AND KEIM, A. 2002. Searching in high-dimensional spaces-index structures for improving the performance of multimedia databases. ACM Comput. Surv. To appear.]] Google ScholarDigital Library
- BOZKAYA,T.AND OZSOYOGLU, M. 1997. Distancebased indexing for high-dimensional metric spaces. In Proceedings of ACM SIGMOD International Conference on Management of Data, SIGMOD Rec. 26, 2, 357-368.]] Google ScholarDigital Library
- BRIN, S. 1995. Near neighbor search in large metric spaces. In Proceedings of the 21st Conference on Very Large Databases (VLDB'95), 574-584.]] Google ScholarDigital Library
- BRITO, M., CH~VEZ, E., QUIROZ, A., AND YUKICH,J. 1996. Connectivity of the mutual k-nearest neighbor graph in clustering and outlier detection. Stat. Probab. Lett. 35, 33-42.]]Google ScholarCross Ref
- BUGNION, E., FHEI, S., ROOS, T., WIDMAYER,P., AND WIDMER, F. 1993. A spatial index for approximate multiple string matching. In Proceedings of the First South American Workshop on String Processing (WSP'93), R. Baeza-Yates and N. Ziviani, Eds., 43-53.]]Google Scholar
- BURKHARD,W.AND KELLER, R. 1973. Some approaches to best-match file searching. Commun. ACM 16, 4, 230-236.]] Google ScholarDigital Library
- CASCIA, M. L., SETHI,S.,AND SCLAROFF, S. 1998. Combining textual and visual cues for content-based image retrieval on the world wide web. In Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries (Santa Barbara, Calif.), IEEE Computer Society, Los Alamitos, Calif., 24-28.]] Google ScholarDigital Library
- CH~VEZ, E. 1999. Optimal discretization for pivot based algorithms. Manuscript. ftp:// garota.fismat.umich.mx/pub/users/elchavez/ minimax.ps.gz.]]Google Scholar
- CH~VEZ,E.AND MARROQUYN, J. 1997. Proximity queries in metric spaces. In Proceedings of the Fourth South American Workshop on String Processing ( WSP'97), R. Baeza-Yates, Ed. Carleton University Press, Ottawa, 21-36.]]Google Scholar
- CH~VEZ,E.AND NAVARRO, G. 2000. An effective clustering algorithm to index high dimensional metric spaces. In Proceedings of the Seventh String Processing and Informational Retrieval (SPIRE'00), IEEE CS Press, 75-86.]] Google ScholarDigital Library
- CH~VEZ,E.AND NAVARRO, G. 2001a. Towards measuring the searching complexity of metric spaces. In Proceedings of the Mexican Computing Meeting, vol. II, 969-972.]]Google Scholar
- CH~VEZ,E.AND NAVARRO, G. 2001b. A probabilistic spell for the curse of dimensionality. In Proceedings of the Third Workshop on Algorithm Engineering and Experimentation (ALENEX'01), 147-160. Lecture Notes in Computer Science v. 2153.]] Google ScholarDigital Library
- CH~VEZ,E.AND NAVARRO, G. 2002. A metric index for approximate string matching. In Proceedings of the Fifth Latin American Symposium on Theoretical Informatics (LATIN'02), To appear, Lecture Notes in Computer Science.]] Google ScholarDigital Library
- CH~VEZ, E., MARROQU~N,J.,AND BAEZA-YATES,R. 1999. Spaghettis: An array based algorithm for similarity queries in metric spaces. In Proceedings of String Processing and Information Retrieval (SPIRE'99), IEEE Computer Science Press, Los Alamitos, Calif., 38-46.]] Google ScholarDigital Library
- CH~VEZ, E., MARROQUIN,J.L.,AND NAVARRO, G. 2001. Fixed queries array: a fast and economical data structure for proximity searching. Multimedia Tools and Applications 14 (2), 113-135. Kluwer.]] Google ScholarDigital Library
- CHAZELLE, B. 1994. Computational geometry: A retrospective. In Proceedings of the 26th ACM Symposium on the Theory of Computing (STOC'94), 75-94.]] Google ScholarDigital Library
- CHIUEH, T. 1994. Content-based image indexing. In Proceedings of the Twentieth Conference on Very Large Databases ( VLDB'94), 582-593.]] Google ScholarDigital Library
- CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1997. M- tree: An efficient access method for similarity search in metric spaces. In Proceedings of the 23rd Conference on Very Large Databases ( VLDB'97), 426-435.]] Google ScholarDigital Library
- CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1998a. A cost model for similarity queries in metric spaces. In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'98).]] Google ScholarDigital Library
- CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1998b. Processing complex similarity queries with distance-based access methods. In Proceedings of the Sixth International Conference on Extending Database Technology (EDBT'98).]] Google ScholarDigital Library
- CLARKSON, K. 1999. Nearest neighbor queries in metric spaces. Discrete Comput. Geom. 22,1, 63-93.]]Google ScholarCross Ref
- COX,T.AND COX, M. 1994. Multidimensional Scaling. Chapman and Hall, London.]]Google Scholar
- DEHNE,F.AND NOLTEMEIER, H. 1987. Voronoi trees and clustering problems. Inf. Syst. 12,2, 171-175.]] Google ScholarDigital Library
- DEVROYE, L. 1987. A Course in Density Estimation. Birkhauser, Boston.]] Google ScholarDigital Library
- DUDA,R.AND HART, P. 1973. Pattern Classification and Scene Analysis. Wiley, New York.]]Google Scholar
- FALOUTSOS,C.AND KAMEL, I. 1994. Beyond uniformity and independence: Analysis of R-trees using the concept of fractal dimension. In Proceedings of the Thirteenth ACM Symposium on Principles of Database Principles (PODS'94), 4-13.]] Google ScholarDigital Library
- FALOUTSOS,C.AND LIN, K. 1995. Fastmap: A fast algorithm for indexing, data mining and visualization of traditional and multimedia datasets. ACM SIGMOD Rec. 24, 2, 163-174.]] Google ScholarDigital Library
- FALOUTSOS, C., EQUITZ, W., FLICKNER, M., NIBLACK,W., PETKOVIC,D.,AND BARBER, R. 1994. Efficient and effective querying by image content. J. Intell. Inf. Syst. 3, 3/4, 231-262.]] Google ScholarDigital Library
- FARAG~, A., LINDER,T.,AND LUGOSI, G. 1993. Fast nearest-neighbor search in dissimilarity spaces. IEEE Trans. Patt. Anal. Mach. Intell. 15,9, 957-962.]] Google ScholarDigital Library
- FRAKES,W.AND BAEZA-YATES, R., EDS. 1992. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, N.J.]] Google ScholarDigital Library
- GAEDE,V.AND GUNTHER, O. 1998. Multidimensional access methods. ACMComput. Surv. 30,2, 170-231.]] Google ScholarDigital Library
- GUTTMAN, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 47-57.]] Google ScholarDigital Library
- HAIR, J., ANDERSON, R., TATHAM, R., AND BLACK,W. 1995. Multivariate Data Analysis with Readings, 4th ed. Prentice-Hall, Englewood Cliffs, N.J.]] Google ScholarDigital Library
- HJALTASON,G.AND SAMET, H. 1995. Ranking in spatial databases. In Proceedings of Fourth International Symposium on Large Spatial Databases, 83-95.]] Google ScholarDigital Library
- JAIN,A.AND DUBES, R. 1988. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, N.J.]] Google ScholarDigital Library
- KALANTARI,I.AND MCDONALD, G. 1983. A data structure and an algorithm for the nearest point problem. IEEE Trans. Softw. Eng. 9,5.]]Google Scholar
- MEHLHORN, K. 1984. Data Structures and Algorithms, Volume III "Multidimensional Searching and Computational Geometry. Springer-Verlag, New York.]] Google ScholarDigital Library
- MICO, L., ONCINA,J.,AND CARRASCO, R. 1996. A fast branch and bound nearest neighbour classifier in metric spaces. Patt. Recogn. Lett. 17, 731-739.]] Google ScholarDigital Library
- MICO, L., ONCINA,J.,AND VIDAL, E. 1994. A new version of the nearest-neighbor approximating and eliminating search (AESA) with linear preprocessing-time and memory requirements. Patt. Recog. Lett. 15, 9-17.]] Google ScholarDigital Library
- NAVARRO, G. 1999. Searching in metric spaces by spatial approximation. In Proceedings of String Processing and Information Retrieval (SPIRE'99), IEEE Computer Science Press, Los Alamitos, Calif., 141-148.]] Google ScholarDigital Library
- NAVARRO,G.AND REYES, N. 2001. Dynamic spatial approximation trees. In Proceedings of the Eleventh Conference of the Chilean Computer Science Society (SCCC'01), IEEE CS Press, 213-222.]]Google ScholarCross Ref
- NENE,S.AND NAYAR, S. 1997. A simple algorithm for nearest neighbor search in high dimensions. IEEE Trans. Patt. Anal. Mach. Intell. 19,9, 989-1003.]] Google ScholarDigital Library
- NIEVERGELT,J.AND HINTERBERGER, H. 1984. The grid file: An adaptable, symmetric multikey file structure. ACM Trans. Database Syst. 9,1, 38-71.]] Google ScholarDigital Library
- NOLTEMEIER, H. 1989. Voronoi trees and applications. In Proceedings of the International Workshop on Discrete Algorithms and Complex-ity (Fukuoka, Japan), 69-74.]]Google Scholar
- NOLTEMEIER, H., VERBARG, K., AND ZIRKELBACH,C. 1992. Monotonous bisector" treesA tool for efficient partitioning of complex schenes of geometric objects. In Data Structures and Efficient Algorithms, Lecture Notes in Computer Science, vol. 594, Springer-Verlag, New York, 186-203.]] Google ScholarDigital Library
- PRABHAKAR, S., AGRAWAL,D.,AND ABBADI, A. E. 1998. Efficient disk allocation for fast similarity searching. In Proceedings of ACM SPAA'98 (Puerto Vallarta, Mexico).]] Google ScholarDigital Library
- ROUSSOPOULOS, N., KELLEY,S.,AND VINCENT, F. 1995. Nearest neighbor queries. In Proceedings of the ACM SIGMOD'95, 71-79.]] Google ScholarDigital Library
- SALTON,G.AND MCGILL, M. 1983. Introduction to Modern Information Retrieval. McGraw-Hill, New York.]] Google ScholarDigital Library
- SAMET, H. 1984. The quadtree and related hierarchical data structures. ACM Comput. Surv. 16, 2, 187-260.]] Google ScholarDigital Library
- SANKOFF,D.AND KRUSKAL, J., EDS. 1983. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, Mass.]]Google Scholar
- SASHA,D.AND WANG, T. 1990. New techniques for best-match retrieval. ACM Trans. Inf. Syst. 8,2, 140-158.]] Google ScholarDigital Library
- SHAPIRO, M. 1977. The choice of reference points in best-match file searching. Commun. ACM 20, 5, 339-343.]] Google ScholarDigital Library
- SUTTON,R.AND BARTO, A. 1998. Reinforcement Learning : An Introduction. MIT Press, Cambridge, Mass.]] Google ScholarDigital Library
- UHLMANN, J. 1991a. Implementing metric trees to satisfy general proximity/similarity queries. Manuscript.]]Google Scholar
- UHLMANN, J. 1991b. Satisfying general proximity/ similarity queries with metric trees. Inf. Proc. Lett. 40, 175-179.]]Google ScholarCross Ref
- VERBARG, K. 1995. The C-Ttree: A dynamically balanced spatial index. Comput. Suppl. 10, 323-340.]]Google ScholarCross Ref
- VIDAL, E. 1986. An algorithm for finding nearest neighbors in (approximately) constant average time. Patt. Recogn. Lett. 4, 145-157.]] Google ScholarDigital Library
- WATERMAN, M. 1995. Introduction to Computational Biology. Chapman and Hall, London.]]Google Scholar
- WEBER,R.SCHEK, H.-J., AND BLOTT, S. 1998. A quantitative analysis and performance study for similarity-search methods in highdimensional spaces. In Proceedings of the International Conference on Very Large Databases (VLDB'98).]] Google ScholarDigital Library
- WHITE,D.AND JAIN, R. 1996. Algorithms and strategies for similarity retrieval. Tech. Rep. VCL-96-101 (July), Visual Computing Laboratory, University of California, La Jolla.]]Google Scholar
- YAO, A. 1990. Computational Geometry,J.Van Leeuwen, Ed. Elsevier Science, New York, 345-380.]]Google Scholar
- YIANILOS, P. 1993. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the Fourth ACM-SIAM Symposium on Discrete Algorithms (SODA'93), 311-321.]] Google ScholarDigital Library
- YIANILOS, P. 1999. Excluded middle vantage point forests for nearest neighbor search. In DIMACS Implementation Challenge, ALENEX'99 (Baltimore, Md).]]Google Scholar
- YIANILOS, P. 2000. Locally lifting the curse of dimensionality for nearest neighbor search. In Proceedings of the Eleventh ACM-SIAM Symposium on Discrete Algorithms (SODA'00), 361-370.]] Google ScholarDigital Library
- YOSHITAKA,A.AND ICHIKAWA, T. 1999. A survey on content-based retrieval for multimedia databases. IEEE Trans. Knowl. Data Eng. 11,1, 81-93.]] Google ScholarDigital Library
Index Terms
- Searching in metric spaces
Recommendations
Index-driven similarity search in metric spaces (Survey Article)
Similarity search is a very important operation in multimedia databases and other database applications involving complex objects, and involves finding objects in a data set S similar to a query object q, based on some similarity measure. In this ...
A comprehensive empirical comparison of hubness reduction in high-dimensional spaces
Hubness is an aspect of the curse of dimensionality related to the distance concentration effect. Hubs occur in high-dimensional data spaces as objects that are particularly often among the nearest neighbors of other objects. Conversely, other data ...
How does high dimensionality affect collaborative filtering?
RecSys '09: Proceedings of the third ACM conference on Recommender systemsA crucial operation in memory-based collaborative filtering (CF) is determining nearest neighbors (NNs) of users/items. This paper addresses two phenomena that emerge when CF algorithms perform NN search in high-dimensional spaces that are typical in CF ...
Comments