Skip to main content
Top
Published in: Datenbank-Spektrum 2/2017

20-06-2017 | Fachbeitrag

Reducing the Distance Calculations when Searching an M‑Tree

Authors: Steffen Guhlemann, Uwe Petersohn, Klaus Meyer-Wegener

Published in: Datenbank-Spektrum | Issue 2/2017

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Recent years have brought rising interest in efficiently searching for similar entities in a broad range of domains. Such search can be used to facilitate working with unstructured data such as genome sequences, text corpora, complex production information, or multimedia content, where queries always contain an amount of noise. In such domains the only common structure is a distance function obeying the axioms of a metric. As mostly no other structure information is available, a lot of distances have to be computed during the course of a search. Contrary to classical database indexes, where the optimization focus is on reducing the number of disk accesses (or in case of in-memory databases the number of tree traversal operations), a major cost driver in such multimedia domains is this number of distance calculations which can be very computation intense.
There exists a range of index structures for supporting similarity search in metric spaces. A very promising one is the M‑Tree, along with a number of compatible extensions (e. g. Slim-Tree, Bulk Loaded M‑Tree, multi way insertion M‑Tree, \(M^{2}\)-Tree, etc.). The M‑Tree family uses common algorithms for the \(k\)-nearest-neighbor and range search. These algorithms leave room for optimization in terms of necessary distance calculations. In this paper we present new algorithms for these tasks to considerably improve retrieval performance of all M‑Tree-compatible data structures.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Show more products
Footnotes
1
For some domains like string similarity, special indices using the domain structure exist (e. g. [18] or [30]), but they are not generally applicable.
 
2
A project seminar [23] compared different index structures in different domains. The M‑Tree had by far the best query performance in terms of necessary distance calculations.
 
3
Note, however, that there is a broad range of structures that also pretend to be an extension to the M‑Tree, like the Pivoting M‑Tree [33], the (B)\(M^{+}\)-Tree [42] or the CM-Tree [1], but are not compatible to the M‑Tree and also do not have the advantages of the M‑Tree. For example the (B)\(M^{+}\)-Tree can only handle Euclidean vector spaces for which better approaches like the kd-Tree exist.
 
4
As a simple example consider the Levenshtein edit distance of long texts. A basic distance calculation consumes time and space of \(O(N^{2})\) where \(N\) is the length of the texts. Reading the text from disk will require some time of \(O(\mathtt{Seek}+\mathtt{Read}\cdot N)\) where \(\mathtt{Seek}\) and \(\mathtt{Read}\)are constants. So, if the text is long enough, computing the distance will easily exceed the time to read the text from disk. The problem is worse for most multimedia domains, where one similarity calculation can be a complex optimization on its own.
 
5
Note that in case of a k‑nearest-neighbor query the query radius \(r_{q}\) will only be determined during the search. However, there will always be a known upper bound on this radius.
 
6
To decide on the parent node to insert a child and to adjust the parent-node radius, this distance must be computed in any case during insert.
 
7
Each node stores only one distance. Thus, the storage complexity is \(O(1)\). Even in absolute terms this single number is usually negligible as typical metric data (e. g. texts, images, genome sequences) are big. Of course in edge cases (2D-data) this can mean an increase of 33%.
 
8
AESA [38] is a similarity index structure that precalculates all bilateral distances. During search these distances can be used to prune objects based on few computed distances and the triangle inequality.
 
9
For the search algorithm it is usually assumed that all tree traversal and heuristics operations are essentially for free compared to a single distance computation. So they are used in abundance. This of course depends on the relative cost of a distance calculation. So it might be valid for typical metric domains (texts, multimedia) but is invalid when for example a low to medium dimensional euclidean distance is used.
 
10
[43] propose a cost-benefit ratio to choose between heuristics in the context of multi-feature similarity search.
 
11
The classical M‑Tree claims to be balanced and to have a minimal node capacity. However, this property depends on the choice of a balanced node split strategy which is the worst of all split options. The choice of a split strategy optimizing the tree structure for search leads to an unbalanced tree. (See for example [19].)
 
12
The effort of a single operation on such a queue is \(O(\log N)\) – so all operations will be a bit more expensive.
 
13
At each point the wafer has a deviation in x‑ and y‑direction from its nominal position.
 
Literature
1.
go back to reference Aronovich L, Spiegler I (2007) CM-tree: a dynamic clustered index for similarity search in metric databases. Data Knowl Eng 63(3):919–946CrossRef Aronovich L, Spiegler I (2007) CM-tree: a dynamic clustered index for similarity search in metric databases. Data Knowl Eng 63(3):919–946CrossRef
2.
go back to reference Baeza-Yates R, Cunto W, Manber U, Wu S (1994) Proximity matching using fixed-queries trees. In: Crochemore M, Gusfield D (eds) Combinatorial pattern matching. Lecture Notes in Computer Science, vol 807. Springer, Berlin Heidelberg, pp 198–212CrossRef Baeza-Yates R, Cunto W, Manber U, Wu S (1994) Proximity matching using fixed-queries trees. In: Crochemore M, Gusfield D (eds) Combinatorial pattern matching. Lecture Notes in Computer Science, vol 807. Springer, Berlin Heidelberg, pp 198–212CrossRef
3.
go back to reference Bartolini I, Ciaccia P, Patella M (2002) String matching with metric trees using an approximate distance. In: String processing and information retrieval. Springer, Berlin Heidelberg, pp 423–431 Bartolini I, Ciaccia P, Patella M (2002) String matching with metric trees using an approximate distance. In: String processing and information retrieval. Springer, Berlin Heidelberg, pp 423–431
4.
go back to reference Bozkaya T, Ozsoyoglu M (1997) Distance-based indexing for high-dimensional metric spaces. ACM SIGMOD Rec 26(2):357–368CrossRef Bozkaya T, Ozsoyoglu M (1997) Distance-based indexing for high-dimensional metric spaces. ACM SIGMOD Rec 26(2):357–368CrossRef
5.
go back to reference Brin S (1995) Near neighbor search in large metric spaces. In: Very Large Data Bases (VLDB). IEEE, Washington, D.C, pp 574–584 (Conference paper) Brin S (1995) Near neighbor search in large metric spaces. In: Very Large Data Bases (VLDB). IEEE, Washington, D.C, pp 574–584 (Conference paper)
6.
go back to reference Burkhard WA, Keller RM (1973) Some approaches to best-match file searching. Commun ACM 16(4):230–236CrossRefMATH Burkhard WA, Keller RM (1973) Some approaches to best-match file searching. Commun ACM 16(4):230–236CrossRefMATH
7.
go back to reference Bustos B, Navarro G, Chávez E (2003) Pivot selection techniques for proximity searching in metric spaces. Pattern Recognit Lett 24(14):2357–2366CrossRefMATH Bustos B, Navarro G, Chávez E (2003) Pivot selection techniques for proximity searching in metric spaces. Pattern Recognit Lett 24(14):2357–2366CrossRefMATH
8.
go back to reference Chávez E, Navarro G (2000) An effective clustering algorithm to index high dimensional metric spaces. Proc. 7th International Symposium on String Processing and Information Retrieval. IEEE CS Press, Washington DC, pp 75–86 Chávez E, Navarro G (2000) An effective clustering algorithm to index high dimensional metric spaces. Proc. 7th International Symposium on String Processing and Information Retrieval. IEEE CS Press, Washington DC, pp 75–86
9.
go back to reference Chávez E, Marroquín J, Navarro G (1999) Overcoming the curse of dimensionality. European Workshop on Content-based Multimedia Indexing (CBMI 99), pp 57–64 Chávez E, Marroquín J, Navarro G (1999) Overcoming the curse of dimensionality. European Workshop on Content-based Multimedia Indexing (CBMI 99), pp 57–64
10.
go back to reference Chávez E, Marroquín J, Navarro G (2001) Fixed queries array: a fast and economical data structure for proximity searching. Multimed Tools Appl 14(2):113–135CrossRefMATH Chávez E, Marroquín J, Navarro G (2001) Fixed queries array: a fast and economical data structure for proximity searching. Multimed Tools Appl 14(2):113–135CrossRefMATH
11.
go back to reference Ciaccia P, Patella M (1998) Bulk loading the M‑tree. Proc. 9th Australasian Database Conf. (ADC), pp 15–26MATH Ciaccia P, Patella M (1998) Bulk loading the M‑tree. Proc. 9th Australasian Database Conf. (ADC), pp 15–26MATH
12.
go back to reference Ciaccia P, Patella M (2000) The M2-tree: processing complex multi-feature queries with just one index. DELOS Workshop: Information Seeking, Searching and Querying in Digital Libraries. Ciaccia P, Patella M (2000) The M2-tree: processing complex multi-feature queries with just one index. DELOS Workshop: Information Seeking, Searching and Querying in Digital Libraries.
13.
go back to reference Ciaccia P, Patella M, Zezula P (1997) M‑tree: an efficient access method for similarity search in metric spaces. In: Very Large Data Bases (VLDB). ACM Press, New York, pp 426–435 (Conference paper) Ciaccia P, Patella M, Zezula P (1997) M‑tree: an efficient access method for similarity search in metric spaces. In: Very Large Data Bases (VLDB). ACM Press, New York, pp 426–435 (Conference paper)
14.
go back to reference Deepak P, Deshpande PM (2015) Operators for similarity search: semantics, techniques and usage scenarios. Springer, Berlin Heidelberg Deepak P, Deshpande PM (2015) Operators for similarity search: semantics, techniques and usage scenarios. Springer, Berlin Heidelberg
15.
go back to reference Dehne F, Noltemeier H (1988) Voronoi trees and clustering problems. In: Ferrate G, Pavlidis T, Sanfeliu A, Bunke H (eds) Syntactic and structural pattern recognition. NATO ASI Series, vol 45. Springer, Berlin Heidelberg, pp 185–194CrossRef Dehne F, Noltemeier H (1988) Voronoi trees and clustering problems. In: Ferrate G, Pavlidis T, Sanfeliu A, Bunke H (eds) Syntactic and structural pattern recognition. NATO ASI Series, vol 45. Springer, Berlin Heidelberg, pp 185–194CrossRef
16.
go back to reference Dohnal V, Gennaro C, Savino P, Zezula P (2003a) D‑index: distance searching index for metric data sets. Multimed Tools Appl 21(1):9–33CrossRef Dohnal V, Gennaro C, Savino P, Zezula P (2003a) D‑index: distance searching index for metric data sets. Multimed Tools Appl 21(1):9–33CrossRef
17.
go back to reference Dohnal V, Gennaro C, Zezula P (2003b) Similarity join in metric spaces using eD-index. In: Database and Expert Systems Applications (DEXA). Springer, Berlin Heidelberg, pp 484–493 (Conference paper)CrossRef Dohnal V, Gennaro C, Zezula P (2003b) Similarity join in metric spaces using eD-index. In: Database and Expert Systems Applications (DEXA). Springer, Berlin Heidelberg, pp 484–493 (Conference paper)CrossRef
18.
go back to reference Fenz D, Lange D, Rheinländer A, Naumann F, Leser U (2012) Efficient similarity search in very large string sets. In: Scientific and Statistical Database Management (SSDBM) Chania. (Conference paper) Fenz D, Lange D, Rheinländer A, Naumann F, Leser U (2012) Efficient similarity search in very large string sets. In: Scientific and Statistical Database Management (SSDBM) Chania. (Conference paper)
19.
go back to reference Guhlemann S (2016) Neue Indexverfahren für die Ähnlichkeitssuche in metrischen Räumen über großen Datenmengen. PhD thesis, TU Dresden Guhlemann S (2016) Neue Indexverfahren für die Ähnlichkeitssuche in metrischen Räumen über großen Datenmengen. PhD thesis, TU Dresden
20.
go back to reference Hetland ML (2009) The basic principles of metric indexing. Springer, Berlin Heidelberg, pp 199–232 Hetland ML (2009) The basic principles of metric indexing. Springer, Berlin Heidelberg, pp 199–232
21.
go back to reference Jagadish HV, Ooi BC, Tan KL, Yu C, Zhang R (2005) iDistance: an adaptive B+-tree based indexing method for nearest neighbor search. ACM Trans Database Syst (TODS) 30(2):364–397CrossRef Jagadish HV, Ooi BC, Tan KL, Yu C, Zhang R (2005) iDistance: an adaptive B+-tree based indexing method for nearest neighbor search. ACM Trans Database Syst (TODS) 30(2):364–397CrossRef
23.
go back to reference Lange D, Vogel T, Draisbach U, Naumann F (2011) Projektseminar ’Similarity Search Algorithms’. Datenbank Spektrum 11(1):51–57CrossRef Lange D, Vogel T, Draisbach U, Naumann F (2011) Projektseminar ’Similarity Search Algorithms’. Datenbank Spektrum 11(1):51–57CrossRef
24.
go back to reference Micó M, Oncina J, Vidal E (1994) A new version of the nearest-neighbour approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements. Pattern Recognit Lett 15(1):9–17CrossRef Micó M, Oncina J, Vidal E (1994) A new version of the nearest-neighbour approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements. Pattern Recognit Lett 15(1):9–17CrossRef
25.
go back to reference Noltemeier H, Verbarg K, Zirkelbach C (1992) Monotonous Bisector* Trees – a tool for efficient partitioning of complex scenes of geometric objects. In: Monien B, Ottmann T (eds) Data structures and efficient algorithms. Lecture Notes in Computer Science, vol 594, pp 186–203CrossRef Noltemeier H, Verbarg K, Zirkelbach C (1992) Monotonous Bisector* Trees – a tool for efficient partitioning of complex scenes of geometric objects. In: Monien B, Ottmann T (eds) Data structures and efficient algorithms. Lecture Notes in Computer Science, vol 594, pp 186–203CrossRef
26.
go back to reference Noltemeier H, Verbarg K, Zirkelbach C (1993) A data structure for representing and efficient querying large scenes of geometric objects: MB* trees. In: Farin G, Hagen H, Noltemeier H, Knödel W (eds) Geometric modelling. Springer, Vienna, pp 211–226CrossRef Noltemeier H, Verbarg K, Zirkelbach C (1993) A data structure for representing and efficient querying large scenes of geometric objects: MB* trees. In: Farin G, Hagen H, Noltemeier H, Knödel W (eds) Geometric modelling. Springer, Vienna, pp 211–226CrossRef
27.
go back to reference Novak D, Zezula P (2014) Rank aggregation of candidate sets for efficient similarity search. In: Decker H, Lhotská L, Link S, Spies M, Wagner RR (eds) Database and Expert Systems Applications (DEXA). Lecture Notes in Computer Science, vol 8645. Springer, Cham, pp 42–58 (Conference paper) Novak D, Zezula P (2014) Rank aggregation of candidate sets for efficient similarity search. In: Decker H, Lhotská L, Link S, Spies M, Wagner RR (eds) Database and Expert Systems Applications (DEXA). Lecture Notes in Computer Science, vol 8645. Springer, Cham, pp 42–58 (Conference paper)
28,.
go back to reference Novak D, Batko M, Zezula P (2011) Metric index: an efficient and scalable solution for precise and approximate similarity search. Inf Syst 36(4):721–733CrossRef Novak D, Batko M, Zezula P (2011) Metric index: an efficient and scalable solution for precise and approximate similarity search. Inf Syst 36(4):721–733CrossRef
29.
go back to reference Patella M (1999) Similarity search in multimedia databases. Dipartmento di Elettronica Informatica e Sistemistica, Bologna Patella M (1999) Similarity search in multimedia databases. Dipartmento di Elettronica Informatica e Sistemistica, Bologna
30.
go back to reference Rheinländer A, Knobloch M, Hochmuth N, Leser U (2010) Prefix tree indexing for similarity search and similarity joins on genomic data. In: Int. Conf. on Scientific and Statistical Database Management (SSDBM). Springer, Cham, pp 519–536 (Conference paper)CrossRef Rheinländer A, Knobloch M, Hochmuth N, Leser U (2010) Prefix tree indexing for similarity search and similarity joins on genomic data. In: Int. Conf. on Scientific and Statistical Database Management (SSDBM). Springer, Cham, pp 519–536 (Conference paper)CrossRef
31.
go back to reference Sedmidubsky J, Mic V, Zezula P (2015) Face image retrieval revisited. In: Int. Conf. on Similarity Search and Applications. Springer, Cham, pp 204–216 (Conference paper)CrossRef Sedmidubsky J, Mic V, Zezula P (2015) Face image retrieval revisited. In: Int. Conf. on Similarity Search and Applications. Springer, Cham, pp 204–216 (Conference paper)CrossRef
32.
go back to reference Shapiro M (1977) The choice of reference points in best-match file searching. Commun ACM 20(5):339–343CrossRef Shapiro M (1977) The choice of reference points in best-match file searching. Commun ACM 20(5):339–343CrossRef
33.
go back to reference Skopal T (2004) Pivoting M‑tree: a metric access method for efficient similarity search. In: Dateso Annual Int. Workshop on Databases, Texts, Specifications and Objects Desna, 14.-16. April 2004. Desna, pp 27–37 (Conference paper) Skopal T (2004) Pivoting M‑tree: a metric access method for efficient similarity search. In: Dateso Annual Int. Workshop on Databases, Texts, Specifications and Objects Desna, 14.-16. April 2004. Desna, pp 27–37 (Conference paper)
34.
go back to reference Skopal T, Pokornỳ J, Krátkỳ M, Snášel V (2003) Revisiting M‑tree building principles. In: Advances in Databases and Information Systems. Lecture Notes in Computer Science, vol 2798. Springer, Berlin Heidelberg, pp 148–162CrossRef Skopal T, Pokornỳ J, Krátkỳ M, Snášel V (2003) Revisiting M‑tree building principles. In: Advances in Databases and Information Systems. Lecture Notes in Computer Science, vol 2798. Springer, Berlin Heidelberg, pp 148–162CrossRef
35.
go back to reference Traina C, Traina A, Seeger B, Faloutsos C (2000) Slim-trees: High performance metric trees minimizing overlap between nodes. In: Advances in Database Technology, EDBT. Lecture Notes in Computer Science, vol 1777. Springer, Berlin Heidelberg, pp 51–65 Traina C, Traina A, Seeger B, Faloutsos C (2000) Slim-trees: High performance metric trees minimizing overlap between nodes. In: Advances in Database Technology, EDBT. Lecture Notes in Computer Science, vol 1777. Springer, Berlin Heidelberg, pp 51–65
36.
go back to reference Traina C Jr, Traina A, Faloutsos C, Seeger B (2002) Fast indexing and visualization of metric data sets using slim-trees. Knowl Data Eng IEEE Trans 14(2):244–260CrossRef Traina C Jr, Traina A, Faloutsos C, Seeger B (2002) Fast indexing and visualization of metric data sets using slim-trees. Knowl Data Eng IEEE Trans 14(2):244–260CrossRef
37.
go back to reference Uhlmann J (1991) Satisfying general proximity/similarity queries with metric trees. Inf Process Lett 40(4):175–179CrossRefMATH Uhlmann J (1991) Satisfying general proximity/similarity queries with metric trees. Inf Process Lett 40(4):175–179CrossRefMATH
38.
go back to reference Vidal E (1986) An algorithm for finding nearest neighbours in (approximately) constant average time. Pattern Recognit Lett 4(3):145–157CrossRef Vidal E (1986) An algorithm for finding nearest neighbours in (approximately) constant average time. Pattern Recognit Lett 4(3):145–157CrossRef
39.
go back to reference Yianilos P (1993) Data structures and algorithms for nearest neighbor search in general metric spaces. In: 4th Annual ACM-SIAM Symp. on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, pp 311–321 (Conference paper) Yianilos P (1993) Data structures and algorithms for nearest neighbor search in general metric spaces. In: 4th Annual ACM-SIAM Symp. on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, pp 311–321 (Conference paper)
40.
go back to reference Yianilos PN (1999) Excluded middle vantage point forests for nearest neighbor search. In: 6th DIMACS Implementation Challenge: Near Neighbor Searches (ALENEX) Baltimore. (Conference paper) Yianilos PN (1999) Excluded middle vantage point forests for nearest neighbor search. In: 6th DIMACS Implementation Challenge: Near Neighbor Searches (ALENEX) Baltimore. (Conference paper)
41.
go back to reference Zezula P, Amato G, Dohnal V, Batko M (2006) Similarity search: the metric space approach. Springer, BerlinMATH Zezula P, Amato G, Dohnal V, Batko M (2006) Similarity search: the metric space approach. Springer, BerlinMATH
42.
go back to reference Zhou X, Wang G, Zhou X, Yu G (2005) BM+-tree: a hyperplane-based index method for high-dimensional metric spaces. In: Database Systems for Advanced Applications. Springer, Berlin Heidelberg, pp 398–409CrossRef Zhou X, Wang G, Zhou X, Yu G (2005) BM+-tree: a hyperplane-based index method for high-dimensional metric spaces. In: Database Systems for Advanced Applications. Springer, Berlin Heidelberg, pp 398–409CrossRef
43.
go back to reference Zierenberg M, Schmitt I (2015) Optimizing the Distance Computation Order of Multi-Feature Similarity Search Indexing. In: Int. Conf. on Similarity Search and Applications. Springer, New York, pp 90–96 Zierenberg M, Schmitt I (2015) Optimizing the Distance Computation Order of Multi-Feature Similarity Search Indexing. In: Int. Conf. on Similarity Search and Applications. Springer, New York, pp 90–96
Metadata
Title
Reducing the Distance Calculations when Searching an M‑Tree
Authors
Steffen Guhlemann
Uwe Petersohn
Klaus Meyer-Wegener
Publication date
20-06-2017
Publisher
Springer Berlin Heidelberg
Published in
Datenbank-Spektrum / Issue 2/2017
Print ISSN: 1618-2162
Electronic ISSN: 1610-1995
DOI
https://doi.org/10.1007/s13222-017-0258-5

Other articles of this Issue 2/2017

Datenbank-Spektrum 2/2017 Go to the issue

Editorial

Editorial

Premium Partner