Skip to main content
Erschienen in: Knowledge and Information Systems 3/2016

01.03.2016 | Regular Paper

Extend tree edit distance for effective object identification

verfasst von: Yue Wang, Hongzhi Wang, Liyan Zhang, Yang Wang, Jianzhong Li, Hong Gao

Erschienen in: Knowledge and Information Systems | Ausgabe 3/2016

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Similarity join on XML documents which are usually modeled as rooted ordered labeled trees is widely applied, due to the ambiguity of references to the real-world objects. The conventional method dealing with this issue is based on tree edit distance, which is shortage of flexibility and efficiency. In this paper, we propose two novel edit operations together with extended tree edit distance, which can achieve good performance in similarity matching with hierarchical data structures [the run-time is \(O(n^{3})\) in the worst case]. And then, we propose \(k\)-generation set distance as a good approximation of the tree edit distance to further improve the join efficiency with quadric time complexity. Experiments on real and synthetic databases demonstrate the benefit of our method in efficiency and scalability.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Algergawy A, Nayak R, Saake R (2010) Element similarity measures in XML schema matching. J Inf Sci 180(24):4975–4998CrossRef Algergawy A, Nayak R, Saake R (2010) Element similarity measures in XML schema matching. J Inf Sci 180(24):4975–4998CrossRef
2.
Zurück zum Zitat Augsten N, Bohlen M, Gamper J (2005) Approximate matching of hierarchical data using pq-grams. In: Proceeding of the 31st VLDB conferences, Trondheim, pp 301–312 Augsten N, Bohlen M, Gamper J (2005) Approximate matching of hierarchical data using pq-grams. In: Proceeding of the 31st VLDB conferences, Trondheim, pp 301–312
3.
Zurück zum Zitat Augsten N, Bohlen M, Gamper J (2010) The pq-gram distance between ordered labeled trees. ACM Trans Database Syst 35(1):4(1)–4(36) Augsten N, Bohlen M, Gamper J (2010) The pq-gram distance between ordered labeled trees. ACM Trans Database Syst 35(1):4(1)–4(36)
5.
Zurück zum Zitat Demaine ED, Mozes S, Rossman B, Weimann O (2007) An optimal decomposition algorithm for tree edit distance. In: Arge L, Cachin C, Jurdzinski T, Tarlecki A (eds) ICALP, LNCS, vol 4596. Springer, Heidelberg, pp 146–157 Demaine ED, Mozes S, Rossman B, Weimann O (2007) An optimal decomposition algorithm for tree edit distance. In: Arge L, Cachin C, Jurdzinski T, Tarlecki A (eds) ICALP, LNCS, vol 4596. Springer, Heidelberg, pp 146–157
6.
Zurück zum Zitat Dulucq S, Touzet H (2003) Analysis of tree edit distance algorithms. In: Proceeding of the 14th annual symposium on Combinatorial Pattern Matching (CPM), pp 83–95 Dulucq S, Touzet H (2003) Analysis of tree edit distance algorithms. In: Proceeding of the 14th annual symposium on Combinatorial Pattern Matching (CPM), pp 83–95
7.
Zurück zum Zitat Garofalakis M, Kumar A (2005) XML stream processing using tree-edit distance embeddings. ACM Trans Database Syst 30(1):279–332CrossRef Garofalakis M, Kumar A (2005) XML stream processing using tree-edit distance embeddings. ACM Trans Database Syst 30(1):279–332CrossRef
8.
Zurück zum Zitat Guha S, Jagadish HV, Koudas N et al (2002) Approximate XML joins. ACM SIGMOD, Madison, Wisconsin Guha S, Jagadish HV, Koudas N et al (2002) Approximate XML joins. ACM SIGMOD, Madison, Wisconsin
9.
Zurück zum Zitat Guha S, Jagadish HV, Koudas N et al (2006) Integrating XML data sources using approximate joins. ACM Trans Database Syst 31(1):161–207CrossRef Guha S, Jagadish HV, Koudas N et al (2006) Integrating XML data sources using approximate joins. ACM Trans Database Syst 31(1):161–207CrossRef
10.
Zurück zum Zitat Han Z, Wang H, Gao H et al (2009) Clustering-based approximate join method on XML documents. J Comput Res Dev. ISSN1000-1239/CN 11–1177/TP46(Suppl.): 81–86 Han Z, Wang H, Gao H et al (2009) Clustering-based approximate join method on XML documents. J Comput Res Dev. ISSN1000-1239/CN 11–1177/TP46(Suppl.): 81–86
11.
Zurück zum Zitat Kailing K, Kriegel H, Schonauer S et al (2004) Efficient similarity search for hierarchical data in large databases. In: Bertino E, Christodoulakis S, Plexousakis D, Vassilis C, Koubarakis M, Böhm K, Ferrari E (eds) Advances in database technology - EDBT 2004. Lecture notes in computer science, vol 2992. Springer, Heidelberg, pp 676–693 Kailing K, Kriegel H, Schonauer S et al (2004) Efficient similarity search for hierarchical data in large databases. In: Bertino E, Christodoulakis S, Plexousakis D, Vassilis C, Koubarakis M, Böhm K, Ferrari E (eds) Advances in database technology - EDBT 2004. Lecture notes in computer science, vol 2992. Springer, Heidelberg, pp 676–693
12.
Zurück zum Zitat Klein PH (1998) Computing the edit-distance between unrooted ordered trees. ESA’98, LNCS 1461:91–102 Klein PH (1998) Computing the edit-distance between unrooted ordered trees. ESA’98, LNCS 1461:91–102
13.
Zurück zum Zitat Li F, Wang H, Hao L et al (2010) pq-hash: an efficient method for approximate XML joins. WAIM 2010 Workshops LNCS 6185:125–134 Li F, Wang H, Hao L et al (2010) pq-hash: an efficient method for approximate XML joins. WAIM 2010 Workshops LNCS 6185:125–134
14.
Zurück zum Zitat Li F, Wang H, Zhang C et al (2010) Approximate joins for XML using \(g\)-string. XSym 2010, LNCS 6309, pp. 3–17 Li F, Wang H, Zhang C et al (2010) Approximate joins for XML using \(g\)-string. XSym 2010, LNCS 6309, pp. 3–17
15.
Zurück zum Zitat Mozes S (2008) Some lower and upper bounds for tree edit distance. Department of Computer Science, Brown University, Providence, RI Mozes S (2008) Some lower and upper bounds for tree edit distance. Department of Computer Science, Brown University, Providence, RI
17.
Zurück zum Zitat Tatikonda S, Parthasarathy S (2010) Hashing tree-structured data: methods and applications. In: IEEE 26th international conference on data engineering (ICDE), pp 429–440 Tatikonda S, Parthasarathy S (2010) Hashing tree-structured data: methods and applications. In: IEEE 26th international conference on data engineering (ICDE), pp 429–440
18.
Zurück zum Zitat Wang Y, Wang H, Wang Y et al (2012) Similarity join on XML based on \(k\)-generation set distance. WAIM 2011 Workshops, LNCS 7142, Springer, Heidelberg, pp 124–135 Wang Y, Wang H, Wang Y et al (2012) Similarity join on XML based on \(k\)-generation set distance. WAIM 2011 Workshops, LNCS 7142, Springer, Heidelberg, pp 124–135
19.
Zurück zum Zitat Yang Z, Yang G (2004) A near-optimal similarity join algorithm and performance evaluation. J Inf Sci 167(1–4):87–108CrossRefMATH Yang Z, Yang G (2004) A near-optimal similarity join algorithm and performance evaluation. J Inf Sci 167(1–4):87–108CrossRefMATH
20.
Zurück zum Zitat Yi S, Huang B, Chan WT (2005) XML application schema matching using similarity measure and relaxation labeling. J Inf Sci 169(1–2):27–46CrossRefMATH Yi S, Huang B, Chan WT (2005) XML application schema matching using similarity measure and relaxation labeling. J Inf Sci 169(1–2):27–46CrossRefMATH
21.
Zurück zum Zitat Zhang K, Shasha D (1989) Simple fast algorithms for the editing distance between trees and related problems. SIAM J Comput 18(6):1245–1262CrossRefMathSciNetMATH Zhang K, Shasha D (1989) Simple fast algorithms for the editing distance between trees and related problems. SIAM J Comput 18(6):1245–1262CrossRefMathSciNetMATH
22.
Zurück zum Zitat Shasha D, Zhang K (1995) Approximate tree pattern matching. Pattern matching in strings, trees and arrays. chapter 14, Oxford University Press Shasha D, Zhang K (1995) Approximate tree pattern matching. Pattern matching in strings, trees and arrays. chapter 14, Oxford University Press
23.
Zurück zum Zitat Chawathe SS (1999) Comparing hierarchical data in external memory. In: Proceedings of the twenty-fifth International conference on very large data bases (VLDB), pp 90–101 Chawathe SS (1999) Comparing hierarchical data in external memory. In: Proceedings of the twenty-fifth International conference on very large data bases (VLDB), pp 90–101
24.
Zurück zum Zitat Nierman A, Jagadish HV (2002) Evaluating structural similarity in XML documents. In Proceedings of the 5th international workshop on the Web and Databases Nierman A, Jagadish HV (2002) Evaluating structural similarity in XML documents. In Proceedings of the 5th international workshop on the Web and Databases
25.
Zurück zum Zitat Flesca S, Manco G, Masciari G et al (2002) Detecting structural similarities between XML documents. In: Proceedings of WebDB, pp 55–60 Flesca S, Manco G, Masciari G et al (2002) Detecting structural similarities between XML documents. In: Proceedings of WebDB, pp 55–60
Metadaten
Titel
Extend tree edit distance for effective object identification
verfasst von
Yue Wang
Hongzhi Wang
Liyan Zhang
Yang Wang
Jianzhong Li
Hong Gao
Publikationsdatum
01.03.2016
Verlag
Springer London
Erschienen in
Knowledge and Information Systems / Ausgabe 3/2016
Print ISSN: 0219-1377
Elektronische ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-014-0816-1

Weitere Artikel der Ausgabe 3/2016

Knowledge and Information Systems 3/2016 Zur Ausgabe

Premium Partner