Skip to main content
Erschienen in: Knowledge and Information Systems 1/2016

01.10.2016 | Regular Paper

An efficient pruning strategy for approximate string matching over suffix tree

verfasst von: Huan Hu, Hongzhi Wang, Jianzhong Li, Hong Gao

Erschienen in: Knowledge and Information Systems | Ausgabe 1/2016

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Approximate string matching over suffix tree with depth-first search (ASM_ST_DFS), a classical algorithm in the field of approximate string matching, was originally proposed by Ricardo A. Baeza-Yates and Gaston H. Gonnet in 1990. The algorithm is one of the most excellent algorithms for approximate string matching if combined with other indexing techniques. However, its time complexity is sensitive to the length of pattern string because it searches \(m+k\) characters on each path from the root before backtracking. In this paper, we propose an efficient pruning strategy to solve this problem. We prove its correctness and efficiency in theory. Particularly, we proved that if the pruning strategy is adopted, it averagely searches O(k) characters on each path before backtracking instead of O(m). Considering each internal node of suffix tree has multiple branches, the pruning strategy should work very well. We also experimentally show that when k is much smaller than m, the efficiency improves hundreds of times, and when k is not much smaller than m, it is still several times faster. This is the first paper that tries to solve the backtracking problem of ASM_ST_DFS in both theory and practice.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Baeza-Yates RA, Gonnet GH (1999) A fast algorithm on average for all-against-all sequence matching. In: SPIRE/CRIWG. pp 16–23 Baeza-Yates RA, Gonnet GH (1999) A fast algorithm on average for all-against-all sequence matching. In: SPIRE/CRIWG. pp 16–23
3.
4.
Zurück zum Zitat Bunke H, Bühler U (1993) Applications of approximate string matching to 2d shape recognition. Pattern Recognit 26(12):1797–1812CrossRef Bunke H, Bühler U (1993) Applications of approximate string matching to 2d shape recognition. Pattern Recognit 26(12):1797–1812CrossRef
5.
Zurück zum Zitat Chang WI, Lampe J (1992) Theoretical and empirical comparisons of approximate string matching algorithms. In: Combinatorial pattern matching, third annual symposium, CPM 92, Tucson, Arizona, USA, April 29–May 1 1992, Proceedings, pp 175–184 Chang WI, Lampe J (1992) Theoretical and empirical comparisons of approximate string matching algorithms. In: Combinatorial pattern matching, third annual symposium, CPM 92, Tucson, Arizona, USA, April 29–May 1 1992, Proceedings, pp 175–184
7.
Zurück zum Zitat French JC, Powell AL, Schulman E (1997) Applications of approximate word matching in information retrieval. In: Proceedings of the sixth international conference on information and knowledge management (CIKM’97), Las Vegas, Nevada, November 10–14 1997, pp 9–15 French JC, Powell AL, Schulman E (1997) Applications of approximate word matching in information retrieval. In: Proceedings of the sixth international conference on information and knowledge management (CIKM’97), Las Vegas, Nevada, November 10–14 1997, pp 9–15
9.
10.
Zurück zum Zitat Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl 10:707–710MathSciNetMATH Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl 10:707–710MathSciNetMATH
11.
Zurück zum Zitat Mitra S, Acharya T, Luo J (2006) Data mining: multimedia, soft computing, and bioinformatics. J. Electron Imaging 15(1):019901CrossRef Mitra S, Acharya T, Luo J (2006) Data mining: multimedia, soft computing, and bioinformatics. J. Electron Imaging 15(1):019901CrossRef
12.
Zurück zum Zitat Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surv 33(1):31–88CrossRef Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surv 33(1):31–88CrossRef
13.
Zurück zum Zitat Navarro G, Baeza-Yates R (2000) A hybrid indexing method for approximate string matching. J Discrete Algorithms 1(1):205–239MathSciNet Navarro G, Baeza-Yates R (2000) A hybrid indexing method for approximate string matching. J Discrete Algorithms 1(1):205–239MathSciNet
14.
Zurück zum Zitat Navarro G, Raffinot M (2002) Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences. Cambridge University Press, CambridgeCrossRefMATH Navarro G, Raffinot M (2002) Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences. Cambridge University Press, CambridgeCrossRefMATH
15.
16.
Zurück zum Zitat Typke R, Wiering F, Veltkamp RC (2005) A survey of music information retrieval systems. In: ISMIR 2005, 6th international conference on music information retrieval, London, UK, 11–15 Sept 2005, Proceedings, pp 153–160 Typke R, Wiering F, Veltkamp RC (2005) A survey of music information retrieval systems. In: ISMIR 2005, 6th international conference on music information retrieval, London, UK, 11–15 Sept 2005, Proceedings, pp 153–160
18.
Zurück zum Zitat Ukkonen E (1993) Approximate string-matching over suffix trees. In: Combinatorial pattern matching, 4th annual symposium, CPM 93, Padova, Italy, 2–4 June 1993, Proceedings, pp 228–242 Ukkonen E (1993) Approximate string-matching over suffix trees. In: Combinatorial pattern matching, 4th annual symposium, CPM 93, Padova, Italy, 2–4 June 1993, Proceedings, pp 228–242
20.
Zurück zum Zitat Wandelt S, Starlinger J, Bux M, Leser U (2013) RCSI: scalable similarity search in thousand(s) of genomes. PVLDB 6(13):1534–1545 Wandelt S, Starlinger J, Bux M, Leser U (2013) RCSI: scalable similarity search in thousand(s) of genomes. PVLDB 6(13):1534–1545
21.
Zurück zum Zitat Wu S, Manber U (1992) Fast text searching allowing errors. Commun ACM 35(10):83–91CrossRef Wu S, Manber U (1992) Fast text searching allowing errors. Commun ACM 35(10):83–91CrossRef
22.
Zurück zum Zitat Yang X, Wang B, Li C, Wang J, Xie X (2013) Efficient direct search on compressed genomic data. In: 29th IEEE international conference on data engineering, ICDE 2013, Brisbane, Australia, 8–12 April 2013, pp 961–972 Yang X, Wang B, Li C, Wang J, Xie X (2013) Efficient direct search on compressed genomic data. In: 29th IEEE international conference on data engineering, ICDE 2013, Brisbane, Australia, 8–12 April 2013, pp 961–972
23.
Zurück zum Zitat Zhang Q, Chamberlain RD, Indeck RS, West BM, White J (2004) Massively parallel data mining using reconfigurable hardware: approximate string matching. In: 18th International parallel and distributed processing symposium (IPDPS 2004) CD-ROM/Abstracts proceedings, 26–30 April 2004. Santa Fe, New Mexico, USA Zhang Q, Chamberlain RD, Indeck RS, West BM, White J (2004) Massively parallel data mining using reconfigurable hardware: approximate string matching. In: 18th International parallel and distributed processing symposium (IPDPS 2004) CD-ROM/Abstracts proceedings, 26–30 April 2004. Santa Fe, New Mexico, USA
24.
Zurück zum Zitat Zobel J, Dart PW (1996) Phonetic string matching: lessons from information retrieval. In: Proceedings of the 19th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR’96, 18–22 August 1996, Zurich, Switzerland (Special Issue of the SIGIR Forum), pp 166–172 Zobel J, Dart PW (1996) Phonetic string matching: lessons from information retrieval. In: Proceedings of the 19th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR’96, 18–22 August 1996, Zurich, Switzerland (Special Issue of the SIGIR Forum), pp 166–172
Metadaten
Titel
An efficient pruning strategy for approximate string matching over suffix tree
verfasst von
Huan Hu
Hongzhi Wang
Jianzhong Li
Hong Gao
Publikationsdatum
01.10.2016
Verlag
Springer London
Erschienen in
Knowledge and Information Systems / Ausgabe 1/2016
Print ISSN: 0219-1377
Elektronische ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-015-0896-6

Weitere Artikel der Ausgabe 1/2016

Knowledge and Information Systems 1/2016 Zur Ausgabe