Skip to main content
Erschienen in: Pattern Analysis and Applications 1/2007

01.02.2007 | Theoretical Advances

Breadth-first search strategies for trie-based syntactic pattern recognition

verfasst von: B. John Oommen, Ghada Badr

Erschienen in: Pattern Analysis and Applications | Ausgabe 1/2007

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Dictionary-based syntactic pattern recognition of strings attempts to recognize a transmitted string X *, by processing its noisy version, Y, without sequentially comparing Y with every element X in the finite, (but possibly, large) dictionary, H. The best estimate X + of X *, is defined as that element of H which minimizes the generalized Levenshtein distance (GLD) D(X, Y) between X and Y, for all XH. The non-sequential PR computation of X + involves a compact trie-based representation of H. In this paper, we show how we can optimize this computation by incorporating breadth first search schemes on the underlying graph structure. This heuristic emerges from the trie-based dynamic programming recursive equations, which can be effectively implemented using a new data structure called the linked list of prefixes that can be built separately or “on top of” the trie representation of H. The new scheme does not restrict the number of errors in Y to be merely a small constant, as is done in most of the available methods. The main contribution is that our new approach can be used for generalized GLDs and not merely for 0/1 costs. It is also applicable when all possible correct candidates need to be known, and not just the best match. These constitute the cases when the “cutoffs” cannot be used in the DFS trie-based technique (Shang and Merrettal in IEEE Trans Knowl Data Eng 8(4):540–547, 1996). The new technique is compared with the DFS trie-based technique (Risvik in United Patent 6377945 B1, 23 April 2002; Shang and Merrettal in IEEE Trans Knowl Data Eng 8(4):540–547, 1996) using three large and small benchmark dictionaries with different errors. In each case, we demonstrate marked improvements with regard to the operations needed up to 21%, while at the same time maintaining the same accuracy. Additionally, some further improvements can be obtained by introducing the knowledge of the maximum number or percentage of errors in Y.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
1
In terms of notation, A is a finite alphabet, H is a finite (but possibly large) dictionary, and μ is the null string, distinct from λ, the null symbol. The left derivative of order one of any string Z = z 1 z 2 ... z k is the string Z p  = z 1 z 2 ... z k-1. The left derivative of order two of Z is the left derivative of order one of Z p , and so on.
 
5
A BFS technique was earlier shown in [29] to be much more superior to a method which computes X + using sequential comparison between every XH and Y. The fact that it is also uniformly superior to a DFS-method is what we demonstrate here.
 
Literatur
1.
Zurück zum Zitat Acharya A, Zhu H, Shen K (1999) Adaptive algorithms for cache-efficient trie search. In: ACM and SIAM workshop on algorithm engineering and experimentation, January 1999, pp 296–311 Acharya A, Zhu H, Shen K (1999) Adaptive algorithms for cache-efficient trie search. In: ACM and SIAM workshop on algorithm engineering and experimentation, January 1999, pp 296–311
2.
Zurück zum Zitat Amengual JC, Vidal E (1998) Efficient error-correcting viterbi parsing. IEEE Trans Commun 20(10):1109–1116 Amengual JC, Vidal E (1998) Efficient error-correcting viterbi parsing. IEEE Trans Commun 20(10):1109–1116
3.
Zurück zum Zitat Amengual JC, Vidal E (1998) The viterbi algorithm. IEEE Trans Pattern Anal Mach Intell 20(10):268–278CrossRef Amengual JC, Vidal E (1998) The viterbi algorithm. IEEE Trans Pattern Anal Mach Intell 20(10):268–278CrossRef
4.
Zurück zum Zitat Baeza-Yates R, Navarro G (1998) Fast approximate string matching in a dictionary. In: Proceedings of the 5th South American symposium on string processing and information retrieval (SPIRE’98), IEEE CS Press, pp 14–22 Baeza-Yates R, Navarro G (1998) Fast approximate string matching in a dictionary. In: Proceedings of the 5th South American symposium on string processing and information retrieval (SPIRE’98), IEEE CS Press, pp 14–22
5.
Zurück zum Zitat Bentley J, Sedgewick R (1997) Fast algorithms for sorting and searching strings. In: 8th annual ACM-SIAM symposium on discrete algorithms, New Orleans, January 1997, pp 360–369 Bentley J, Sedgewick R (1997) Fast algorithms for sorting and searching strings. In: 8th annual ACM-SIAM symposium on discrete algorithms, New Orleans, January 1997, pp 360–369
6.
Zurück zum Zitat Bouloutas A, Hart GW, Schwartz M (1991) Two extensions of the viterbi algorithm. IEEE Trans Inf Theory 37(2):430–436CrossRefMathSciNet Bouloutas A, Hart GW, Schwartz M (1991) Two extensions of the viterbi algorithm. IEEE Trans Inf Theory 37(2):430–436CrossRefMathSciNet
7.
Zurück zum Zitat Bucher P, Hoffmann K (1996) A sequence similarity search algorithm based on a probabilistic interpretation of an alignment scoring system. In: Proceedings of the 4th international conference on intelligent systems for molecular biology, ISMB, vol 96. AAAI Press, Menlo Park, pp 44–51 Bucher P, Hoffmann K (1996) A sequence similarity search algorithm based on a probabilistic interpretation of an alignment scoring system. In: Proceedings of the 4th international conference on intelligent systems for molecular biology, ISMB, vol 96. AAAI Press, Menlo Park, pp 44–51
8.
Zurück zum Zitat Bunke H (1993) Structural and syntactic pattern recognition. In: Chen CH, Pau LF, Wang PSP (eds) Handbook of pattern recognition and computer vision. World Scientific, Singapore Bunke H (1993) Structural and syntactic pattern recognition. In: Chen CH, Pau LF, Wang PSP (eds) Handbook of pattern recognition and computer vision. World Scientific, Singapore
10.
Zurück zum Zitat Bunke H, Csirik J (1993) Parametric string edit distance and its application to pattern recognition. IEEE Trans Syst Man Cybern 25(1):202–206CrossRef Bunke H, Csirik J (1993) Parametric string edit distance and its application to pattern recognition. IEEE Trans Syst Man Cybern 25(1):202–206CrossRef
11.
Zurück zum Zitat Clement J, Flajolet P, Vallee B (1998) The analysis of hybrid trie structures. In: Proceedings of the annual a CM-SIAM symposium on discrete algorithms, San Francisco, California, 1998, pp 531–539 Clement J, Flajolet P, Vallee B (1998) The analysis of hybrid trie structures. In: Proceedings of the annual a CM-SIAM symposium on discrete algorithms, San Francisco, California, 1998, pp 531–539
12.
Zurück zum Zitat Cole R, Gottieb L, Lewenstein M (2004) Dictionary matching and indexing with errors and don’t cares. In; Proceedings of the 36th annual ACM aymposium on theory of computing, Chicago, IL, USA, June 2004, pp 91–100 Cole R, Gottieb L, Lewenstein M (2004) Dictionary matching and indexing with errors and don’t cares. In; Proceedings of the 36th annual ACM aymposium on theory of computing, Chicago, IL, USA, June 2004, pp 91–100
13.
Zurück zum Zitat Cormen TH, Leiserson CE, Rivest RL (1990) Introduction to algorithms. The MIT Press, CambridgeMATH Cormen TH, Leiserson CE, Rivest RL (1990) Introduction to algorithms. The MIT Press, CambridgeMATH
14.
Zurück zum Zitat Dewey G (1923) Relative frequency of English speech sounds. Harvard University Press, MA Dewey G (1923) Relative frequency of English speech sounds. Harvard University Press, MA
15.
Zurück zum Zitat Du M, Chang S (1994) An approach to designing very fast approximate string matching algorithms. IEEE Trans Knowl Data Eng 6(4):620–633CrossRef Du M, Chang S (1994) An approach to designing very fast approximate string matching algorithms. IEEE Trans Knowl Data Eng 6(4):620–633CrossRef
17.
Zurück zum Zitat Fuketa M, Sumitomo T, Shishibori M, Aoe J (1999) A suffix compression algorithm of tries. In: ICCPOL’99: 18th international conference on computer processing of original languages vol 18, pp 345–348 Fuketa M, Sumitomo T, Shishibori M, Aoe J (1999) A suffix compression algorithm of tries. In: ICCPOL’99: 18th international conference on computer processing of original languages vol 18, pp 345–348
18.
Zurück zum Zitat Kashyap RL, Oommen BJ (1981) An effective algorithm for string correction using generalized edit distances −i. Description of the algorithm and its optimality. Inf Sci 23(2):123–142CrossRef Kashyap RL, Oommen BJ (1981) An effective algorithm for string correction using generalized edit distances  −i. Description of the algorithm and its optimality. Inf Sci 23(2):123–142CrossRef
19.
Zurück zum Zitat Kashyap RL, Oommen BJ (1984) String correction using probabilistic methods. Pattern Recognit Lett pp 147–154 Kashyap RL, Oommen BJ (1984) String correction using probabilistic methods. Pattern Recognit Lett pp 147–154
20.
Zurück zum Zitat Levenshtein A (1966) Binary codes capable of correcting deletions, insertions and reversals. Soviet Phys Dokl 10:707–710MathSciNet Levenshtein A (1966) Binary codes capable of correcting deletions, insertions and reversals. Soviet Phys Dokl 10:707–710MathSciNet
22.
Zurück zum Zitat Mibov S, Schulz K (2002) Fast approximate string matching in large dictionaries. Available: www.cis.uni-muenchen.de//people//schulz//pub//fastapproxsearch.pdf Mibov S, Schulz K (2002) Fast approximate string matching in large dictionaries. Available: www.cis.uni-muenchen.de//people//schulz//pub//fastapproxsearch.pdf
23.
Zurück zum Zitat Miclet L (1990) Grammatical inference. Syntactic Struct Pattern Recognit Appl 237–290 Miclet L (1990) Grammatical inference. Syntactic Struct Pattern Recognit Appl 237–290
24.
Zurück zum Zitat Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surveys 33(1):31–88CrossRef Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surveys 33(1):31–88CrossRef
25.
Zurück zum Zitat Oflazer K (1996) Error-tolerant finite state recognition with applications to morphological analysis and spelling correction. Comput Linguist 22(1):73–89 Oflazer K (1996) Error-tolerant finite state recognition with applications to morphological analysis and spelling correction. Comput Linguist 22(1):73–89
26.
Zurück zum Zitat Okuda T, Tanaka E, Kasai T (1976) A method of correction of garbled words based on the levenshtein metric. IEEE Trans Comput 25:172–177MathSciNetMATHCrossRef Okuda T, Tanaka E, Kasai T (1976) A method of correction of garbled words based on the levenshtein metric. IEEE Trans Comput 25:172–177MathSciNetMATHCrossRef
28.
Zurück zum Zitat Oommen BJ (1987) Recognition of noisy subsequences using constrained edit distances. IEEE Trans Pattern Anal Mach Intell 9:676–685MATHCrossRef Oommen BJ (1987) Recognition of noisy subsequences using constrained edit distances. IEEE Trans Pattern Anal Mach Intell 9:676–685MATHCrossRef
29.
Zurück zum Zitat Oommen BJ, Badr G (2004) Dictionary-based syntactic pattern recognition using tries. In: Proceedings of the joint IARR international workshops SSPR 2004 and SPR 2004, Libon, Portugal, August 2004, pp 251–259 Oommen BJ, Badr G (2004) Dictionary-based syntactic pattern recognition using tries. In: Proceedings of the joint IARR international workshops SSPR 2004 and SPR 2004, Libon, Portugal, August 2004, pp 251–259
30.
Zurück zum Zitat Oommen BJ, Kashyap RL (1998) A formal theory for optimal and information theoretic syntactic pattern recognition. Pattern Recognit 31:1159–1177CrossRef Oommen BJ, Kashyap RL (1998) A formal theory for optimal and information theoretic syntactic pattern recognition. Pattern Recognit 31:1159–1177CrossRef
31.
Zurück zum Zitat Oommen BJ, Loke RKS. Syntactic pattern recognition involving traditional and generalized transposition errors: Attaining the information theoretic bound. (Submitted) Oommen BJ, Loke RKS. Syntactic pattern recognition involving traditional and generalized transposition errors: Attaining the information theoretic bound. (Submitted)
32.
Zurück zum Zitat Oommen BJ, Loke RKS (1997) Pattern recognition of strings with substitutions, insertions, deletions and generalized transposition. Pattern Recognit 30:789–800CrossRef Oommen BJ, Loke RKS (1997) Pattern recognition of strings with substitutions, insertions, deletions and generalized transposition. Pattern Recognit 30:789–800CrossRef
33.
Zurück zum Zitat Oommen BJ, Loke RKS (1999) Designing syntactic pattern classifiers using vector quantization and parametric string editing. IEEE Trans Syst Man Cybern 29:881–888 Oommen BJ, Loke RKS (1999) Designing syntactic pattern classifiers using vector quantization and parametric string editing. IEEE Trans Syst Man Cybern 29:881–888
34.
Zurück zum Zitat Perez-Cortes JC, Amengual JC, Arlandis J, Llobet R (2000) Stochastic error correcting parsing for ocr post-processing. In: International conference on pattern recognition ICPR-2000, Barcelona, 2000, pp 4405–4408 Perez-Cortes JC, Amengual JC, Arlandis J, Llobet R (2000) Stochastic error correcting parsing for ocr post-processing. In: International conference on pattern recognition ICPR-2000, Barcelona, 2000, pp 4405–4408
35.
Zurück zum Zitat Peterson JL (1980) Computer programs for detecting and correcting spelling errors. Commun Assoc Comput Mach 23:676–687 Peterson JL (1980) Computer programs for detecting and correcting spelling errors. Commun Assoc Comput Mach 23:676–687
36.
Zurück zum Zitat Risvik KM (2002) Search system and method for retrieval of data, and the use thereof in a search engine. United States Patent 6377945 B1, April 23 2002 Risvik KM (2002) Search system and method for retrieval of data, and the use thereof in a search engine. United States Patent 6377945 B1, April 23 2002
37.
Zurück zum Zitat Sankoff D, Kruskal JB (1983) Time warps, string edits and macromolecules: the theory and practice of sequence comparison. Addison-Wesley, Reading Sankoff D, Kruskal JB (1983) Time warps, string edits and macromolecules: the theory and practice of sequence comparison. Addison-Wesley, Reading
38.
Zurück zum Zitat Schulz K, Mihov S (2002) Fast string correction with levenshtein-automata. Int J Doc Anal Recognit 5(1):67–85CrossRefMATH Schulz K, Mihov S (2002) Fast string correction with levenshtein-automata. Int J Doc Anal Recognit 5(1):67–85CrossRefMATH
39.
Zurück zum Zitat Shang H, Merrettal T (1996) Tries for approximate string matching. IEEE Trans Knowl Data Eng 8(4):540–547CrossRef Shang H, Merrettal T (1996) Tries for approximate string matching. IEEE Trans Knowl Data Eng 8(4):540–547CrossRef
40.
Zurück zum Zitat Stephen GA (1989) String searching. Prentice-Hall, Englewood Cliffs Stephen GA (1989) String searching. Prentice-Hall, Englewood Cliffs
41.
Zurück zum Zitat Stephen GA (2000) String searching algorithms, vol 6. Lecture Notes Series on Computing, World Scientific, Singapore Stephen GA (2000) String searching algorithms, vol 6. Lecture Notes Series on Computing, World Scientific, Singapore
43.
Zurück zum Zitat Viterbi AJ (1967) Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans Inf Theory 13:260–269CrossRefMATH Viterbi AJ (1967) Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans Inf Theory 13:260–269CrossRefMATH
44.
Zurück zum Zitat Wagner R, Fischer A (1974) The string-to-string correction problem. J Assoc Comput Mach 21:168–173MathSciNetMATH Wagner R, Fischer A (1974) The string-to-string correction problem. J Assoc Comput Mach 21:168–173MathSciNetMATH
45.
Metadaten
Titel
Breadth-first search strategies for trie-based syntactic pattern recognition
verfasst von
B. John Oommen
Ghada Badr
Publikationsdatum
01.02.2007
Verlag
Springer-Verlag
Erschienen in
Pattern Analysis and Applications / Ausgabe 1/2007
Print ISSN: 1433-7541
Elektronische ISSN: 1433-755X
DOI
https://doi.org/10.1007/s10044-006-0032-z

Weitere Artikel der Ausgabe 1/2007

Pattern Analysis and Applications 1/2007 Zur Ausgabe

Premium Partner