article

Free Access

Fast and flexible string matching by combining bit-parallelism and suffix automata

Authors:
Gonzalo Navarro

Univ. of Chile, Chile

Univ. of Chile, Chile
View Profile

,
Mathieu Raffinot

Institut Gaspard Monge

Institut Gaspard Monge
View Profile

Authors Info & Claims

ACM Journal of Experimental Algorithmics Volume 5pp 4–eshttps://doi.org/10.1145/351827.384246

Published:31 December 2000Publication History

ACM Journal of Experimental Algorithmics

Abstract

The most important features of a string matching algorithm are its efficiency and its flexibility. Efficiency has traditionally received more attention, while flexibility in the search pattern is becoming a more and more important issue. Most classical string matching algorithms are aimed at quickly finding an exact pattern in a text, being Knuth-Morris-Pratt (KMP) and the Boyer-Moore (BM) family the most famous ones. A recent development uses deterministic "suffix automata" to design new optimal string matching algorithms, e.g. BDM and TurboBDM. Flexibility has been addressed quite separately by the use of "bit-parallelism", which simulates automata in their nondeterministic form by using bits and exploiting the intrinsic parallelism inside the computer word, e.g. the Shift-Or algorithm. Those algorithms are extended to handle classes of characters and errors in the pattern and/or in the text, their drawback being their inability to skip text characters. In this paper we merge bit-parallelism and suffix automata, so that a nondeterministic suffix automaton is simulated using bit-parallelism. The resulting algorithm, called BNDM, obtains the best from both worlds. It is much simpler to implement than BDM and nearly as simple as Shift-Or. It inherits from Shift-Or the ability to handle flexible patterns and from BDM the ability to skip characters. BNDM is 30%-40% faster than BDM and up to 7 times faster than Shift-Or. When compared to the fastest existing algorithms on exact patterns (which belong to the BM family), BNDM is from 20% slower to 3 times faster, depending on the alphabet size. With respect to flexible pattern searching, BNDM is by far the fastest technique to deal with classes of characters and is competitive to search allowing errors. In particular, BNDM seems very adequate for computational biology applications, since it is the fastest algorithm to search on DNA sequences and flexible searching is an important problem in that area. As a theoretical development related to flexible pattern matching, we introduce a new automaton to recognize suffixes of patterns with classes of characters. To the best of our knowledge, this automaton has not been studied before.

Supplemental Material

Available for Download

tar

p4-navarro.tar (292.5 KB)

The software suite accompanying the article; this is a small Unix tar file, which includes the source code, a Makefile, and the test files used in the article.

vol5nbr4.ps (1.1 MB)

tar

vol5nbr4.tex.tar (1.3 MB)

References

ABRAHAMSON, K. 1987. Generalized string matching. SIAM Journal on Computing 16, 6, 1039-1051.]] Google ScholarDigital Library
BAEZA-YATES, R. 1992. Text retrieval: Theory and practice. In 12th IFIP World Computer Congress, Volume I (Sept. 1992), pp. 465-476. Elsevier Science.]] Google ScholarDigital Library
BAEZA-YATES, R. AND GONNET, G. 1992. A new approach to text searching. CACM 35, 10 (Oct.), 74-82.]] Google ScholarDigital Library
BAEZA-YATES, R. AND GONNET, G. 1994. Fast string matching with mismatches. Information and Computation 108, 2, 187-199.]] Google ScholarDigital Library
BAEZA-YATES, R. AND NAVARRO, G. 1999. Faster approximate string matching. Algorithmica 23, 2, 127-158.]]Google ScholarCross Ref
BAEZA-YATES, R. AND PERLEBERG, C. 1992. Fast and practical approximate pattern matching. In Proc. CPM'92 (1992), pp. 185-192. Springer-Verlag. LNCS 644.]] Google Scholar
BLUMER, A., EHRENFEUCHT, A., AND HAUSSLER, D. 1989. Average sizes of suffix trees and dawgs. Discrete Applied Mathematics 24, 1, 37-45.]]Google ScholarCross Ref
BOYER, R. S. AND MOORE, J. S. 1977. A fast string searching algorithm. Commun. ACM 20, 10, 762-772.]] Google ScholarDigital Library
CROCHEMORE, M. 1986. Transducers and repetitions. Theoretical Computer Science 45, 1, 63-86.]] Google ScholarDigital Library
CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W., AND RYTTER, W. 1993. Fast practical multi-pattern matching. Rapport 93-3, Institut Gaspard Monge, Universiteacute; de Marne la Valleacute;e.]]Google Scholar
CROCHEMORE, M. AND RYTTER, W. 1994. Text algorithms. Oxford University Press.]] Google Scholar
CZUMAJ, A., CROCHEMORE, M., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W., AND RYTTER, W. 1994. Speeding up two string-matching algorithms. Algorithmica 12, 247-267.]]Google ScholarCross Ref
EL-MABROUK, N. AND CROCHEMORE, M. 1996. Boyer-Moore strategy to efficient approximate string matching. In Proc. of CPM'96, Number 1075 in Lecture Notes in Computer Science (1996), pp. 24-38. Springer-Verlag, Berlin.]] Google Scholar
FISCHER, M. J. AND PATERSON, M. 1974. String matching and other products. In R. M. KARP Ed., Proceedings SIAM-AMS Complexity of Computation (Providence, RI, 1974), pp. 113-125.]]Google Scholar
HORSPOOL, R.N. 1980. Practical fast searching in strings. Software Practice and Experience 10, 501-506.]]Google ScholarCross Ref
JOKINEN, P., TARHIO, J., AND UKKONEN, E. 1996. A comparison of approximate string matching algorithms. Software Practice and Experience 26, 12, 1439-1458.]] Google ScholarDigital Library
KNUTH, D. E., MORRIS, J. H., AND PRATT, V.R. 1977. Fast pattern matching in strings. SIAM Journal on Computing 6, 1, 323-350.]]Google ScholarCross Ref
LECROQ, T. 1992. Recherches de mot. Ph.D. thesis, Universiteacute; d'Orleacute;ans, France.]]Google Scholar
MYERS, G. 1999. A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM 46, 3, 395-415.]] Google ScholarDigital Library
NAVARRO, G. 1997. A partial deterministic automaton for approximate string matching. In Proc. of WSP'97 (1997), pp. 112-124. Carleton University Press.]]Google Scholar
NAVARRO, G. 1998. Approximate Text Searching. Ph.D. thesis, Department of Computer Science, University of Chile. Tech. Report TR/DCC-98-14. ftp://ftp.dcc.uchile.cl/- pub/users/gnavarro/thesis98.ps.gz.]]Google Scholar
NAVARRO, G. 2000a. A guided tour to approximate string matching. ACM Computing Surveys . To appear.]] Google Scholar
NAVARRO, G. 2000b. Nrgrep: a fast and flexible pattern matching tool. Technical Report TR/DCC-2000-3 (Aug.), Dept. of Computer Science, Univ. of Chile. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/nrgrep.ps.gz.]]Google Scholar
NAVARRO, G. AND BAEZA-YATES, R. 1999. Very fast and simple approximate string matching. Information Processing Letters 72, 65-70.]] Google ScholarDigital Library
NAVARRO, G. AND RAFFINOT, M. 1998. A bit-parallel approach to suffix automata: Fast extended string matching. In Proc. CPM'98, LNCS v. 1448 (1998), pp. 14-33. Springer-Verlag.]] Google Scholar
NAVARRO, G. AND RAFFINOT, M. 1999. Fast regular expression search. In J. VITTER AND C. ZAROLIAGIS Eds., Proceedings of the 3rd Workshop on Algorithm Engineering, Number 1668 in Lecture Notes in Computer Science (1999), pp. 199-213. Springer-Verlag, Berlin.]] Google Scholar
NAVARRO, G AND RAFFINOT, M. 2000. Fast and simple character classes and bounded gaps pattern matching, with application to protein searching. In ACM RECOMB 2001 (2000). To appear.]] Google Scholar
PINTER, R.Y. 1985. Efficient string matching with don't care pattern. In A. APOSTOLICO AND Z. GALIL Eds., Combinatorial Algorithms on Words, pp. 11-29. Springer-Verlag, Berlin.]]Google Scholar
RAFFINOT, M. 1997a. Asymptotic estimation of the average number of terminal states in dawgs. In Proc. WSP'97 (1997), pp. 140-148. Carleton University Press.]]Google Scholar
RAFFINOT, M. 1997b. On the multi backward dawg matching algorithm (MultiBDM). In Proc. WSP'97 (1997), pp. 149-165. Carleton University Press.]]Google Scholar
SUNDAY, D. 1990. A very fast substring search algorithm. CACM 33, 8 (Aug.), 132-142.]] Google ScholarDigital Library
TARHIO, J. AND UKKONEN, E. 1993. Approximate Boyer-Moore string matching. SIAM J. on Computing 22, 2, 243-260.]] Google ScholarDigital Library
WU, S. AND MANBER, U. 1992. Fast text searching allowing errors. CACM 35, 10 (Oct.), 83-91.]] Google ScholarDigital Library
WU, S., MANBER, U., AND MYERS, E. 1996. A sub-quadratic algorithm for approximate limited expression matching. Algorithmica 15, 1, 50-67.]]Google ScholarDigital Library
YAO, A. 1979. The complexity of pattern matching for a random string. SIAM J. on Computing 8, 368-387.]]Google ScholarCross Ref

Index Terms

Fast and flexible string matching by combining bit-parallelism and suffix automata
1. Theory of computation

Recommendations

A fast bit-parallel multi-patterns string matching algorithm for biological sequences
ISB '10: Proceedings of the International Symposium on Biocomputing

The problem of searching occurrences of a pattern P[0...m-1] in the text T[0...n-1>with m ≤ n, where the symbols of P and T are drawn from some alphabet Σ of size σ, is called exact string matching problem. In the present day, pattern matching is a ...
Read More
Efficient string matching on graphics processing unit using bit-parallelism

String matching and regular expression matching have many applications in search engines, antivirus systems, intrusion detection systems, and others. Bit-parallelism is a widely used technique in the design of efficient string matching algorithms. In ...
Read More
Approximate string matching with suffix automata

Theapproximate string matching problem is, given a text string, a pattern string, and an integerk, to find in the text all approximate occurrences of the pattern. An approximate occurrence means a substring of the text with edit distance at mostk from ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Journal of Experimental Algorithmics Volume 5, Issue
2000
418 pages
ISSN:1084-6654
EISSN:1084-6654
DOI:10.1145/351827
Editor:
Bernard M. E. Moret
Issue’s Table of Contents
Copyright © 2000 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 31 December 2000
Published in jea Volume 5, Issue
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 96
  Total Citations
  View Citations
- 1,119
  Total Downloads
- Downloads (Last 12 months)87
- Downloads (Last 6 weeks)7
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Fast and flexible string matching by combining bit-parallelism and suffix automata

ACM Journal of Experimental Algorithmics

Abstract

Supplemental Material

Available for Download

References

Cited By

Index Terms

Recommendations

A fast bit-parallel multi-patterns string matching algorithm for biological sequences

Efficient string matching on graphics processing unit using bit-parallelism

Approximate string matching with suffix automata

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Fast and flexible string matching by combining bit-parallelism and suffix automata

ACM Journal of Experimental Algorithmics

Abstract

Supplemental Material

Available for Download

References

Cited By

Index Terms

Recommendations

A fast bit-parallel multi-patterns string matching algorithm for biological sequences

Efficient string matching on graphics processing unit using bit-parallelism

Approximate string matching with suffix automata

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media