skip to main content
article
Free Access

Fast and flexible string matching by combining bit-parallelism and suffix automata

Published:31 December 2000Publication History
Skip Abstract Section

Abstract

The most important features of a string matching algorithm are its efficiency and its flexibility. Efficiency has traditionally received more attention, while flexibility in the search pattern is becoming a more and more important issue. Most classical string matching algorithms are aimed at quickly finding an exact pattern in a text, being Knuth-Morris-Pratt (KMP) and the Boyer-Moore (BM) family the most famous ones. A recent development uses deterministic "suffix automata" to design new optimal string matching algorithms, e.g. BDM and TurboBDM. Flexibility has been addressed quite separately by the use of "bit-parallelism", which simulates automata in their nondeterministic form by using bits and exploiting the intrinsic parallelism inside the computer word, e.g. the Shift-Or algorithm. Those algorithms are extended to handle classes of characters and errors in the pattern and/or in the text, their drawback being their inability to skip text characters. In this paper we merge bit-parallelism and suffix automata, so that a nondeterministic suffix automaton is simulated using bit-parallelism. The resulting algorithm, called BNDM, obtains the best from both worlds. It is much simpler to implement than BDM and nearly as simple as Shift-Or. It inherits from Shift-Or the ability to handle flexible patterns and from BDM the ability to skip characters. BNDM is 30%-40% faster than BDM and up to 7 times faster than Shift-Or. When compared to the fastest existing algorithms on exact patterns (which belong to the BM family), BNDM is from 20% slower to 3 times faster, depending on the alphabet size. With respect to flexible pattern searching, BNDM is by far the fastest technique to deal with classes of characters and is competitive to search allowing errors. In particular, BNDM seems very adequate for computational biology applications, since it is the fastest algorithm to search on DNA sequences and flexible searching is an important problem in that area. As a theoretical development related to flexible pattern matching, we introduce a new automaton to recognize suffixes of patterns with classes of characters. To the best of our knowledge, this automaton has not been studied before.

Skip Supplemental Material Section

Supplemental Material

References

  1. ABRAHAMSON, K. 1987. Generalized string matching. SIAM Journal on Computing 16, 6, 1039-1051.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. BAEZA-YATES, R. 1992. Text retrieval: Theory and practice. In 12th IFIP World Computer Congress, Volume I (Sept. 1992), pp. 465-476. Elsevier Science.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. BAEZA-YATES, R. AND GONNET, G. 1992. A new approach to text searching. CACM 35, 10 (Oct.), 74-82.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. BAEZA-YATES, R. AND GONNET, G. 1994. Fast string matching with mismatches. Information and Computation 108, 2, 187-199.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. BAEZA-YATES, R. AND NAVARRO, G. 1999. Faster approximate string matching. Algorithmica 23, 2, 127-158.]]Google ScholarGoogle ScholarCross RefCross Ref
  6. BAEZA-YATES, R. AND PERLEBERG, C. 1992. Fast and practical approximate pattern matching. In Proc. CPM'92 (1992), pp. 185-192. Springer-Verlag. LNCS 644.]] Google ScholarGoogle Scholar
  7. BLUMER, A., EHRENFEUCHT, A., AND HAUSSLER, D. 1989. Average sizes of suffix trees and dawgs. Discrete Applied Mathematics 24, 1, 37-45.]]Google ScholarGoogle ScholarCross RefCross Ref
  8. BOYER, R. S. AND MOORE, J. S. 1977. A fast string searching algorithm. Commun. ACM 20, 10, 762-772.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. CROCHEMORE, M. 1986. Transducers and repetitions. Theoretical Computer Science 45, 1, 63-86.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W., AND RYTTER, W. 1993. Fast practical multi-pattern matching. Rapport 93-3, Institut Gaspard Monge, Universiteacute; de Marne la Valleacute;e.]]Google ScholarGoogle Scholar
  11. CROCHEMORE, M. AND RYTTER, W. 1994. Text algorithms. Oxford University Press.]] Google ScholarGoogle Scholar
  12. CZUMAJ, A., CROCHEMORE, M., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W., AND RYTTER, W. 1994. Speeding up two string-matching algorithms. Algorithmica 12, 247-267.]]Google ScholarGoogle ScholarCross RefCross Ref
  13. EL-MABROUK, N. AND CROCHEMORE, M. 1996. Boyer-Moore strategy to efficient approximate string matching. In Proc. of CPM'96, Number 1075 in Lecture Notes in Computer Science (1996), pp. 24-38. Springer-Verlag, Berlin.]] Google ScholarGoogle Scholar
  14. FISCHER, M. J. AND PATERSON, M. 1974. String matching and other products. In R. M. KARP Ed., Proceedings SIAM-AMS Complexity of Computation (Providence, RI, 1974), pp. 113-125.]]Google ScholarGoogle Scholar
  15. HORSPOOL, R.N. 1980. Practical fast searching in strings. Software Practice and Experience 10, 501-506.]]Google ScholarGoogle ScholarCross RefCross Ref
  16. JOKINEN, P., TARHIO, J., AND UKKONEN, E. 1996. A comparison of approximate string matching algorithms. Software Practice and Experience 26, 12, 1439-1458.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. KNUTH, D. E., MORRIS, J. H., AND PRATT, V.R. 1977. Fast pattern matching in strings. SIAM Journal on Computing 6, 1, 323-350.]]Google ScholarGoogle ScholarCross RefCross Ref
  18. LECROQ, T. 1992. Recherches de mot. Ph.D. thesis, Universiteacute; d'Orleacute;ans, France.]]Google ScholarGoogle Scholar
  19. MYERS, G. 1999. A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM 46, 3, 395-415.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. NAVARRO, G. 1997. A partial deterministic automaton for approximate string matching. In Proc. of WSP'97 (1997), pp. 112-124. Carleton University Press.]]Google ScholarGoogle Scholar
  21. NAVARRO, G. 1998. Approximate Text Searching. Ph.D. thesis, Department of Computer Science, University of Chile. Tech. Report TR/DCC-98-14. ftp://ftp.dcc.uchile.cl/- pub/users/gnavarro/thesis98.ps.gz.]]Google ScholarGoogle Scholar
  22. NAVARRO, G. 2000a. A guided tour to approximate string matching. ACM Computing Surveys . To appear.]] Google ScholarGoogle Scholar
  23. NAVARRO, G. 2000b. Nrgrep: a fast and flexible pattern matching tool. Technical Report TR/DCC-2000-3 (Aug.), Dept. of Computer Science, Univ. of Chile. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/nrgrep.ps.gz.]]Google ScholarGoogle Scholar
  24. NAVARRO, G. AND BAEZA-YATES, R. 1999. Very fast and simple approximate string matching. Information Processing Letters 72, 65-70.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. NAVARRO, G. AND RAFFINOT, M. 1998. A bit-parallel approach to suffix automata: Fast extended string matching. In Proc. CPM'98, LNCS v. 1448 (1998), pp. 14-33. Springer-Verlag.]] Google ScholarGoogle Scholar
  26. NAVARRO, G. AND RAFFINOT, M. 1999. Fast regular expression search. In J. VITTER AND C. ZAROLIAGIS Eds., Proceedings of the 3rd Workshop on Algorithm Engineering, Number 1668 in Lecture Notes in Computer Science (1999), pp. 199-213. Springer-Verlag, Berlin.]] Google ScholarGoogle Scholar
  27. NAVARRO, G AND RAFFINOT, M. 2000. Fast and simple character classes and bounded gaps pattern matching, with application to protein searching. In ACM RECOMB 2001 (2000). To appear.]] Google ScholarGoogle Scholar
  28. PINTER, R.Y. 1985. Efficient string matching with don't care pattern. In A. APOSTOLICO AND Z. GALIL Eds., Combinatorial Algorithms on Words, pp. 11-29. Springer-Verlag, Berlin.]]Google ScholarGoogle Scholar
  29. RAFFINOT, M. 1997a. Asymptotic estimation of the average number of terminal states in dawgs. In Proc. WSP'97 (1997), pp. 140-148. Carleton University Press.]]Google ScholarGoogle Scholar
  30. RAFFINOT, M. 1997b. On the multi backward dawg matching algorithm (MultiBDM). In Proc. WSP'97 (1997), pp. 149-165. Carleton University Press.]]Google ScholarGoogle Scholar
  31. SUNDAY, D. 1990. A very fast substring search algorithm. CACM 33, 8 (Aug.), 132-142.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. TARHIO, J. AND UKKONEN, E. 1993. Approximate Boyer-Moore string matching. SIAM J. on Computing 22, 2, 243-260.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. WU, S. AND MANBER, U. 1992. Fast text searching allowing errors. CACM 35, 10 (Oct.), 83-91.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. WU, S., MANBER, U., AND MYERS, E. 1996. A sub-quadratic algorithm for approximate limited expression matching. Algorithmica 15, 1, 50-67.]]Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. YAO, A. 1979. The complexity of pattern matching for a random string. SIAM J. on Computing 8, 368-387.]]Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Fast and flexible string matching by combining bit-parallelism and suffix automata

            Recommendations

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in

            Full Access

            • Published in

              cover image ACM Journal of Experimental Algorithmics
              ACM Journal of Experimental Algorithmics  Volume 5, Issue
              2000
              418 pages
              ISSN:1084-6654
              EISSN:1084-6654
              DOI:10.1145/351827
              Issue’s Table of Contents

              Copyright © 2000 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 31 December 2000
              Published in jea Volume 5, Issue

              Qualifiers

              • article

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader