skip to main content
research-article

Negative Factor: Improving Regular-Expression Matching in Strings

Published:20 January 2016Publication History
Skip Abstract Section

Abstract

The problem of finding matches of a regular expression (RE) on a string exists in many applications, such as text editing, biosequence search, and shell commands. Existing techniques first identify candidates using substrings in the RE, then verify each of them using an automaton. These techniques become inefficient when there are many candidate occurrences that need to be verified. In this article, we propose a novel technique that prunes false negatives by utilizing negative factors, which are substrings that cannot appear in an answer. A main advantage of the technique is that it can be integrated with many existing algorithms to improve their efficiency significantly. We present a detailed description of this technique. We develop an efficient algorithm that utilizes negative factors to prune candidates, then improve it by using bit operations to process negative factors in parallel. We show that negative factors, when used with necessary factors (substrings that must appear in each answer), can achieve much better pruning power. We analyze the large number of negative factors, and develop an algorithm for finding a small number of high-quality negative factors. We conducted a thorough experimental study of this technique on real datasets, including DNA sequences, proteins, and text documents, and show significant performance improvement of the state-of-the-art tools by an order of magnitude.

References

  1. Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. 1985. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading, MA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Sean E. Anderson. 2005. Bit twiddling hacks. http://graphics.stanford.edu/∼seander/bithacks.html.Google ScholarGoogle Scholar
  3. Ricardo A. Baeza-Yates and Gaston H. Gonnet. 1992. A new approach to text searching. Communications of the ACM 35, 10, 74--82. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Ricardo A. Baeza-Yates and Gaston H. Gonnet. 1996. Fast text searching for regular expressions or automaton searching on tries. Journal of the ACM 43, 6, 915--936. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Gérard Berry and Ravi Sethi. 1986. From regular expressions to deterministic automata. Theoretical Computer Science 48, 3, 117--126. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Anselm Blumer, J. Blumer, David Haussler, Ross M. McConnell, and Andrzej Ehrenfeucht. 1987. Complete inverted files for efficient text retrieval and analysis. Journal of the ACM 34, 3, 578--595. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Anne Brüggemann-Klein. 1993. Regular expressions into finite automata. Theoretical Computer Science 120, 2, 197--213. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Chia-Hsiang Chang and Robert Paige. 1997. From regular expressions to DFA’s using compressed NFA’s. Theoretical Computer Science 178, 1--2, 1--36. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Beate Commentz-Walter. 1979. A string matching algorithm fast on the average. In Proceedings of the 6th Colloquium on Automata, Languages and Programming, Hermann A. Maurer (Ed.), Lecture Notes in Computer Science, Vol. 71. Springer-Verlag, Graz, Austria, 118--132. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. 1994. Speeding up two strings matching algorithms. Algorithmica 12, 4, 247--267.Google ScholarGoogle ScholarCross RefCross Ref
  11. Simone Faro and Thierry Lecroq. 2013. The exact online string matching problem: A review of the most recent results. ACM Computing Surveys 45, 2 (2013), 94--111. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. V.-M. Glushkov. 1961. The abstract theory of automata. Russian Mathematical Surveys 16, 5, 1--53. DOI:http://dx.doi.org/10.1070/RM1961v016n05ABEH004112Google ScholarGoogle ScholarCross RefCross Ref
  13. David Harel. 1999. Factor Oracle of a Set of Words. Technical report. Université de Marne-la-Vallée.Google ScholarGoogle Scholar
  14. John E. Hopcroft and Jeffrey D. Ullman. 2000. Introduction to Automata Theory, Languages and Computation, Second Edition (2nd ed.). Addison-Wesley, Reading, MA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. L. F. Kolakowski, J. Leunissen, and J. E. Smith. 1992. Prosearch: Fast searching of protein sequences with regular expression patterns related to protein structure and function. BioTechniques 13, 6, 919--921.Google ScholarGoogle Scholar
  16. T. W. Lam, W. K. Sung, S. L. Tam, C. K. Wong, and S. M. Yiu. 2008. Compressed indexing and local alignment of DNA. Bioinformatics 24, 6, 791--797. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Mehryar Mohri. 1997. String matching with automata. Nordic Journal of Computing 4, 2, 217--231.Google ScholarGoogle Scholar
  18. Eugene W. Myers. 1992. A four Russians algorithm for regular expression pattern matching. Journal of the ACM 39, 2, 430--448. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Gonzalo Navarro. 2001. NR-grep: A fast and flexible pattern matching tool. Software: Practice and Experience 31, 13, 1265--1312. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Gonzalo Navarro and Mathieu Raffinot. 1999. Fast regular expression search. In Proceedings of the Algorithm Engineering, 3rd International Workshop on Algorithm Engineering (WAE’99), Jeffrey Scott Vitter and Christos D. Zaroliagis (Eds.), Lecture Notes in Computer Science, Vol. 1668. Springer-Verlag, London, 198--212. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Gonzalo Navarro and Mathieu Raffinot. 2000. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics 5, 4. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Gonzalo Navarro and Mathieu Raffinot. 2001. Compact DFA representation for fast regular expression search. In Algorithm Engineering, 5th International Workshop, WAE 2001, Proceedings (Lecture Notes in Computer Science), Gerth Stølting Brodal, Daniele Frigioni, and Alberto Marchetti-Spaccamela (Eds.), Vol. 2141. Springer-Verlag, Aarhus, Denmark, 1--12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Gonzalo Navarro and Mathieu Raffinot. 2004. New techniques for regular expression searching. Algorithmica 41, 2, 89--116.Google ScholarGoogle ScholarCross RefCross Ref
  24. Milan Simánek. 1998. The factor automaton. In Proceedings of the Prague Stringology Club Workshop, Jan Holub and Milan Simánek (Eds.). Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University, Prague, Czech Republic, 102--106.Google ScholarGoogle Scholar
  25. R. Staden. 1991. Screening protein and nucleic acid sequences against libraries of patterns. DNA Sequence 1, 6, 367--374.Google ScholarGoogle ScholarCross RefCross Ref
  26. Ken Thompson. 1968. Regular expression search algorithm. Communications of the ACM 11, 6, 419--422. DOI:http://dx.doi.org/10.1145/363347.363387 Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Noriyoshi Uratani and Masayuki Takeda. 1993. A fast string-searching algorithm for multiple patterns. Information Processing and Management 29, 6, 775--792. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Bruce W. Watson. 2003. A new regular grammar pattern matching algorithm. Theoretical Computer Science 1--3, 299, 509--521. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Sun Wu and Udi Manber. 1992. Fast text searching allowing errors. Communications of the ACM 35, 10, 83--91. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Xiaochun Yang, Bin Wang, Tao Qiu, Yaoshu Wang, and Chen Li. 2013. Improving regular-expression matching on strings using negative factors. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD, Kenneth A. Ross, Divesh Srivastava, and Dimitris Papadias (Eds.). ACM, New York, NY, 361--372. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Negative Factor: Improving Regular-Expression Matching in Strings

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM Transactions on Database Systems
        ACM Transactions on Database Systems  Volume 40, Issue 4
        Special Issue: Invited 2014 PODS and EDBT Revised Articles
        February 2016
        248 pages
        ISSN:0362-5915
        EISSN:1557-4644
        DOI:10.1145/2866579
        Issue’s Table of Contents

        Copyright © 2016 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 20 January 2016
        • Accepted: 1 September 2015
        • Revised: 1 August 2015
        • Received: 1 January 2015
        Published in tods Volume 40, Issue 4

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article
        • Research
        • Refereed

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader