Abstract
The problem of finding matches of a regular expression (RE) on a string exists in many applications, such as text editing, biosequence search, and shell commands. Existing techniques first identify candidates using substrings in the RE, then verify each of them using an automaton. These techniques become inefficient when there are many candidate occurrences that need to be verified. In this article, we propose a novel technique that prunes false negatives by utilizing negative factors, which are substrings that cannot appear in an answer. A main advantage of the technique is that it can be integrated with many existing algorithms to improve their efficiency significantly. We present a detailed description of this technique. We develop an efficient algorithm that utilizes negative factors to prune candidates, then improve it by using bit operations to process negative factors in parallel. We show that negative factors, when used with necessary factors (substrings that must appear in each answer), can achieve much better pruning power. We analyze the large number of negative factors, and develop an algorithm for finding a small number of high-quality negative factors. We conducted a thorough experimental study of this technique on real datasets, including DNA sequences, proteins, and text documents, and show significant performance improvement of the state-of-the-art tools by an order of magnitude.
- Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. 1985. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading, MA. Google ScholarDigital Library
- Sean E. Anderson. 2005. Bit twiddling hacks. http://graphics.stanford.edu/∼seander/bithacks.html.Google Scholar
- Ricardo A. Baeza-Yates and Gaston H. Gonnet. 1992. A new approach to text searching. Communications of the ACM 35, 10, 74--82. Google ScholarDigital Library
- Ricardo A. Baeza-Yates and Gaston H. Gonnet. 1996. Fast text searching for regular expressions or automaton searching on tries. Journal of the ACM 43, 6, 915--936. Google ScholarDigital Library
- Gérard Berry and Ravi Sethi. 1986. From regular expressions to deterministic automata. Theoretical Computer Science 48, 3, 117--126. Google ScholarDigital Library
- Anselm Blumer, J. Blumer, David Haussler, Ross M. McConnell, and Andrzej Ehrenfeucht. 1987. Complete inverted files for efficient text retrieval and analysis. Journal of the ACM 34, 3, 578--595. Google ScholarDigital Library
- Anne Brüggemann-Klein. 1993. Regular expressions into finite automata. Theoretical Computer Science 120, 2, 197--213. Google ScholarDigital Library
- Chia-Hsiang Chang and Robert Paige. 1997. From regular expressions to DFA’s using compressed NFA’s. Theoretical Computer Science 178, 1--2, 1--36. Google ScholarDigital Library
- Beate Commentz-Walter. 1979. A string matching algorithm fast on the average. In Proceedings of the 6th Colloquium on Automata, Languages and Programming, Hermann A. Maurer (Ed.), Lecture Notes in Computer Science, Vol. 71. Springer-Verlag, Graz, Austria, 118--132. Google ScholarDigital Library
- M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. 1994. Speeding up two strings matching algorithms. Algorithmica 12, 4, 247--267.Google ScholarCross Ref
- Simone Faro and Thierry Lecroq. 2013. The exact online string matching problem: A review of the most recent results. ACM Computing Surveys 45, 2 (2013), 94--111. Google ScholarDigital Library
- V.-M. Glushkov. 1961. The abstract theory of automata. Russian Mathematical Surveys 16, 5, 1--53. DOI:http://dx.doi.org/10.1070/RM1961v016n05ABEH004112Google ScholarCross Ref
- David Harel. 1999. Factor Oracle of a Set of Words. Technical report. Université de Marne-la-Vallée.Google Scholar
- John E. Hopcroft and Jeffrey D. Ullman. 2000. Introduction to Automata Theory, Languages and Computation, Second Edition (2nd ed.). Addison-Wesley, Reading, MA. Google ScholarDigital Library
- L. F. Kolakowski, J. Leunissen, and J. E. Smith. 1992. Prosearch: Fast searching of protein sequences with regular expression patterns related to protein structure and function. BioTechniques 13, 6, 919--921.Google Scholar
- T. W. Lam, W. K. Sung, S. L. Tam, C. K. Wong, and S. M. Yiu. 2008. Compressed indexing and local alignment of DNA. Bioinformatics 24, 6, 791--797. Google ScholarDigital Library
- Mehryar Mohri. 1997. String matching with automata. Nordic Journal of Computing 4, 2, 217--231.Google Scholar
- Eugene W. Myers. 1992. A four Russians algorithm for regular expression pattern matching. Journal of the ACM 39, 2, 430--448. Google ScholarDigital Library
- Gonzalo Navarro. 2001. NR-grep: A fast and flexible pattern matching tool. Software: Practice and Experience 31, 13, 1265--1312. Google ScholarDigital Library
- Gonzalo Navarro and Mathieu Raffinot. 1999. Fast regular expression search. In Proceedings of the Algorithm Engineering, 3rd International Workshop on Algorithm Engineering (WAE’99), Jeffrey Scott Vitter and Christos D. Zaroliagis (Eds.), Lecture Notes in Computer Science, Vol. 1668. Springer-Verlag, London, 198--212. Google ScholarDigital Library
- Gonzalo Navarro and Mathieu Raffinot. 2000. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics 5, 4. Google ScholarDigital Library
- Gonzalo Navarro and Mathieu Raffinot. 2001. Compact DFA representation for fast regular expression search. In Algorithm Engineering, 5th International Workshop, WAE 2001, Proceedings (Lecture Notes in Computer Science), Gerth Stølting Brodal, Daniele Frigioni, and Alberto Marchetti-Spaccamela (Eds.), Vol. 2141. Springer-Verlag, Aarhus, Denmark, 1--12. Google ScholarDigital Library
- Gonzalo Navarro and Mathieu Raffinot. 2004. New techniques for regular expression searching. Algorithmica 41, 2, 89--116.Google ScholarCross Ref
- Milan Simánek. 1998. The factor automaton. In Proceedings of the Prague Stringology Club Workshop, Jan Holub and Milan Simánek (Eds.). Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University, Prague, Czech Republic, 102--106.Google Scholar
- R. Staden. 1991. Screening protein and nucleic acid sequences against libraries of patterns. DNA Sequence 1, 6, 367--374.Google ScholarCross Ref
- Ken Thompson. 1968. Regular expression search algorithm. Communications of the ACM 11, 6, 419--422. DOI:http://dx.doi.org/10.1145/363347.363387 Google ScholarDigital Library
- Noriyoshi Uratani and Masayuki Takeda. 1993. A fast string-searching algorithm for multiple patterns. Information Processing and Management 29, 6, 775--792. Google ScholarDigital Library
- Bruce W. Watson. 2003. A new regular grammar pattern matching algorithm. Theoretical Computer Science 1--3, 299, 509--521. Google ScholarDigital Library
- Sun Wu and Udi Manber. 1992. Fast text searching allowing errors. Communications of the ACM 35, 10, 83--91. Google ScholarDigital Library
- Xiaochun Yang, Bin Wang, Tao Qiu, Yaoshu Wang, and Chen Li. 2013. Improving regular-expression matching on strings using negative factors. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD, Kenneth A. Ross, Divesh Srivastava, and Dimitris Papadias (Eds.). ACM, New York, NY, 361--372. Google ScholarDigital Library
Index Terms
- Negative Factor: Improving Regular-Expression Matching in Strings
Recommendations
Improving regular-expression matching on strings using negative factors
SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of DataThe problem of finding matches of a regular expression (RE) on a string exists in many applications such as text editing, biosequence search, and shell commands. Existing techniques first identify candidates using substrings in the RE, then verify each ...
Parameterized longest previous factor
Given a string W, the longest previous factor (LPF) problem is to determine the maximum length of a previously occurring factor for each suffix occurring in W. The LPF problem is defined for traditional strings exclusively from the constant alphabet @S. ...
Parameterized longest previous factor
IWOCA'11: Proceedings of the 22nd international conference on Combinatorial AlgorithmsThe longest previous factor (LPF) problem is defined for traditional strings exclusively from the constant alphabet Σ. A parameterized string (p-string) is a sophisticated string composed of symbols from a constant alphabet Σ and a parameter alphabet Π. ...
Comments