research-article

Negative Factor: Improving Regular-Expression Matching in Strings

Authors:
Xiaochun Yang

Northeastern University, China

Northeastern University, China

0000-0002-6184-4771
View Profile

,
Tao Qiu

Northeastern University, China

Northeastern University, China
View Profile

,
Bin Wang

Northeastern University, China

Northeastern University, China
View Profile

,
Baihua Zheng

Singapore Management University, Singapore

Singapore Management University, Singapore
View Profile

,
Yaoshu Wang

Northeastern University, China

Northeastern University, China
View Profile

,
Chen Li

University of California, Irvine

University of California, Irvine
View Profile

Authors Info & Claims

ACM Transactions on Database Systems Volume 40 Issue 4Article No.: 25pp 1–46https://doi.org/10.1145/2847525

Published:20 January 2016Publication History

ACM Transactions on Database Systems

Abstract

The problem of finding matches of a regular expression (RE) on a string exists in many applications, such as text editing, biosequence search, and shell commands. Existing techniques first identify candidates using substrings in the RE, then verify each of them using an automaton. These techniques become inefficient when there are many candidate occurrences that need to be verified. In this article, we propose a novel technique that prunes false negatives by utilizing negative factors, which are substrings that cannot appear in an answer. A main advantage of the technique is that it can be integrated with many existing algorithms to improve their efficiency significantly. We present a detailed description of this technique. We develop an efficient algorithm that utilizes negative factors to prune candidates, then improve it by using bit operations to process negative factors in parallel. We show that negative factors, when used with necessary factors (substrings that must appear in each answer), can achieve much better pruning power. We analyze the large number of negative factors, and develop an algorithm for finding a small number of high-quality negative factors. We conducted a thorough experimental study of this technique on real datasets, including DNA sequences, proteins, and text documents, and show significant performance improvement of the state-of-the-art tools by an order of magnitude.

References

Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. 1985. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading, MA. Google ScholarDigital Library
Sean E. Anderson. 2005. Bit twiddling hacks. http://graphics.stanford.edu/&sim;seander/bithacks.html.Google Scholar
Ricardo A. Baeza-Yates and Gaston H. Gonnet. 1992. A new approach to text searching. Communications of the ACM 35, 10, 74--82. Google ScholarDigital Library
Ricardo A. Baeza-Yates and Gaston H. Gonnet. 1996. Fast text searching for regular expressions or automaton searching on tries. Journal of the ACM 43, 6, 915--936. Google ScholarDigital Library
Gérard Berry and Ravi Sethi. 1986. From regular expressions to deterministic automata. Theoretical Computer Science 48, 3, 117--126. Google ScholarDigital Library
Anselm Blumer, J. Blumer, David Haussler, Ross M. McConnell, and Andrzej Ehrenfeucht. 1987. Complete inverted files for efficient text retrieval and analysis. Journal of the ACM 34, 3, 578--595. Google ScholarDigital Library
Anne Brüggemann-Klein. 1993. Regular expressions into finite automata. Theoretical Computer Science 120, 2, 197--213. Google ScholarDigital Library
Chia-Hsiang Chang and Robert Paige. 1997. From regular expressions to DFA’s using compressed NFA’s. Theoretical Computer Science 178, 1--2, 1--36. Google ScholarDigital Library
Beate Commentz-Walter. 1979. A string matching algorithm fast on the average. In Proceedings of the 6th Colloquium on Automata, Languages and Programming, Hermann A. Maurer (Ed.), Lecture Notes in Computer Science, Vol. 71. Springer-Verlag, Graz, Austria, 118--132. Google ScholarDigital Library
M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. 1994. Speeding up two strings matching algorithms. Algorithmica 12, 4, 247--267.Google ScholarCross Ref
Simone Faro and Thierry Lecroq. 2013. The exact online string matching problem: A review of the most recent results. ACM Computing Surveys 45, 2 (2013), 94--111. Google ScholarDigital Library
V.-M. Glushkov. 1961. The abstract theory of automata. Russian Mathematical Surveys 16, 5, 1--53. DOI:http://dx.doi.org/10.1070/RM1961v016n05ABEH004112Google ScholarCross Ref
David Harel. 1999. Factor Oracle of a Set of Words. Technical report. Université de Marne-la-Vallée.Google Scholar
John E. Hopcroft and Jeffrey D. Ullman. 2000. Introduction to Automata Theory, Languages and Computation, Second Edition (2nd ed.). Addison-Wesley, Reading, MA. Google ScholarDigital Library
L. F. Kolakowski, J. Leunissen, and J. E. Smith. 1992. Prosearch: Fast searching of protein sequences with regular expression patterns related to protein structure and function. BioTechniques 13, 6, 919--921.Google Scholar
T. W. Lam, W. K. Sung, S. L. Tam, C. K. Wong, and S. M. Yiu. 2008. Compressed indexing and local alignment of DNA. Bioinformatics 24, 6, 791--797. Google ScholarDigital Library
Mehryar Mohri. 1997. String matching with automata. Nordic Journal of Computing 4, 2, 217--231.Google Scholar
Eugene W. Myers. 1992. A four Russians algorithm for regular expression pattern matching. Journal of the ACM 39, 2, 430--448. Google ScholarDigital Library
Gonzalo Navarro. 2001. NR-grep: A fast and flexible pattern matching tool. Software: Practice and Experience 31, 13, 1265--1312. Google ScholarDigital Library
Gonzalo Navarro and Mathieu Raffinot. 1999. Fast regular expression search. In Proceedings of the Algorithm Engineering, 3rd International Workshop on Algorithm Engineering (WAE’99), Jeffrey Scott Vitter and Christos D. Zaroliagis (Eds.), Lecture Notes in Computer Science, Vol. 1668. Springer-Verlag, London, 198--212. Google ScholarDigital Library
Gonzalo Navarro and Mathieu Raffinot. 2000. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics 5, 4. Google ScholarDigital Library
Gonzalo Navarro and Mathieu Raffinot. 2001. Compact DFA representation for fast regular expression search. In Algorithm Engineering, 5th International Workshop, WAE 2001, Proceedings (Lecture Notes in Computer Science), Gerth Stølting Brodal, Daniele Frigioni, and Alberto Marchetti-Spaccamela (Eds.), Vol. 2141. Springer-Verlag, Aarhus, Denmark, 1--12. Google ScholarDigital Library
Gonzalo Navarro and Mathieu Raffinot. 2004. New techniques for regular expression searching. Algorithmica 41, 2, 89--116.Google ScholarCross Ref
Milan Simánek. 1998. The factor automaton. In Proceedings of the Prague Stringology Club Workshop, Jan Holub and Milan Simánek (Eds.). Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University, Prague, Czech Republic, 102--106.Google Scholar
R. Staden. 1991. Screening protein and nucleic acid sequences against libraries of patterns. DNA Sequence 1, 6, 367--374.Google ScholarCross Ref
Ken Thompson. 1968. Regular expression search algorithm. Communications of the ACM 11, 6, 419--422. DOI:http://dx.doi.org/10.1145/363347.363387 Google ScholarDigital Library
Noriyoshi Uratani and Masayuki Takeda. 1993. A fast string-searching algorithm for multiple patterns. Information Processing and Management 29, 6, 775--792. Google ScholarDigital Library
Bruce W. Watson. 2003. A new regular grammar pattern matching algorithm. Theoretical Computer Science 1--3, 299, 509--521. Google ScholarDigital Library
Sun Wu and Udi Manber. 1992. Fast text searching allowing errors. Communications of the ACM 35, 10, 83--91. Google ScholarDigital Library
Xiaochun Yang, Bin Wang, Tao Qiu, Yaoshu Wang, and Chen Li. 2013. Improving regular-expression matching on strings using negative factors. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD, Kenneth A. Ross, Divesh Srivastava, and Dimitris Papadias (Eds.). ACM, New York, NY, 361--372. Google ScholarDigital Library

Index Terms

Negative Factor: Improving Regular-Expression Matching in Strings
1. Information systems
  1. Information retrieval
    1. Specialized information retrieval
      1. Structure and multilingual text search
        Structured text search
2. Theory of computation
  1. Formal languages and automata theory
    1. Regular languages

Recommendations

Improving regular-expression matching on strings using negative factors
SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

The problem of finding matches of a regular expression (RE) on a string exists in many applications such as text editing, biosequence search, and shell commands. Existing techniques first identify candidates using substrings in the RE, then verify each ...
Read More
Parameterized longest previous factor

Given a string W, the longest previous factor (LPF) problem is to determine the maximum length of a previously occurring factor for each suffix occurring in W. The LPF problem is defined for traditional strings exclusively from the constant alphabet @S. ...
Read More
Parameterized longest previous factor
IWOCA'11: Proceedings of the 22nd international conference on Combinatorial Algorithms

The longest previous factor (LPF) problem is defined for traditional strings exclusively from the constant alphabet Σ. A parameterized string (p-string) is a sophisticated string composed of symbols from a constant alphabet Σ and a parameter alphabet Π. ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Transactions on Database Systems Volume 40, Issue 4
Special Issue: Invited 2014 PODS and EDBT Revised Articles
February 2016
248 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/2866579
Editor:
Christian S. Jensen
Aalborg University, Denmark
Issue’s Table of Contents
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 20 January 2016
- Accepted: 1 September 2015
- Revised: 1 August 2015
- Received: 1 January 2015
Published in tods Volume 40, Issue 4

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Regular expression
long sequence
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 14
  Total Citations
  View Citations
- 448
  Total Downloads
- Downloads (Last 12 months)22
- Downloads (Last 6 weeks)4
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Negative Factor: Improving Regular-Expression Matching in Strings

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

Improving regular-expression matching on strings using negative factors

Parameterized longest previous factor

Parameterized longest previous factor

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Negative Factor: Improving Regular-Expression Matching in Strings

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

Improving regular-expression matching on strings using negative factors

Parameterized longest previous factor

Parameterized longest previous factor

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media