Skip to main content
Top

2017 | OriginalPaper | Chapter

Efficient Regular Expression Matching on Compressed Strings

Authors : Yutong Han, Bin Wang, Xiaochun Yang, Huaijie Zhu

Published in: Database Systems for Advanced Applications

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Existing methods for regular expression matching on LZ78 compressed strings do not perform efficiently. Moreover, LZ78 compression has some shortcomings, such as high compression ratio and slower decompression speed than LZ77 (a variant of LZ78). In this paper, we study regular expression matching on LZ77 compressed strings. To address this problem, we propose an efficient algorithm, namely, RELZ, utilizing the positive factors, i.e., a prefix and a suffix, and negative factors (Negative factors are substrings that cannot appear in an answer.) of the regular expression to prune the candidates. For the sake of quickly locating these two kinds of factors on the compressed string without decompression, we design a variant suffix trie index, called SSLZ. In addition, we construct bitmaps for factors of regular expression to detect potential region and propose block filtering to reduce candidates. At last, we conduct a comprehensive performance evaluation using five real datasets to validate our ideas and the proposed algorithms. The experimental result shows that our RELZ algorithm outperforms the existing algorithms significantly.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Footnotes
1
\(select_{b}(B,\; i) \text { is the position of } i\text {-th bit} b \text {matchings in } B \).
 
Literature
1.
go back to reference Bille, P., Fagerberg, R., Gortz, I.L.: Improved approximate string matching and regular expression matching on Ziv-Lempel compressed texts. In: Proceedings of the 18th Annual Conference on Combinatorial Pattern Matching, pp. 52–62 (2007) Bille, P., Fagerberg, R., Gortz, I.L.: Improved approximate string matching and regular expression matching on Ziv-Lempel compressed texts. In: Proceedings of the 18th Annual Conference on Combinatorial Pattern Matching, pp. 52–62 (2007)
2.
go back to reference Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: LZ77-based self-indexing with faster pattern matching. In: Pardo, A., Viola, A. (eds.) LATIN 2014. LNCS, vol. 8392, pp. 731–742. Springer, Heidelberg (2014). doi:10.1007/978-3-642-54423-1_63 CrossRef Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: LZ77-based self-indexing with faster pattern matching. In: Pardo, A., Viola, A. (eds.) LATIN 2014. LNCS, vol. 8392, pp. 731–742. Springer, Heidelberg (2014). doi:10.​1007/​978-3-642-54423-1_​63 CrossRef
3.
go back to reference Gonzlez, R., Grabowski, S., Mkinen, V., Navarro, G.: Practical implementation of rank and select queries, pp. 27–38 (2005) Gonzlez, R., Grabowski, S., Mkinen, V., Navarro, G.: Practical implementation of rank and select queries, pp. 27–38 (2005)
5.
go back to reference Li, Z., Wang, H., Shao, W., Li, J., Gao, H.: Repairing data through regular expressions. Proc. VLDB Endow. 9(5), 432–443 (2016)CrossRef Li, Z., Wang, H., Shao, W., Li, J., Gao, H.: Repairing data through regular expressions. Proc. VLDB Endow. 9(5), 432–443 (2016)CrossRef
6.
go back to reference Navarro, G.: NR-grep: a fast and flexible pattern-matching tool. Softw. Pract. Exp. 31(13), 1265–1312 (2001)CrossRefMATH Navarro, G.: NR-grep: a fast and flexible pattern-matching tool. Softw. Pract. Exp. 31(13), 1265–1312 (2001)CrossRefMATH
9.
go back to reference Navarro, G., Raffinot, M.: Compact DFA representation for fast regular expression search. In: Brodal, G.S., Frigioni, D., Marchetti-Spaccamela, A. (eds.) WAE 2001. LNCS, vol. 2141, pp. 1–13. Springer, Heidelberg (2001). doi:10.1007/3-540-44688-5_1 CrossRef Navarro, G., Raffinot, M.: Compact DFA representation for fast regular expression search. In: Brodal, G.S., Frigioni, D., Marchetti-Spaccamela, A. (eds.) WAE 2001. LNCS, vol. 2141, pp. 1–13. Springer, Heidelberg (2001). doi:10.​1007/​3-540-44688-5_​1 CrossRef
10.
go back to reference Schneeberger, K., Hagmann, J., Ossowski, S., Warthmann, N., Gesing, S., Kohlbacher, O., Weigel, D.: Simultaneous alignment of short reads against multiple genomes. Genome Biol. 10(9), R98 (2009)CrossRef Schneeberger, K., Hagmann, J., Ossowski, S., Warthmann, N., Gesing, S., Kohlbacher, O., Weigel, D.: Simultaneous alignment of short reads against multiple genomes. Genome Biol. 10(9), R98 (2009)CrossRef
11.
go back to reference Thormpson, K.: Regular expression search algorithm. Commun. ACM 11(6), 419–422 (1968)CrossRef Thormpson, K.: Regular expression search algorithm. Commun. ACM 11(6), 419–422 (1968)CrossRef
12.
go back to reference Wu, S.: Fast text searching: allowing errors. Commun. ACM 35(10), 83–91 (1992)CrossRef Wu, S.: Fast text searching: allowing errors. Commun. ACM 35(10), 83–91 (1992)CrossRef
13.
go back to reference Yang, X., Qiu, T., Wang, B., Zheng, B., Wang, Y., Li, C.: Negative factor: improving regular-expression matching in strings. ACM Trans. Database Syst. 40(4), 1–46 (2016)MathSciNetCrossRef Yang, X., Qiu, T., Wang, B., Zheng, B., Wang, Y., Li, C.: Negative factor: improving regular-expression matching in strings. ACM Trans. Database Syst. 40(4), 1–46 (2016)MathSciNetCrossRef
14.
go back to reference Yang, X., Wang, B., Li, C., Wang, J.: Efficient direct search on compressed genomic data. In: 2013 IEEE 29th International Conference on Data Engineering (ICDE), pp. 961–972 (2013) Yang, X., Wang, B., Li, C., Wang, J.: Efficient direct search on compressed genomic data. In: 2013 IEEE 29th International Conference on Data Engineering (ICDE), pp. 961–972 (2013)
15.
go back to reference Yang, X., Wang, B., Qiu, T., Wang, Y., Li, C.: Improving regular-expression matching on strings using negative factors. In: ACM SIGMOD International Conference on Management of Data, pp. 361–372 (2013) Yang, X., Wang, B., Qiu, T., Wang, Y., Li, C.: Improving regular-expression matching on strings using negative factors. In: ACM SIGMOD International Conference on Management of Data, pp. 361–372 (2013)
16.
go back to reference Zhang, M., Zhang, Y., Hou, C.: Compact representations of automata for regular expression matching. Inf. Process. Lett. 116(12), 750–756 (2016)MathSciNetCrossRefMATH Zhang, M., Zhang, Y., Hou, C.: Compact representations of automata for regular expression matching. Inf. Process. Lett. 116(12), 750–756 (2016)MathSciNetCrossRefMATH
17.
Metadata
Title
Efficient Regular Expression Matching on Compressed Strings
Authors
Yutong Han
Bin Wang
Xiaochun Yang
Huaijie Zhu
Copyright Year
2017
DOI
https://doi.org/10.1007/978-3-319-55699-4_14

Premium Partner