Minimal Absent Words in a Sliding Window and Applications to On-Line Pattern Matching

Crochemore, Maxime; Héliou, Alice; Kucherov, Gregory; Mouchard, Laurent; Pissis, Solon P.; Ramusat, Yann

doi:10.1007/978-3-662-55751-8_14

Minimal Absent Words in a Sliding Window and Applications to On-Line Pattern Matching

Maxime Crochemore^15,16,
Alice Héliou¹⁷,
Gregory Kucherov¹⁶,
Laurent Mouchard¹⁸,
Solon P. Pissis¹⁵ &
…
Yann Ramusat¹⁹

Conference paper
First Online: 16 August 2017

754 Accesses
6 Citations

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 10472))

Abstract

An absent (or forbidden) word of a word y is a word that does not occur in y. It is then called minimal if all its proper factors occur in y. There exist linear-time and linear-space algorithms for computing all minimal absent words of y (Crochemore et al. in Inf Process Lett 67:111–117, 1998; Belazzougui et al. in ESA 8125:133–144, 2013; Barton et al. in BMC Bioinform 15:388, 2014). Minimal absent words are used for data compression (Crochemore et al. in Proc IEEE 88:1756–1768, 2000, Ota and Morita in Theoret Comput Sci 526:108–119, 2014) and for alignment-free sequence comparison by utilizing a metric based on minimal absent words (Chairungsee and Crochemore in Theoret Comput Sci 450:109–116, 2012). They are also used in molecular biology; for instance, three minimal absent words of the human genome were found to play a functional role in a coding region in Ebola virus genomes (Silva et al. in Bioinformatics 31:2421–2425, 2015). In this article we introduce a new application of minimal absent words for on-line pattern matching. Specifically, we present an algorithm that, given a pattern x and a text y, computes the distance between x and every window of size |x| on y. The running time is \(\mathcal {O}(\sigma |y|)\), where \(\sigma \) is the size of the alphabet. Along the way, we show an \(\mathcal {O}(\sigma |y|)\)-time and \(\mathcal {O}(\sigma |x|)\)-space algorithm to compute the minimal absent words of every window of size |x| on y, together with some new combinatorial insight on minimal absent words.

This is a preview of subscription content, log in via an institution.

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

References

Aho, A.V., Corasick, M.J.: Efficient string matching: an aid to bibliographic search. Commun. ACM 18, 333–340 (1975)
Article MathSciNet MATH Google Scholar
Almirantis, Y., Charalampopoulos, P., Gao, J., Iliopoulos, C.S., Mohamed, M., Pissis, S.P., Polychronopoulos, D.: On avoided words, absent words, and their application to biological sequence analysis. Algorithms Mol. Biol. 12(1), 5:1–5:12 (2017)
Google Scholar
Barton, C., Heliou, A., Mouchard, L., Pissis, S.P.: Linear-time computation of minimal absent words using suffix array. BMC Bioinform. 15, 11 (2014)
Article Google Scholar
Barton, C., Heliou, A., Mouchard, L., Pissis, S.P.: Parallelising the computation of minimal absent words. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K., Kitowski, J., Wiatr, K. (eds.) PPAM 2015. LNCS, vol. 9574, pp. 243–253. Springer, Cham (2016). doi:10.1007/978-3-319-32152-3_23
Chapter Google Scholar
Béal, M.-P., Mignosi, F., Restivo, A.: Minimal forbidden words and symbolic dynamics. In: Puech, C., Reischuk, R. (eds.) STACS 1996. LNCS, vol. 1046, pp. 555–566. Springer, Heidelberg (1996). doi:10.1007/3-540-60922-9_45
Chapter Google Scholar
Belazzougui, D., Cunial, F., Kärkkäinen, J., Mäkinen, V.: Versatile succinct representations of the bidirectional Burrows-wheeler transform. In: Bodlaender, H.L., Italiano, G.F. (eds.) ESA 2013. LNCS, vol. 8125, pp. 133–144. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40450-4_12
Chapter Google Scholar
Chairungsee, S., Crochemore, M.: Using minimal absent words to build phylogeny. Theoret. Comput. Sci. 450, 109–116 (2012)
Article MathSciNet MATH Google Scholar
Crochemore, M., Fici, G., Mercas, R., Pissis, S.P.: Linear-time sequence comparison using minimal absent words. In: Kranakis, E., Navarro, G., Chávez, E. (eds.) LATIN 2016. LNCS, vol. 9644, pp. 334–346. Springer, Heidelberg (2016). doi:10.1007/978-3-662-49529-2_25
Google Scholar
Crochemore, M., Hancart, C., Lecroq, T.: Algorithms on Strings. Cambridge University Press, Cambridge (2007)
Google Scholar
Crochemore, M., Mignosi, F., Restivo, A.: Automata and forbidden words. Inf. Process. Lett. 67(3), 111–117 (1998)
Article MathSciNet MATH Google Scholar
Crochemore, M., Mignosi, F., Restivo, A., Salemi, S.: Data compression using antidictonaries. Proc. IEEE 88(11), 1756–1768 (2000)
Article Google Scholar
Dömölki, B.: An algorithm for syntactical analysis. Comput. Linguist. 3, 29–46 (1964)
Google Scholar
Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: FOCS, pp. 390–398. IEEE Computer Society (2000)
Google Scholar
Fici, G.: Minimal Forbidden Words and Applications. Thèse, Université de Marne la Vallée (2006)
Google Scholar
Fujishige, Y., Tsujimaru, Y., Inenaga, S., Bannai, H., Takeda, M.: Computing DAWGs and minimal absent words in linear time for integer alphabets. In: MFCS. LIPIcs, vol. 58, pp. 38:1–38:14. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik (2016)
Google Scholar
Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)
Google Scholar
Hampikian, G., Andersen, T.L.: Absent sequences: nullomers and primes. In: PSB, pp. 355–366. World Scientific (2007)
Google Scholar
Heliou, A., Pissis, S.P., Puglisi, S.J.: emMAW: computing minimal absent words in external memory. Bioinformatics (2017)
Google Scholar
Herold, J., Kurtz, S., Giegerich, R.: Efficient computation of absent words in genomic sequences. BMC Bioinform. 9, 167 (2008)
Article Google Scholar
Knuth, D.E., Morris Jr., J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM J. Comput. 6(2), 323–350 (1977)
Article MathSciNet MATH Google Scholar
Kucherov, G., Salikhov, K., Tsur, D.: Approximate string matching using a bidirectional index. Theoret. Comput. Sci. 638, 145–158 (2016)
Article MathSciNet MATH Google Scholar
Landau, G.M., Myers, E.W., Schmidt, J.P.: Incremental string comparison. SIAM J. Comput. 27–2, 557–582 (1998)
Article MathSciNet MATH Google Scholar
Mignosi, F., Restivo, A., Sciortino, M.: Words and forbidden factors. Theoret. Comput. Sci. 273(1–2), 99–117 (2002)
Article MathSciNet MATH Google Scholar
Myers, G.: A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46(3), 395–415 (1999)
Article MathSciNet MATH Google Scholar
Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surv. 33(1), 31–88 (2001)
Article Google Scholar
Navarro, G., Baeza-Yates, R.A., Sutinen, E., Tarhio, J.: Indexing methods for approximate string matching. IEEE Data Eng. Bull. 24(4), 19–27 (2001)
Google Scholar
Navarro, G., Raffinot, M.: Flexible Pattern Matching in Strings: Practical On-line Search Algorithms for Texts and Biological Sequences. Cambridge University Press, Cambridge (2008)
Google Scholar
Ota, T., Fukae, H., Morita, H.: Dynamic construction of an antidictionary with linear complexity. Theor. Comput. Sci. 526, 108–119 (2014)
Article MathSciNet MATH Google Scholar
Ota, T., Morita, H.: On a universal antidictionary coding for stationary ergodic sources with finite alphabet. In: ISITA, pp. 294–298. IEEE (2014)
Google Scholar
Rahman, M.S., Alatabbi, A., Athar, T., Crochemore, M., Rahman, M.S.: Absent words and the (dis)similarity analysis of DNA sequences: an experimental study. BMC Bioinform. Notes 9(1), 1–8 (2016)
Google Scholar
Senft, M.: Suffix tree for a sliding window: an overview. In: WDS, pp. 41–46. Matfyzpress (2005)
Google Scholar
Silva, R.M., Pratas, D., Castro, L., Pinho, A.J., Ferreira, P.J.S.G.: Three minimal sequences found in Ebola virus genomes and absent from human DNA. Bioinformatics 31(15), 2421–2425 (2015)
Article Google Scholar
Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)
Article MathSciNet MATH Google Scholar
Wu, Z., Jiang, T., Su, W.: Efficient computation of shortest absent words in a genomic sequence. Inf. Process. Lett. 110(14–15), 596–601 (2010)
Article MathSciNet MATH Google Scholar

Download references

Author information

Authors and Affiliations

Department of Informatics, King’s College London, London, UK
Maxime Crochemore & Solon P. Pissis
CNRS & Université Paris-Est, Champs-sur-Marne, France
Maxime Crochemore & Gregory Kucherov
LIX, École Polytechnique, CNRS, INRIA, Université Paris-Saclay, Palaiseau, France
Alice Héliou
University of Rouen, LITIS EA 4108, TIBS, Rouen, France
Laurent Mouchard
DI ENS, ENS, CNRS, PSL Research University & INRIA, Paris, France
Yann Ramusat

Authors

Maxime Crochemore
View author publications
You can also search for this author in PubMed Google Scholar
Alice Héliou
View author publications
You can also search for this author in PubMed Google Scholar
Gregory Kucherov
View author publications
You can also search for this author in PubMed Google Scholar
Laurent Mouchard
View author publications
You can also search for this author in PubMed Google Scholar
Solon P. Pissis
View author publications
You can also search for this author in PubMed Google Scholar
Yann Ramusat
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Alice Héliou .

Editor information

Editors and Affiliations

CNRS, University of Bordeaux, Talence cedex, France
Ralf Klasing
LaBRI, University of Bordeaux, Talence cedex, France
Marc Zeitoun

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Crochemore, M., Héliou, A., Kucherov, G., Mouchard, L., Pissis, S.P., Ramusat, Y. (2017). Minimal Absent Words in a Sliding Window and Applications to On-Line Pattern Matching. In: Klasing, R., Zeitoun, M. (eds) Fundamentals of Computation Theory. FCT 2017. Lecture Notes in Computer Science(), vol 10472. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-55751-8_14

Download citation

DOI: https://doi.org/10.1007/978-3-662-55751-8_14
Published: 16 August 2017
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-55750-1
Online ISBN: 978-3-662-55751-8
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics