Skip to main content
Erschienen in: Cognitive Computation 4/2016

01.08.2016

On Identifying Minimal Absent and Unique Words: An Efficient Scheme

verfasst von: Aqil M. Azmi

Erschienen in: Cognitive Computation | Ausgabe 4/2016

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

One of the basic tasks in genomic research is the analysis of a sequence. An absent word in a sequence is a substring that does not occur in the given sequence. Many studies looked into finding the shortest absent words, with some recent studies noting that longer absent words are also of interest. A simple extension of the shortest ones is impractical as the list tends to grow exponentially in the size of the sequence. A better choice is the minimal absent words, since these are known to grow linearly in the size of the sequence. An absent word is minimal if none of its proper factors is missing in the sequence. Similarly, it is (left-fixed) minimal unique if none of its proper prefixes is unique. In this paper we present an efficient algorithm that discovers all words up to a user-specified length that are either minimal absent or are left-fixed minimal unique in the input sequence. We employ a purely deterministic approach which guarantees nothing is overlooked. At each successive iteration, the algorithm works on larger words using a simple list structure for all the operations. Theoretically, the algorithm has a space complexity that is linear with the size of input sequence, while the time bound scales well with alphabet size. Experimental results using real biological sequences and randomly generated ones using different-sized alphabets show that the algorithm has a linearity in time behavior.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Abouelhoda MI, Ghanem M. String mining in bioinformatics. In: Gaber MM, editor. Scientific data mining and knowledge discovery. Berlin: Springer; 2010. p. 207–47. Abouelhoda MI, Ghanem M. String mining in bioinformatics. In: Gaber MM, editor. Scientific data mining and knowledge discovery. Berlin: Springer; 2010. p. 207–47.
2.
Zurück zum Zitat Abouelhoda MI, Kurtz S, Ohlebusch E. Replacing suffix trees with enhanced suffix arrays. J Discret Algorithms. 2004;2(1):53–86.CrossRef Abouelhoda MI, Kurtz S, Ohlebusch E. Replacing suffix trees with enhanced suffix arrays. J Discret Algorithms. 2004;2(1):53–86.CrossRef
3.
Zurück zum Zitat Azmi AM, Al-Ssulami AM. Discovering common recurrent patterns in multiple strings over large alphabets. Pattern Recognit Lett. 2015;54(3):75–81.CrossRef Azmi AM, Al-Ssulami AM. Discovering common recurrent patterns in multiple strings over large alphabets. Pattern Recognit Lett. 2015;54(3):75–81.CrossRef
4.
Zurück zum Zitat Béal MP, Mignosi F, Restivo A. Minimal forbidden words and symbolic dynamics. In: STACS 96 (Grenoble, 1996), Springer, Berlin, lecture notes in computer science; 1996. vol 1046, p. 555–66. Béal MP, Mignosi F, Restivo A. Minimal forbidden words and symbolic dynamics. In: STACS 96 (Grenoble, 1996), Springer, Berlin, lecture notes in computer science; 1996. vol 1046, p. 555–66.
5.
Zurück zum Zitat Chairungsee S, Crochemore M. Building phylogeny with minimal absent words. In: Bouchou-Markhoff B, Caron P, Champarnaud JM, Maurel D, editors. Implementation and application of automata, vol. 6807., lecture notes in computer science. Berlin Heidelberg: Springer-Verlag; 2011. p. 100–9. Chairungsee S, Crochemore M. Building phylogeny with minimal absent words. In: Bouchou-Markhoff B, Caron P, Champarnaud JM, Maurel D, editors. Implementation and application of automata, vol. 6807., lecture notes in computer science. Berlin Heidelberg: Springer-Verlag; 2011. p. 100–9.
6.
Zurück zum Zitat Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to algorithms. 2nd ed. Cambridge: The MIT Press; 2001. Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to algorithms. 2nd ed. Cambridge: The MIT Press; 2001.
7.
Zurück zum Zitat Darling A, Mau BF, Blattner N, Perna. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 2004;14(7):394–403.CrossRef Darling A, Mau BF, Blattner N, Perna. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 2004;14(7):394–403.CrossRef
8.
Zurück zum Zitat Garcia SP, Pinho AJ, Rodrigues JMOS, Bastos CAC, Ferreira PJSG. Minimal absent words in prokaryotic and eukaryotic genomes. PLoS One. 2011;6(1):e16065.CrossRefPubMedPubMedCentral Garcia SP, Pinho AJ, Rodrigues JMOS, Bastos CAC, Ferreira PJSG. Minimal absent words in prokaryotic and eukaryotic genomes. PLoS One. 2011;6(1):e16065.CrossRefPubMedPubMedCentral
9.
Zurück zum Zitat Gusfield D. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge: Cambridge University Press; 1997.CrossRef Gusfield D. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge: Cambridge University Press; 1997.CrossRef
10.
Zurück zum Zitat Hampikian G, Andersen T. Absent sequences: nullomers and primes. Pac Symp Biocomput. 2000;12:355–66. Hampikian G, Andersen T. Absent sequences: nullomers and primes. Pac Symp Biocomput. 2000;12:355–66.
13.
Zurück zum Zitat Hu X, Pei J, Tai Y. Shortest unique queries on strings. In: Proceedings of the 21st International symposium on string processing and information retrieval (SPIRE 2014); 2014. vol 8799, p. 161–72. Hu X, Pei J, Tai Y. Shortest unique queries on strings. In: Proceedings of the 21st International symposium on string processing and information retrieval (SPIRE 2014); 2014. vol 8799, p. 161–72.
14.
Zurück zum Zitat Malyshev DA, Dhami K, Lavergne T, Chen T, Dai N, Foster JM, Correa IR, Romesberg FE. A semi-synthetic organism with an expanded genetic alphabet. Nature. 2014. doi:10.1038/nature13314. Malyshev DA, Dhami K, Lavergne T, Chen T, Dai N, Foster JM, Correa IR, Romesberg FE. A semi-synthetic organism with an expanded genetic alphabet. Nature. 2014. doi:10.​1038/​nature13314.
15.
Zurück zum Zitat McCreight EM. A space-economical suffix tree construction algorithm . J ACM. 1976;23(2):262–72.CrossRef McCreight EM. A space-economical suffix tree construction algorithm . J ACM. 1976;23(2):262–72.CrossRef
18.
Zurück zum Zitat Ukkonen E. On-line construction of suffix trees. Algorithmica. 1995;14(3):249–60.CrossRef Ukkonen E. On-line construction of suffix trees. Algorithmica. 1995;14(3):249–60.CrossRef
19.
Zurück zum Zitat Wu ZD, Jiang T, Su WJ. Efficient computation of shortest absent words in a genomic sequence. Inf Process Lett. 2010;110(14–15):596–601.CrossRef Wu ZD, Jiang T, Su WJ. Efficient computation of shortest absent words in a genomic sequence. Inf Process Lett. 2010;110(14–15):596–601.CrossRef
Metadaten
Titel
On Identifying Minimal Absent and Unique Words: An Efficient Scheme
verfasst von
Aqil M. Azmi
Publikationsdatum
01.08.2016
Verlag
Springer US
Erschienen in
Cognitive Computation / Ausgabe 4/2016
Print ISSN: 1866-9956
Elektronische ISSN: 1866-9964
DOI
https://doi.org/10.1007/s12559-016-9385-9

Weitere Artikel der Ausgabe 4/2016

Cognitive Computation 4/2016 Zur Ausgabe

Premium Partner