Skip to main content

2017 | OriginalPaper | Buchkapitel

Flexible Indexing of Repetitive Collections

verfasst von : Djamal Belazzougui, Fabio Cunial, Travis Gagie, Nicola Prezza, Mathieu Raffinot

Erschienen in: Unveiling Dynamics and Complexity

Verlag: Springer International Publishing

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Highly repetitive strings are increasingly being amassed by genome sequencing experiments, and by versioned archives of source code and webpages. We describe practical data structures that support counting and locating all the exact occurrences of a pattern in a repetitive text, by combining the run-length encoded Burrows-Wheeler transform (RLBWT) with the boundaries of Lempel-Ziv 77 factors. One such variant uses an amount of space comparable to LZ77 indexes, but it answers count queries between two and four orders of magnitude faster than all LZ77 and hybrid index implementations, at the cost of slower locate queries. Combining the RLBWT with the compact directed acyclic word graph answers locate queries for short patterns between four and ten times faster than a version of the run-length compressed suffix array (RLCSA) that uses comparable memory, and with very short patterns our index achieves speedups even greater than ten with respect to RLCSA.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
4
We compile the sequential version of https://​github.​com/​adamnovak/​rlcsa with PSI_FLAGS and SA_FLAGS turned off (in other words, we use a gap-encoded bitvector rather than a succinct bitvector to mark sampled positions in the suffix array). The block size of psi vectors (RLCSA_BLOCK_SIZE) is 32 bytes.
 
5
We perform all experiments on a single core of a 6-core, 2.50 GHz, Intel Xeon E5-2640 processor, with access to 128 GiB of RAM and running CentOS 6.3. We measure resources with GNU Time 1.7, and we compile with GCC 5.3.0.
 
6
We use as patterns random substrings of each dataset, containing just DNA bases, generated with the genpatterns tool from the Pizza&Chili repetitive corpus.
 
Literatur
1.
Zurück zum Zitat Arroyuelo, D., Navarro, G., Sadakane, K.: Stronger Lempel-Ziv based compressed text indexing. Algorithmica 62, 54–101 (2012)MathSciNetCrossRefMATH Arroyuelo, D., Navarro, G., Sadakane, K.: Stronger Lempel-Ziv based compressed text indexing. Algorithmica 62, 54–101 (2012)MathSciNetCrossRefMATH
2.
Zurück zum Zitat Belazzougui, D.: Linear time construction of compressed text indices in compact space. In: Proceedings of the STOC, pp. 148–193 (2014) Belazzougui, D.: Linear time construction of compressed text indices in compact space. In: Proceedings of the STOC, pp. 148–193 (2014)
3.
Zurück zum Zitat Belazzougui, D., Cunial, F., Gagie, T., Prezza, N., Raffinot, M.: Composite repetition-aware data structures. In: Cicalese, F., Porat, E., Vaccaro, U. (eds.) CPM 2015. LNCS, vol. 9133, pp. 26–39. Springer, Cham (2015). doi:10.1007/978-3-319-19929-0_3 CrossRef Belazzougui, D., Cunial, F., Gagie, T., Prezza, N., Raffinot, M.: Composite repetition-aware data structures. In: Cicalese, F., Porat, E., Vaccaro, U. (eds.) CPM 2015. LNCS, vol. 9133, pp. 26–39. Springer, Cham (2015). doi:10.​1007/​978-3-319-19929-0_​3 CrossRef
4.
Zurück zum Zitat Blumer, A., et al.: Complete inverted files for efficient text retrieval and analysis. JACM 34, 578–595 (1987)MathSciNetCrossRef Blumer, A., et al.: Complete inverted files for efficient text retrieval and analysis. JACM 34, 578–595 (1987)MathSciNetCrossRef
5.
Zurück zum Zitat Chan, T.M., Larsen, K.G., Pătraşcu, M.: Orthogonal range searching on the RAM, revisited. In: Proceediings of the SoCG, pp. 1–10 (2011) Chan, T.M., Larsen, K.G., Pătraşcu, M.: Orthogonal range searching on the RAM, revisited. In: Proceediings of the SoCG, pp. 1–10 (2011)
6.
Zurück zum Zitat Crochemore, M., Hancart, C.: Automata for matching patterns. In: Rozenberg, G., et al. (eds.) Handbook of Formal Languages, pp. 399–462. Springer, Heidelberg (1997)CrossRef Crochemore, M., Hancart, C.: Automata for matching patterns. In: Rozenberg, G., et al. (eds.) Handbook of Formal Languages, pp. 399–462. Springer, Heidelberg (1997)CrossRef
7.
Zurück zum Zitat Crochemore, M., Vérin, R.: Direct construction of compact directed acyclic word graphs. In: Apostolico, A., Hein, J. (eds.) CPM 1997. LNCS, vol. 1264, pp. 116–129. Springer, Heidelberg (1997). doi:10.1007/3-540-63220-4_55 CrossRef Crochemore, M., Vérin, R.: Direct construction of compact directed acyclic word graphs. In: Apostolico, A., Hein, J. (eds.) CPM 1997. LNCS, vol. 1264, pp. 116–129. Springer, Heidelberg (1997). doi:10.​1007/​3-540-63220-4_​55 CrossRef
10.
Zurück zum Zitat Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: LZ77-based self-indexing with faster pattern matching. In: Pardo, A., Viola, A. (eds.) LATIN 2014. LNCS, vol. 8392, pp. 731–742. Springer, Heidelberg (2014). doi:10.1007/978-3-642-54423-1_63 CrossRef Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: LZ77-based self-indexing with faster pattern matching. In: Pardo, A., Viola, A. (eds.) LATIN 2014. LNCS, vol. 8392, pp. 731–742. Springer, Heidelberg (2014). doi:10.​1007/​978-3-642-54423-1_​63 CrossRef
11.
Zurück zum Zitat Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Cham (2014). doi:10.1007/978-3-319-07959-2_28 Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Cham (2014). doi:10.​1007/​978-3-319-07959-2_​28
12.
Zurück zum Zitat Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)CrossRefMATH Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)CrossRefMATH
13.
Zurück zum Zitat Kärkkäinen, J., Ukkonen, E.: Lempel-Ziv parsing and sublinear-size index structures for string matching. In: Proceedings of the WSP, pp. 141–155 (1996) Kärkkäinen, J., Ukkonen, E.: Lempel-Ziv parsing and sublinear-size index structures for string matching. In: Proceedings of the WSP, pp. 141–155 (1996)
14.
Zurück zum Zitat Kreft, S.: Self-index based on LZ77. Master’s thesis, Department of Computer Science, University of Chile (2010) Kreft, S.: Self-index based on LZ77. Master’s thesis, Department of Computer Science, University of Chile (2010)
16.
Zurück zum Zitat Mäkinen, V., Navarro, G.: Succinct suffix arrays based on run-length encoding. In: Apostolico, A., Crochemore, M., Park, K. (eds.) CPM 2005. LNCS, vol. 3537, pp. 45–56. Springer, Heidelberg (2005). doi:10.1007/11496656_5 CrossRef Mäkinen, V., Navarro, G.: Succinct suffix arrays based on run-length encoding. In: Apostolico, A., Crochemore, M., Park, K. (eds.) CPM 2005. LNCS, vol. 3537, pp. 45–56. Springer, Heidelberg (2005). doi:10.​1007/​11496656_​5 CrossRef
17.
Zurück zum Zitat Mäkinen, V., et al.: Storage and retrieval of highly repetitive sequence collections. JCB 17, 281–308 (2010)MathSciNet Mäkinen, V., et al.: Storage and retrieval of highly repetitive sequence collections. JCB 17, 281–308 (2010)MathSciNet
18.
Zurück zum Zitat Morrison, D.R.: PATRICIA – practical algorithm to retrieve information coded in alphanumeric. JACM 15, 514–534 (1968)CrossRef Morrison, D.R.: PATRICIA – practical algorithm to retrieve information coded in alphanumeric. JACM 15, 514–534 (1968)CrossRef
19.
Zurück zum Zitat Munro, J.I., Raman, V.: Succinct representation of balanced parentheses and static trees. SIAM J. Comput. 31, 762–776 (2002)MathSciNetCrossRefMATH Munro, J.I., Raman, V.: Succinct representation of balanced parentheses and static trees. SIAM J. Comput. 31, 762–776 (2002)MathSciNetCrossRefMATH
21.
Zurück zum Zitat Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G.: Run-length compressed indexes are superior for highly repetitive sequence collections. In: Amir, A., Turpin, A., Moffat, A. (eds.) SPIRE 2008. LNCS, vol. 5280, pp. 164–175. Springer, Heidelberg (2008). doi:10.1007/978-3-540-89097-3_17 CrossRef Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G.: Run-length compressed indexes are superior for highly repetitive sequence collections. In: Amir, A., Turpin, A., Moffat, A. (eds.) SPIRE 2008. LNCS, vol. 5280, pp. 164–175. Springer, Heidelberg (2008). doi:10.​1007/​978-3-540-89097-3_​17 CrossRef
22.
Zurück zum Zitat Valenzuela, D.: CHICO: a compressed hybrid index for repetitive collections. In: Goldberg, A.V., Kulikov, A.S. (eds.) SEA 2016. LNCS, vol. 9685, pp. 326–338. Springer, Cham (2016). doi:10.1007/978-3-319-38851-9_22 Valenzuela, D.: CHICO: a compressed hybrid index for repetitive collections. In: Goldberg, A.V., Kulikov, A.S. (eds.) SEA 2016. LNCS, vol. 9685, pp. 326–338. Springer, Cham (2016). doi:10.​1007/​978-3-319-38851-9_​22
23.
24.
Zurück zum Zitat Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE TIT 23, 337–343 (1977)MathSciNetMATH Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE TIT 23, 337–343 (1977)MathSciNetMATH
Metadaten
Titel
Flexible Indexing of Repetitive Collections
verfasst von
Djamal Belazzougui
Fabio Cunial
Travis Gagie
Nicola Prezza
Mathieu Raffinot
Copyright-Jahr
2017
DOI
https://doi.org/10.1007/978-3-319-58741-7_17

Premium Partner