Skip to main content
Top

2018 | OriginalPaper | Chapter

Computing Burrows-Wheeler Similarity Distributions for String Collections

Authors : Felipe A. Louza, Guilherme P. Telles, Simon Gog, Liang Zhao

Published in: String Processing and Information Retrieval

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

In this article we present practical and theoretical improvements to the computation of the Burrows-Wheeler similarity distribution for all pairs of strings in a collection. Our algorithms take advantage of the Burrows-Wheeler transform (BWT) computed for the concatenation of all strings, instead of the pairwise construction of BWTs performed by the straightforward approach, and use compressed data structures that allow reductions of running time while still keeping a small memory footprint, as shown by a set of experiments with real datasets.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Baeza-Yates, R.A., Ribeiro-Neto, B.A.: Modern Information Retrieval: The Concepts and Technology Behind Search, 2nd edn. Pearson Education Ltd., Harlow (2011) Baeza-Yates, R.A., Ribeiro-Neto, B.A.: Modern Information Retrieval: The Concepts and Technology Behind Search, 2nd edn. Pearson Education Ltd., Harlow (2011)
2.
go back to reference Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report, Digital SRC Research Report (1994) Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report, Digital SRC Research Report (1994)
5.
go back to reference Gonnet, G.H., Baeza-Yates, R.A., Snider, T.: New indices for text: pat trees and pat arrays. In: Information Retrieval, pp. 66–82. Prentice-Hall Inc., Upper Saddle River (1992) Gonnet, G.H., Baeza-Yates, R.A., Snider, T.: New indices for text: pat trees and pat arrays. In: Information Retrieval, pp. 66–82. Prentice-Hall Inc., Upper Saddle River (1992)
6.
go back to reference Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 841–850. ACM/SIAM (2003) Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 841–850. ACM/SIAM (2003)
7.
go back to reference Louza, F.A., Gog, S., Telles, G.P.: Inducing enhanced suffix arrays for string collections. Theor. Comput. Sci. 678, 22–39 (2017)MathSciNetCrossRef Louza, F.A., Gog, S., Telles, G.P.: Inducing enhanced suffix arrays for string collections. Theor. Comput. Sci. 678, 22–39 (2017)MathSciNetCrossRef
8.
go back to reference Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)MathSciNetCrossRef Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)MathSciNetCrossRef
9.
go back to reference Mantaci, S., Restivo, A., Rosone, G., Sciortino, M.: An extension of the burrows wheeler transform and applications to sequence comparison and data compression. In: Apostolico, A., Crochemore, M., Park, K. (eds.) CPM 2005. LNCS, vol. 3537, pp. 178–189. Springer, Heidelberg (2005). https://doi.org/10.1007/11496656_16CrossRefMATH Mantaci, S., Restivo, A., Rosone, G., Sciortino, M.: An extension of the burrows wheeler transform and applications to sequence comparison and data compression. In: Apostolico, A., Crochemore, M., Park, K. (eds.) CPM 2005. LNCS, vol. 3537, pp. 178–189. Springer, Heidelberg (2005). https://​doi.​org/​10.​1007/​11496656_​16CrossRefMATH
10.
go back to reference Mantaci, S., Restivo, A., Rosone, G., Sciortino, M.: A new combinatorial approach to sequence comparison. Theory Comput. Syst. 42(3), 411–429 (2008)MathSciNetCrossRef Mantaci, S., Restivo, A., Rosone, G., Sciortino, M.: A new combinatorial approach to sequence comparison. Theory Comput. Syst. 42(3), 411–429 (2008)MathSciNetCrossRef
11.
go back to reference Mantaci, S., Restivo, A., Sciortino, M.: Distance measures for biological sequences: some recent approaches. Int. J. Approx. Reason. 47(1), 109–124 (2008)MathSciNetCrossRef Mantaci, S., Restivo, A., Sciortino, M.: Distance measures for biological sequences: some recent approaches. Int. J. Approx. Reason. 47(1), 109–124 (2008)MathSciNetCrossRef
13.
go back to reference Munro, J.I., Nekrich, Y., Vitter, J.S.: Fast construction of wavelet trees. Theor. Comput. Sci. 638, 91–97 (2016)MathSciNetCrossRef Munro, J.I., Nekrich, Y., Vitter, J.S.: Fast construction of wavelet trees. Theor. Comput. Sci. 638, 91–97 (2016)MathSciNetCrossRef
14.
go back to reference Muthukrishnan, S.: Efficient algorithms for document retrieval problems. In: Proceedings ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 657–666. ACM/SIAM (2002) Muthukrishnan, S.: Efficient algorithms for document retrieval problems. In: Proceedings ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 657–666. ACM/SIAM (2002)
15.
go back to reference Nong, G.: Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Trans. Inform. Syst. 31(3), 1–15 (2013)MathSciNetCrossRef Nong, G.: Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Trans. Inform. Syst. 31(3), 1–15 (2013)MathSciNetCrossRef
16.
go back to reference Ohlebusch, E.: Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag (2013) Ohlebusch, E.: Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag (2013)
17.
go back to reference Okanohara, D., Sadakane, K.: Practical entropy-compressed rank/select dictionary. In: Proceedings of Workshop on Algorithm Engineering and Experimentation (ALENEX). SIAM (2007) Okanohara, D., Sadakane, K.: Practical entropy-compressed rank/select dictionary. In: Proceedings of Workshop on Algorithm Engineering and Experimentation (ALENEX). SIAM (2007)
19.
go back to reference Paiva, J.G., Florian, L., Pedrini, H., Telles, G., Minghim, R.: Improved similarity trees and their application to visual data classification. IEEE Trans. Vis. Comput. Graph. 17(12), 2459–2468 (2011)CrossRef Paiva, J.G., Florian, L., Pedrini, H., Telles, G., Minghim, R.: Improved similarity trees and their application to visual data classification. IEEE Trans. Vis. Comput. Graph. 17(12), 2459–2468 (2011)CrossRef
20.
go back to reference Puglisi, S.J., Smyth, W.F., Turpin, A.H.: A taxonomy of suffix array construction algorithms. ACM Comp. Surv. 39(2), 1–31 (2007)CrossRef Puglisi, S.J., Smyth, W.F., Turpin, A.H.: A taxonomy of suffix array construction algorithms. ACM Comp. Surv. 39(2), 1–31 (2007)CrossRef
21.
go back to reference Yang, L., Zhang, X., Wang, T.: The Burrows-Wheeler similarity distribution between biological sequences based on Burrows-Wheeler transform. J. Theor. Biol. 262(4), 742–749 (2010)MathSciNetCrossRef Yang, L., Zhang, X., Wang, T.: The Burrows-Wheeler similarity distribution between biological sequences based on Burrows-Wheeler transform. J. Theor. Biol. 262(4), 742–749 (2010)MathSciNetCrossRef
Metadata
Title
Computing Burrows-Wheeler Similarity Distributions for String Collections
Authors
Felipe A. Louza
Guilherme P. Telles
Simon Gog
Liang Zhao
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-030-00479-8_23