Skip to main content
Erschienen in: The Journal of Supercomputing 3/2019

24.05.2018

Optimization of consistency-based multiple sequence alignment using Big Data technologies

verfasst von: Jordi Lladós, Fernando Cores, Fernando Guirado

Erschienen in: The Journal of Supercomputing | Ausgabe 3/2019

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

With the advent of new high-throughput next-generation sequencing technologies, the volume of genetic data processed has increased significantly. It is becoming essential for these applications to achieve large-scale alignments with thousands of sequences or even whole genomes. However, all current MSA tools have exhibited scalability issues when the number of sequences increases. The main drawback of these methods is that errors made in early pairwise alignments are propagated to the final result, affecting the accuracy of the global alignment. The use of consistency information enables the final result to be improved and makes it more stable from the accuracy point of view. However, such methods are severely limited by the memory required to store the consistency information. Authors in a previous work analyzed the structure and distribution of the data stored in the constraint library and demonstrated that it could be possible to reduce it without loosing accuracy, and thus it is possible to increase the number of sequences to be aligned. However, the execution time for obtaining the constraint library for a bigger number of sequences also increases greatly. In the present paper, the authors apply Big Data technologies to take advantage of the high degree of parallelism provided by the MapReduce paradigm in order to reduce considerably the library calculation time. Moreover, Big Data infrastructure provides a distributed storage system to improve the library scalability and machine-learning algorithms to enhance the consistency selection policies.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Fußnoten
1
T-Coffee-MEL sources and its installation instructions can be found at: github.com/jllados/TCoffee-MEL.
 
Literatur
1.
Zurück zum Zitat Dean J, Ghemawat S (2010) MapReduce: a flexible data processing tool. Commun ACM 53(1):72–77CrossRef Dean J, Ghemawat S (2010) MapReduce: a flexible data processing tool. Commun ACM 53(1):72–77CrossRef
2.
Zurück zum Zitat Do C, Brudno M, Batzoglou S (2004) PROBCONS: Probabilistic Consistency-based multiple alignment of amino acid sequences. In: Proceedings nineteenth national conference on artificial intelligence, pp 703–708 Do C, Brudno M, Batzoglou S (2004) PROBCONS: Probabilistic Consistency-based multiple alignment of amino acid sequences. In: Proceedings nineteenth national conference on artificial intelligence, pp 703–708
3.
Zurück zum Zitat Eddy SR (2009) A new generation of homology search tools based on probabilistic inference. Genome Inform 23:205–211 Eddy SR (2009) A new generation of homology search tools based on probabilistic inference. Genome Inform 23:205–211
4.
Zurück zum Zitat Gouy M, Guindon S, Gascuel O (2010) Seaview version 4: a multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Mol Biol Evol 27(2):221–224CrossRef Gouy M, Guindon S, Gascuel O (2010) Seaview version 4: a multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Mol Biol Evol 27(2):221–224CrossRef
5.
Zurück zum Zitat Gotoh O (1990) Consistency of optimal sequence alignments. Bull Math Biol 52(4):509–525CrossRefMATH Gotoh O (1990) Consistency of optimal sequence alignments. Bull Math Biol 52(4):509–525CrossRefMATH
6.
Zurück zum Zitat Just W (2001) Computational complexity of multiple sequence alignment with sp-score. J Comput Biol 8(6):615–623CrossRef Just W (2001) Computational complexity of multiple sequence alignment with sp-score. J Comput Biol 8(6):615–623CrossRef
7.
Zurück zum Zitat Katoh K, Misawa K, Kuma K, Miyata T (2002) Mafft: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research 30(14):3059–3066CrossRef Katoh K, Misawa K, Kuma K, Miyata T (2002) Mafft: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research 30(14):3059–3066CrossRef
8.
Zurück zum Zitat Karun AK, Chitharanjan K (2013) A review on hadoop—HDFS infrastructure extensions. In: IEEE Conference on Information & Communication Technologies, pp 132–137 Karun AK, Chitharanjan K (2013) A review on hadoop—HDFS infrastructure extensions. In: IEEE Conference on Information & Communication Technologies, pp 132–137
9.
Zurück zum Zitat Liu K, Linder CR, Warnow T (2010) Multiple sequence alignment: a major challenge to large-scale phylogenetics. PLoS Curr 2:RRN1198 Liu K, Linder CR, Warnow T (2010) Multiple sequence alignment: a major challenge to large-scale phylogenetics. PLoS Curr 2:RRN1198
10.
Zurück zum Zitat Lladós J, Cores F, Guirado F (2017) Efficient consistency library for multiple sequence alignment tools. Int Conf Comput Math Methods Sci Eng 4:1269–1280 Lladós J, Cores F, Guirado F (2017) Efficient consistency library for multiple sequence alignment tools. Int Conf Comput Math Methods Sci Eng 4:1269–1280
11.
Zurück zum Zitat Marks DS, Hopf TA, Sander C (2012) Protein structure prediction from sequence variation. Nat Biotech 30(11):1072–1080CrossRef Marks DS, Hopf TA, Sander C (2012) Protein structure prediction from sequence variation. Nat Biotech 30(11):1072–1080CrossRef
12.
Zurück zum Zitat Notredame C, Higgins DG, Heringa J (2000) T-coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302(1):205–217CrossRef Notredame C, Higgins DG, Heringa J (2000) T-coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302(1):205–217CrossRef
13.
Zurück zum Zitat Notredame C, Holm L, Higgins DG (1998) Coffee: an objective function for multiple sequence alignments. Bioinformatics 14(5):407–422CrossRef Notredame C, Holm L, Higgins DG (1998) Coffee: an objective function for multiple sequence alignments. Bioinformatics 14(5):407–422CrossRef
14.
Zurück zum Zitat Pruesse E, Peplies J, Glöckner FO (2012) SINA: accurate high throughput multiple sequence alignment of ribosomal RNA genes. Bioinformatics 28(14):1823–1829CrossRef Pruesse E, Peplies J, Glöckner FO (2012) SINA: accurate high throughput multiple sequence alignment of ribosomal RNA genes. Bioinformatics 28(14):1823–1829CrossRef
15.
Zurück zum Zitat Sadasivam G, Baktavatchalam G (2010) A novel approach to multiple sequence alignment using hadoop data grids. Int J Bioinform Res Appl 6(5):472–483CrossRef Sadasivam G, Baktavatchalam G (2010) A novel approach to multiple sequence alignment using hadoop data grids. Int J Bioinform Res Appl 6(5):472–483CrossRef
16.
17.
Zurück zum Zitat Sievers F, Dineen D, Wilm A, Higgins DG (2013) Making automated multiple alignments of very large numbers of protein sequences. Bioinformatics 29(8):989–995CrossRef Sievers F, Dineen D, Wilm A, Higgins DG (2013) Making automated multiple alignments of very large numbers of protein sequences. Bioinformatics 29(8):989–995CrossRef
18.
Zurück zum Zitat Sievers F, Dineen D, Wilm A, Higgins DG (2013) Making automated multiple alignments of very large numbers of protein sequences. Bioinformatics 29(8):989–995CrossRef Sievers F, Dineen D, Wilm A, Higgins DG (2013) Making automated multiple alignments of very large numbers of protein sequences. Bioinformatics 29(8):989–995CrossRef
19.
Zurück zum Zitat Subramanian AR, Weyer-Menkhoff J, Kaufmann M et al (2005) Dialign-T: an improved algorithm for segment-based multiple sequence alignment. BMC Bioinform 6:66CrossRef Subramanian AR, Weyer-Menkhoff J, Kaufmann M et al (2005) Dialign-T: an improved algorithm for segment-based multiple sequence alignment. BMC Bioinform 6:66CrossRef
20.
Zurück zum Zitat Thompson JD, Plewniak F, Poch O (1999) Balibase: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 15(1):87–88CrossRef Thompson JD, Plewniak F, Poch O (1999) Balibase: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 15(1):87–88CrossRef
21.
Zurück zum Zitat Wang L, Jiang T (1994) On the complexity of multiple sequence alignment. J Computat Biol 1(4):337–348CrossRef Wang L, Jiang T (1994) On the complexity of multiple sequence alignment. J Computat Biol 1(4):337–348CrossRef
22.
Zurück zum Zitat Zhang Y, Cao T, Li S, Tian X, Yuan L, Jia H, Vasilakos AV (2016) Parallel processing systems for Big Data: a survey. Proc IEEE 104(11):2114–2136CrossRef Zhang Y, Cao T, Li S, Tian X, Yuan L, Jia H, Vasilakos AV (2016) Parallel processing systems for Big Data: a survey. Proc IEEE 104(11):2114–2136CrossRef
23.
Zurück zum Zitat Zou Q, Hu Q, Guo M, Wang G (2015) HAlign: fast multiple similar DNA/RNA sequence alignment based on the centre star strategy. Bioinformatics 31(15):2475–2481CrossRef Zou Q, Hu Q, Guo M, Wang G (2015) HAlign: fast multiple similar DNA/RNA sequence alignment based on the centre star strategy. Bioinformatics 31(15):2475–2481CrossRef
Metadaten
Titel
Optimization of consistency-based multiple sequence alignment using Big Data technologies
verfasst von
Jordi Lladós
Fernando Cores
Fernando Guirado
Publikationsdatum
24.05.2018
Verlag
Springer US
Erschienen in
The Journal of Supercomputing / Ausgabe 3/2019
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-018-2424-4

Weitere Artikel der Ausgabe 3/2019

The Journal of Supercomputing 3/2019 Zur Ausgabe

Premium Partner