Published in: The Journal of Supercomputing 2/2021

28.05.2020

High throughput BLAST algorithm using spark and cassandra

Authors: Fernando Cores, Fernando Guirado, Josep Lluis Lerida

Abstract

The rise of high-resolution, high-throughput sequencing technologies has driven the emergence of new fields of application such as precision medicine. However, it has also increased the storage and processing requirements of bioinformatics tools, which can only be met by high-performance, massive data processing infrastructures. Such infrastructures allow the development of scalable, efficient and reliable bioinformatics tools. In this paper, a new implementation of the Basic Local Alignment Search Tool (BLAST) algorithm is presented. Our proposal, named Sparky-Blast, uses the Cassandra database to store the reference datasets and the Apache Spark processing framework to compute the indexes and process the queries. This approach avoids the bottleneck of the original BLAST version, which is limited to the resources of a single machine. Sparky-Blast can use the distributed resources of a Big Data cluster to process queries in parallel, thus improving both response time and system throughput. At the same time, the use of a distributed architecture like Hadoop provides unlimited scalability in terms of both hardware infrastructure and performance.
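
The index-then-query idea described in the abstract can be illustrated with a minimal, single-process sketch. It is not the Sparky-Blast implementation: the function names `build_kmer_index` and `find_seeds`, the word length `k`, and the in-memory dictionary (standing in for the database tables) are assumptions made for illustration only. In a distributed setting, the indexing loop would run as a cluster job and the dictionary lookups would become database reads.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every length-k substring of the reference to its start offsets."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


def find_seeds(query: str, index: Dict[str, List[int]], k: int) -> List[Tuple[int, int]]:
    """Return (query_offset, reference_offset) pairs for exact k-mer matches."""
    seeds = []
    for q_pos in range(len(query) - k + 1):
        for r_pos in index.get(query[q_pos:q_pos + k], []):
            seeds.append((q_pos, r_pos))
    return seeds


# Toy data, chosen only to show the call pattern.
reference = "ACGTACGTTAGC"
index = build_kmer_index(reference, k=4)
seeds = find_seeds("CGTTAG", index, k=4)
print(seeds)
```

A full BLAST-style search would go on to extend and score these seeds; the sketch stops at the seeding step, which is the part that an index makes cheap.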

Footnotes
1
The maximum row/record size that can be written in Cassandra is 16 MB.
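
One way to stay under such a per-row limit is to split each reference sequence into fixed-size chunks before writing it. The sketch below only illustrates that idea; the function name `split_sequence`, the default chunk size, and the row layout are assumptions, not the schema used by Sparky-Blast.

```python
from typing import Iterator, Tuple

MAX_ROW_BYTES = 16 * 1024 * 1024  # row size limit stated in footnote 1


def split_sequence(seq_id: str, sequence: str,
                   chunk_bytes: int = 1024 * 1024) -> Iterator[Tuple[str, int, str]]:
    """Yield (seq_id, chunk_number, chunk) rows, each below the row limit."""
    if not 0 < chunk_bytes < MAX_ROW_BYTES:
        raise ValueError("chunk_bytes must be positive and below the row limit")
    # Nucleotide and protein letters are ASCII, so one character is one byte.
    for number, start in enumerate(range(0, len(sequence), chunk_bytes)):
        yield seq_id, number, sequence[start:start + chunk_bytes]


# Toy example with a tiny chunk size so the split is visible.
rows = list(split_sequence("chr_demo", "ACGT" * 5, chunk_bytes=8))
print(rows)
```

A real index would also have to handle matches that span chunk boundaries, for example by overlapping consecutive chunks; that is omitted here.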
 
2
The Sparky-Blast code is available in the following GitHub repository: https://github.com/Sherynan/SparkyBlast.
 
3
Executors are processes on the worker nodes whose job is to execute the tasks assigned to them for a Spark job. An executor runs tasks and keeps data in memory or on disk across them. Each Spark application has its own executors, which are launched at the start of the application and typically run for its entire lifetime. A single node can run multiple executors, and the executors of an application can span multiple worker nodes.
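
As a rough illustration of how executors translate into parallelism, the following sketch computes how many tasks an application can run at once from three standard Spark properties (`spark.executor.instances`, `spark.executor.cores`, `spark.task.cpus`). The values are made-up examples, not the configuration used in the paper, and the helper `concurrent_task_slots` is hypothetical. It assumes a fixed number of executors, that is, no dynamic allocation.

```python
from typing import Dict


def concurrent_task_slots(conf: Dict[str, str]) -> int:
    """Estimate how many tasks can run simultaneously across all executors."""
    executors = int(conf["spark.executor.instances"])
    cores_per_executor = int(conf["spark.executor.cores"])
    # Each task reserves spark.task.cpus cores; Spark's default is 1.
    cpus_per_task = int(conf.get("spark.task.cpus", "1"))
    return executors * (cores_per_executor // cpus_per_task)


# Example values only: 8 executors with 4 cores each.
example_conf = {
    "spark.executor.instances": "8",
    "spark.executor.cores": "4",
}
slots = concurrent_task_slots(example_conf)
print(slots)
```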
 
Metadata
Title
High throughput BLAST algorithm using spark and cassandra
Authors
Fernando Cores
Fernando Guirado
Josep Lluis Lerida
Publication date
28.05.2020
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 2/2021
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-020-03338-3
