High throughput BLAST algorithm using spark and cassandra

Authors: Fernando Cores, Fernando Guirado, Josep Lluis Lerida

Published in: The Journal of Supercomputing | Issue 2/2021

Published: 28.05.2020

Abstract

The rise of high-resolution and high-throughput sequencing technologies has driven the emergence of new fields of application such as precision medicine. However, it has also increased the storage and processing requirements of bioinformatics tools, which can only be met by high-performance, massive data processing infrastructures. Such technologies allow the development of scalable, efficient and reliable bioinformatics tools. In this paper, a new implementation of the Basic Local Alignment Search Tool (BLAST) algorithm is presented. Our proposal, named Sparky-Blast, uses the Cassandra database to store the different reference datasets and the Apache Spark processing framework to calculate the indexes and process the queries. This approach avoids the bottleneck of the original BLAST version, which is limited to the resources of a single machine. Sparky-Blast can use the distributed resources of a Big-Data cluster to process queries in parallel, thus improving both the response time and the system throughput. At the same time, the use of a distributed architecture like Hadoop provides unlimited scalability from the point of view of both the hardware infrastructure and performance.
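
To make the "calculate the indexes and process the queries" step more concrete, the following is a minimal single-machine sketch of the k-mer seeding idea that BLAST-style tools are generally built on. It is not the Sparky-Blast implementation: the function names `build_kmer_index` and `find_seeds`, the toy sequences and the choice of k = 4 are all assumptions made for illustration, and a plain dictionary stands in for the distributed store.

```python
from collections import defaultdict


def build_kmer_index(references, k):
    """Map every k-mer to the (reference id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for ref_id, seq in references.items():
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((ref_id, offset))
    return index


def find_seeds(index, query, k):
    """Return (ref_id, ref_offset, query_offset) for each exact k-mer match."""
    seeds = []
    for q_off in range(len(query) - k + 1):
        # .get() avoids creating empty entries for k-mers absent from the index
        for ref_id, r_off in index.get(query[q_off:q_off + k], []):
            seeds.append((ref_id, r_off, q_off))
    return seeds


# Hypothetical toy data, for illustration only.
references = {"ref1": "ACGTACGTGA", "ref2": "TTGACGTT"}
index = build_kmer_index(references, k=4)
seeds = find_seeds(index, "GACGT", k=4)
print(seeds)
```

In a distributed setting the index construction and the per-query lookups are the parts that can be spread across workers, since each reference and each query can be handled independently; the extension and scoring of seeds into alignments is left out of this sketch.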

The maximum row/record size that can be written in Cassandra is 16 MB.
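
One way to stay under such a limit is to split a long reference sequence into several rows. The sketch below assumes this chunking approach; the source does not say how Sparky-Blast handles the limit, and the names `split_into_rows` and `MAX_ROW_BYTES` and the row layout `(seq_id, chunk_no, chunk)` are hypothetical. It counts only the sequence payload, so a real budget would need headroom for keys and metadata, and it does not overlap chunks, which an index spanning chunk boundaries would require.

```python
MAX_ROW_BYTES = 16 * 1024 * 1024  # the 16 MB limit quoted in the text


def split_into_rows(seq_id, sequence, max_bytes=MAX_ROW_BYTES):
    """Yield (seq_id, chunk_no, chunk) rows whose chunk is at most max_bytes long."""
    data = sequence.encode("ascii")  # nucleotide/protein letters are 1 byte each
    for chunk_no, start in enumerate(range(0, len(data), max_bytes)):
        yield (seq_id, chunk_no, data[start:start + max_bytes].decode("ascii"))


# Tiny limit so the behaviour is visible; real use would keep the default.
rows = list(split_into_rows("chr_demo", "ACGT" * 5, max_bytes=8))
print(rows)
```

Concatenating the chunks in `chunk_no` order restores the original sequence.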

The Sparky-Blast code is available in the following GitHub repository: https://github.com/Sherynan/SparkyBlast.

Executors are processes on the worker nodes that execute the tasks assigned to them for a Spark job. An executor runs tasks and keeps data in memory or on disk across them. Each Spark application has its own executors, which are launched at the beginning of the application and typically run for its entire lifetime. A single node can run multiple executors, and the executors of an application can span multiple worker nodes.
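
As a rough illustration of how several executors share a node, the sketch below estimates how many executors of a given size fit on one worker and derives the matching Spark properties. The cluster figures (4 nodes, 16 cores, 64 GB) and the helper names are assumptions, not values from the paper; `spark.executor.instances`, `spark.executor.cores` and `spark.executor.memory` are standard Spark configuration keys. The estimate ignores memory overhead and resources reserved for the operating system and cluster manager.

```python
def executors_per_node(node_cores, node_mem_gb, executor_cores, executor_mem_gb):
    """Number of executors of the given size that fit on one worker node."""
    return min(node_cores // executor_cores, node_mem_gb // executor_mem_gb)


def spark_conf(num_nodes, node_cores, node_mem_gb, executor_cores, executor_mem_gb):
    """Build Spark properties for a cluster of identical worker nodes."""
    per_node = executors_per_node(
        node_cores, node_mem_gb, executor_cores, executor_mem_gb
    )
    return {
        "spark.executor.instances": str(num_nodes * per_node),
        "spark.executor.cores": str(executor_cores),
        "spark.executor.memory": f"{executor_mem_gb}g",
    }


# Hypothetical cluster: 4 workers with 16 cores and 64 GB each.
conf = spark_conf(num_nodes=4, node_cores=16, node_mem_gb=64,
                  executor_cores=4, executor_mem_gb=12)
print(conf)
```

Each key/value pair can be passed to `spark-submit` as `--conf key=value`.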

Title: High throughput BLAST algorithm using spark and cassandra
Authors: Fernando Cores, Fernando Guirado, Josep Lluis Lerida
Publication date: 28.05.2020
Publisher: Springer US
Published in: The Journal of Supercomputing / Issue 2/2021
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-020-03338-3
