nach oben

Erschienen in:

2017 | OriginalPaper | Buchkapitel

Scalable Gene Sequence Analysis on Spark

verfasst von : Muthahar Syed, Taehyun Hwang, Jinoh Kim

Erschienen in: Big Data and Visual Analytics

Verlag: Springer International Publishing

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Scientific advances in technology have helped in digitizing genetic information, which resulted in the generation of the humongous amount of genetic sequences, and analysis of such large-scale sequencing data is the primary concern. This chapter introduces a scalable genome sequence analysis system, which makes use of parallel computing features of Apache Spark and its relational processing module called Spark Structured Query Language (Spark SQL). The Spark framework provides an efficient data reuse feature by holding the data in memory, increasing performance substantially. The introduced system also provides a web-based interface, by which users can specify the search criteria, and Spark SQL performs search operations on the data stored in memory. Experiments detailed in this chapter make use of publicly available 1000 genome Variant Calling Format (VCF) data (Size 1.2TB) as input. The input data are analyzed using Spark and the end results are evaluated to measure the scalability and performance of the system.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Vorheriges Kapitel Big Data Framework for Agile Business (BDFAB) As a Basis for Developing Holistic Strategies in Big Data Adoption

Nächstes Kapitel Big Sensor Data Acquisition and Archiving with Compression

Human Genome Project: [Online]. Available: https://www.genome.gov/10001772. Accessed 22 Nov 2015 (2003)

DNA Sequencing Costs: [Online]. Available: http://www.genome.gov/sequencingcosts/. Accessed 31 Jan 2016 (2016)

Lewis, S., Csordas, A., Killcoyne, S., Hermjakob, H., Hoopmann, M., Moritz, R., Deutsch, E., Boyle, J.: Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework. BMC Bioinf., 13, 6 (2012)

Apache Spark Project: [Online]. Available: http://spark.apache.org. Accessed 2015 (2015)

Apache Pig: [Online]. Available: http://pig.apache.org/ (2015)

Apache Hive: [Online]. Available: https://cwiki.apache.org/confluence/display/Hive/Home (2016)

Weil, S.A., Brandt, S.A., Miller, E.L., Long, D.D.E., Maltzahn, C.: Ceph: a scalable, high-performance distributed file system. In: Proceedings of the 7th Symposium on Operating Systems Design and Implementation (2006)

Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the 2nd short Title USENIX Conference on Hot Topics in Cloud Computing, HotCloud’10, Berkeley, CA, USA (2010)

Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. UCB/EECS (2011)

10.

Apache Spark Documentation: [Online]. Available: http://spark.apache.org (2016)

11.

McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., DePristo, M.A.: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010)CrossRef

12.

Massie, M., Nothaft, F.A., Hartl, C., Kozanitis, C., Schumacher, A., Joseph, A.D., Patterson, D.: ADAM: Genomics Formats and Processing Patterns for Cloud Scale Computing. DEECS Department, University of California, Berkeley (2013)

13.

O’Brien, A.R., Saunders, N.F.W., Guo, Y., Buske, F.A., Scott, R.J., Bauer, D.C.: VariantSpark: population scale clustering of genotype information. BMC Genomics. 16(1), 1052 (2015)CrossRef

14.

White, T.: Hadoop: The Definitive Guide. O’Reilly Media, Newton, MA (2015)

15.

VCF File processing: [Online]. Available: http://vcftools.sourceforge.net (2014)

16.

1000 Genome Data: [Online]. Available: http://www.1000genomes.org/data (2014)

Titel: Scalable Gene Sequence Analysis on Spark
verfasst von: Muthahar Syed
Taehyun Hwang
Jinoh Kim
Verlag: Springer International Publishing
Buch: Big Data and Visual Analytics
Print ISBN: 978-3-319-63915-4

Electronic ISBN: 978-3-319-63917-8

Copyright-Jahr: 2017
DOI: https://doi.org/10.1007/978-3-319-63917-8_6

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Premium Partner