2020 | Original Paper | Book Chapter

Extracting Insights: A Data Centre Architecture Approach in Million Genome Era

Authors: Tariq Abdullah, Ahmed Ahmet

Published in: Transactions on Large-Scale Data- and Knowledge-Centered Systems XLVI

Publisher: Springer Berlin Heidelberg

Abstract

Advances in high-throughput sequencing technologies have drastically reduced the price of genome sequencing and led to exponential growth in the generation of genomic sequencing data. Genomic data is often stored in shared repositories and is both heterogeneous and unstructured in nature. Both technically and culturally, it belongs to the big data domain because of the challenges of volume, velocity and variety.

Appropriate data storage and management, processing and analytic models are required to meet the growing challenges of genomic and clinical data. Existing research on the storage, management and analysis of genomic and clinical data does not provide a comprehensive solution: it offers either a Hadoop-based solution that lacks a robust computing solution for data mining and knowledge discovery, or a distributed in-memory solution that is effective in reducing runtime but lacks robustness in data storage, resource management, reservation, and scheduling.

In this paper, we present a scalable and elastic framework for genomic data storage, management, and processing that addresses the weaknesses of existing approaches. Fundamental to our framework is a distributed resource management system with a plug-and-play NoSQL component and an in-memory, distributed computing framework with machine learning and visualisation plugin tools. We evaluate the Avro, CSV, HBase, ORC, and Parquet datastores and benchmark their performance. A case study of machine-learning-based genotype clustering is presented to demonstrate and evaluate the effectiveness of the presented framework. The results show an overall performance improvement of the genomics data analysis pipeline of 49% over existing approaches. Finally, we make recommendations on state-of-the-art technology and tools for effective architecture approaches to the management of, and knowledge discovery from, large datasets.
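To make the genotype clustering case study more concrete, the following is a minimal sketch only. The abstract does not state which clustering algorithm, genotype encoding, or library the authors used, so everything here is an assumption: genotypes are encoded as alternate-allele counts (0, 1, 2) per variant site, the algorithm is plain k-means with a deterministic farthest-point initialisation, and the toy matrix and the names `kmeans` and `farthest_point_init` are hypothetical. It runs on a single machine in pure Python, whereas the framework described above is distributed and in-memory.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(samples, k):
    """Deterministic initialisation (an assumption, not the paper's method):
    start from the first sample, then repeatedly add the sample that is
    farthest from all centroids chosen so far."""
    centroids = [list(samples[0])]
    while len(centroids) < k:
        farthest = max(
            samples,
            key=lambda s: min(squared_distance(s, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(samples, k, iterations=20):
    """Plain k-means: alternate assignment and centroid update steps."""
    centroids = farthest_point_init(samples, k)
    labels = [0] * len(samples)
    for _ in range(iterations):
        # Assignment step: attach each sample to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(s, centroids[c]))
            for s in samples
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [s for s, label in zip(samples, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical toy data: rows are individuals, columns are variant sites,
# values are alternate-allele counts (0 = hom-ref, 1 = het, 2 = hom-alt).
genotypes = [
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [2, 2, 1, 2],
    [2, 1, 2, 2],
    [1, 2, 2, 2],
]

labels, centroids = kmeans(genotypes, k=2)
print(labels)  # cluster index per individual
```

On this toy matrix the first three individuals and the last three end up in separate clusters. The farthest-point initialisation is used here only to keep the result reproducible; random initialisation can leave k-means in a poor local optimum on small inputs.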

Title: Extracting Insights: A Data Centre Architecture Approach in Million Genome Era
Authors: Tariq Abdullah, Ahmed Ahmet
Publisher: Springer Berlin Heidelberg
Book: Transactions on Large-Scale Data- and Knowledge-Centered Systems XLVI
Print ISBN: 978-3-662-62385-5
Electronic ISBN: 978-3-662-62386-2
Copyright year: 2020
DOI: https://doi.org/10.1007/978-3-662-62386-2_1
