2018 | OriginalPaper | Chapter

Experiences in the Development of a Data Management System for Genomics

Authors: Stefano Ceri, Arif Canakoglu, Abdulrahman Kaitoua, Marco Masseroli, Pietro Pinoli

Published in: Data Management Technologies and Applications

Publisher: Springer International Publishing

Abstract

GMQL is a high-level query language for genomics that operates on datasets described through GDM, a unifying data model for processed data formats. Together, they enable the integration of processed genomic datasets, i.e., of signals produced by the genome after sequencing and long data extraction pipelines. While most of the processing load of today’s genomic platforms is due to data extraction pipelines, we anticipate that attention will soon shift towards processed datasets, as such data are being collected by large consortia and are becoming increasingly available.

In our view, biology and personalized medicine will increasingly rely on data extraction and analysis methods for inferring new knowledge from existing heterogeneous repositories of processed datasets, typically augmented with experimental data targeting individuals or small populations. While today’s big data are the raw reads of sequencing machines, tomorrow’s big data will also include billions or trillions of genomic regions, each featuring specific values depending on the processing conditions.

Accordingly, GMQL is a high-level, declarative language inspired by big data management, and its execution engines include classic cloud-based systems: Pig, Flink, SciDB, and Spark. In this paper, we discuss how the GMQL execution environment has been developed, going through a major version change that marked a complete system redesign; we also discuss our experiences in comparatively evaluating the four platforms.

http://www.bioinformatics.deib.polimi.it/gendata/, PRIN Italian National Project, 2013–2016.

Data-Driven Genomic Computing, http://www.bioinformatics.deib.polimi.it/geco/, ERC Advanced Grant, 2016–2021.

GeCo V2 software is available at https://github.com/DEIB-GECO/GMQL.

Title: Experiences in the Development of a Data Management System for Genomics
Authors: Stefano Ceri, Arif Canakoglu, Abdulrahman Kaitoua, Marco Masseroli, Pietro Pinoli
Publisher: Springer International Publishing
Book: Data Management Technologies and Applications
Print ISBN: 978-3-319-94808-9
Electronic ISBN: 978-3-319-94809-6
Copyright Year: 2018
DOI: https://doi.org/10.1007/978-3-319-94809-6_10
