2018 | OriginalPaper | Chapter

Experiences in the Development of a Data Management System for Genomics

Authors: Stefano Ceri, Arif Canakoglu, Abdulrahman Kaitoua, Marco Masseroli, Pietro Pinoli

Published in: Data Management Technologies and Applications

Publisher: Springer International Publishing

Abstract

GMQL is a high-level query language for genomics that operates on datasets described through GDM, a unifying data model for processed data formats. Together, they support the integration of processed genomic datasets, i.e. of signals produced by the genome after sequencing and long data extraction pipelines. While most of the processing load of today’s genomic platforms is due to data extraction pipelines, we anticipate that attention will soon shift towards processed datasets, as such data are being collected by large consortia and are becoming increasingly available.
In our view, biology and personalized medicine will increasingly rely on data extraction and analysis methods for inferring new knowledge from existing heterogeneous repositories of processed datasets, typically augmented with the results of experimental data targeting individuals or small populations. While today’s big data are raw reads from sequencing machines, tomorrow’s big data will also include billions or trillions of genomic regions, each featuring specific values depending on the processing conditions.
Accordingly, GMQL is a high-level, declarative language inspired by big data management, and its execution engines include classic cloud-based systems, from Pig to Flink to SciDB to Spark. In this paper, we discuss how the GMQL execution environment has been developed, going through a major version change that marked a complete system redesign; we also discuss our experiences in comparatively evaluating the four platforms.

Footnotes
2 Data-Driven Genomic Computing, http://www.bioinformatics.deib.polimi.it/geco/, ERC Advanced Grant, 2016–2021.
3 GeCo V2 software is available at https://github.com/DEIB-GECO/GMQL.
Metadata
Title
Experiences in the Development of a Data Management System for Genomics
Authors
Stefano Ceri
Arif Canakoglu
Abdulrahman Kaitoua
Marco Masseroli
Pietro Pinoli
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-94809-6_10
