
2017 | Original Paper | Book Chapter

Conceptual Modeling for Genomics: Building an Integrated Repository of Open Data

Written by: Anna Bernasconi, Stefano Ceri, Alessandro Campi, Marco Masseroli

Published in: Conceptual Modeling

Publisher: Springer International Publishing


Abstract

Many repositories of open genomic data, collected by worldwide consortia, are important enablers of biological research; moreover, all experimental datasets leading to publications in genomics must be deposited in public repositories and made available to the research community. Biologists typically use these datasets to validate or enrich their own experiments; the content of each dataset is documented by metadata. However, the emphasis on data sharing is not matched by accuracy in data documentation: metadata are not standardized across sources and are often unstructured and incomplete.
In this paper, we propose a conceptual model of genomic metadata whose purpose is to query the underlying data sources in order to locate relevant experimental datasets. First, we analyze the most typical metadata attributes of genomic sources and define their semantic properties. Then, we use a top-down method to build a global-as-view integrated schema by abstracting the most important conceptual properties of genomic sources. Finally, we describe the validation of the conceptual model by mapping it to three well-known data sources: TCGA, ENCODE, and Gene Expression Omnibus.
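
To make the global-as-view idea concrete, the following is a minimal sketch, not the authors' schema. It assumes a toy global schema with three attributes (`source`, `tissue`, `technique`) and defines each global attribute as a view over the native records of each source. All attribute and field names below are hypothetical placeholders chosen for illustration; the abstract does not list the actual attributes.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]

# Global-as-view: every attribute of the global schema is defined as a view
# (here, a small function) over the native records of each source.
# ASSUMPTION: the global attributes and all source field names are
# illustrative placeholders, not the schema proposed in the paper.
GAV_MAPPINGS: Dict[str, Dict[str, Callable[[Record], str]]] = {
    "TCGA": {
        "tissue": lambda r: r.get("tumor_tissue_site", ""),
        "technique": lambda r: r.get("experimental_strategy", ""),
    },
    "ENCODE": {
        "tissue": lambda r: r.get("biosample_term_name", ""),
        "technique": lambda r: r.get("assay_title", ""),
    },
    "GEO": {
        "tissue": lambda r: r.get("source_name_ch1", ""),
        "technique": lambda r: r.get("library_strategy", ""),
    },
}


def to_global(source: str, record: Record) -> Record:
    """Evaluate the views of one source on one native record."""
    out: Record = {"source": source}
    for attribute, view in GAV_MAPPINGS[source].items():
        # Light normalization so values from different sources are comparable.
        out[attribute] = view(record).strip().lower()
    return out


def query(records: Iterable[Record], **conditions: str) -> List[Record]:
    """Select global records whose attributes equal all given conditions."""
    return [
        r for r in records
        if all(r.get(key) == value for key, value in conditions.items())
    ]


integrated = [
    to_global("TCGA", {"tumor_tissue_site": "Breast",
                       "experimental_strategy": "RNA-Seq"}),
    to_global("ENCODE", {"biosample_term_name": "breast",
                         "assay_title": "ChIP-seq"}),
    to_global("GEO", {"source_name_ch1": "Liver",
                      "library_strategy": "RNA-Seq"}),
]
breast_datasets = query(integrated, tissue="breast")
```

In this sketch a single query over the global attribute `tissue` locates datasets from more than one source, which is the kind of cross-source search the conceptual model is meant to support. In practice, matching values would more plausibly rely on ontology terms than on lowercased strings.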

Footnotes
2
Data-Driven Genomic Computing, http://www.bioinformatics.deib.polimi.it/geco/, ERC Advanced Grant, 2016–2021.
3
At https://www.encodeproject.org/profiles/graph.svg see the conceptual model of ENCODE, an ER schema with tens of entities and hundreds of relationships, which is neither readable nor supported by metadata for most concepts.
5
We will use the BRENDA Tissue and Enzyme Source Ontology [32] for tissues, the Cell Line Ontology [31] for cell lines, and the Human Disease Ontology [33] for human diseases.
7
Textual analysis to extract semantic information from the GEO repository is reported in [12]; we plan to reuse their library.
8
The metadata is provided in the NCI Genomic Data Commons portal, https://docs.gdc.cancer.gov/Data_Dictionary/viewer/.
9
GEO information can be retrieved through the R package GEOmetadb [37].

References
1.
Adams, D., et al.: BLUEPRINT to decode the epigenetic signature written in blood. Nat. Biotechnol. 30(3), 224–226 (2012)
2.
Albrecht, F., et al.: DeepBlue epigenomic data server: programmatic data retrieval and analysis of epigenome. Nucleic Acids Res. 44(W1), W581–W586 (2016)
3.
Barrett, T., et al.: BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Res. 40(D1), 57–63 (2012)
4.
Barrett, T., et al.: NCBI GEO: archive for functional genomics data sets – update. Nucleic Acids Res. 41(Database issue), D991–D995 (2013)
5.
Bornberg-Bauer, E., Paton, N.W.: Conceptual data modelling for bioinformatics. Brief. Bioinform. 3(2), 166–180 (2002)
6.
Buneman, P., et al.: A data transformation system for biological data sources. In: International Conference on Very Large Data Bases, pp. 158–169 (1995)
7.
Cumbo, F., et al.: TCGA2BED: extracting, extending, integrating, and querying The Cancer Genome Atlas. BMC Bioinform. 18(6), 1–9 (2017)
8.
Davidson, S.B., et al.: Biokleisli: a digital library for biomedical researchers. Int. J. Digit. Libr. 1(1), 36–53 (1997)
9.
Davidson, S.B., et al.: K2/Kleisli and GUS: experiments in integrated access to genomic data sources. IBM Syst. J. 40(2), 512–531 (2001)
10.
El-Ghalayini, H., et al.: Deriving conceptual data models from domain ontologies for bioinformatics. In: 2006 2nd Information and Communication Technologies, ICTTA 2006, vol. 2, pp. 3562–3567 (2006)
11.
Fernández, J.D., et al.: Ontology-based search of genomic metadata. IEEE/ACM Trans. Comput. Biol. Bioinform. 13(2), 233–247 (2016)
12.
Galeota, E., Pelizzola, M.: Ontology-based annotations and semantic relations in large-scale (epi)genomics data. Brief. Bioinform. 18(3), 403–412 (2017)
13.
Haider, S., et al.: BioMart Central Portal - unified access to biological data. Nucleic Acids Res. 37(Web Server issue), 23–27 (2009)
14.
Hernandez, T., Kambhampati, S.: Integration of biological sources: current systems and challenges ahead. SIGMOD Rec. 33(3), 51–60 (2004)
15.
Idrees, M., et al.: A review: conceptual data models for biological domain. JAPS, J. Anim. Plant Sci. 25(2), 337–345 (2015)
16.
Ji, F., Elmasri, R., et al.: Incorporating concepts for bioinformatics data modeling into EER models. In: ACS/IEEE International Conference on Computer Systems and Applications, pp. 189–192. IEEE Computer Society, Washington, DC, USA (2005)
17.
Kaitoua, A., Pinoli, P., Bertoni, M., Ceri, S.: Framework for supporting genomic operations. IEEE Trans. Comput. 66(3), 443–457 (2017)
18.
Keet, M.C.: Biological data and conceptual modelling method. J. Concept. Model. 29(1), 1–14 (2003)
19.
Kundaje, A., et al.: Integrative analysis of 111 reference human epigenomes. Nature 518(7539), 317–330 (2015)
20.
Lenzerini, M.: Data integration: a theoretical perspective. In: Symposium on Principles of Database Systems, PODS, pp. 233–246. ACM, New York, NY, USA (2002)
21.
Louie, B., et al.: Data integration and genomic medicine. J. Biomed. Inform. 40(1), 5–16 (2007)
22.
Masseroli, M., Canakoglu, A., Ceri, S.: Integration and querying of genomic and proteomic semantic annotations for biomedical knowledge extraction. IEEE/ACM Trans. Comput. Biol. Bioinform. 13(2), 209–219 (2016)
23.
Masseroli, M., et al.: GenoMetric Query Language: a novel approach to large-scale genomic data management. Bioinformatics 31(12), 1881–1888 (2015)
24.
Masseroli, M., et al.: Modeling and interoperability of heterogeneous genomic big data for integrative processing and querying. Methods 111, 3–11 (2016)
25.
Rechenmann, F.: Data modeling: the key to biological data integration. EMBnet. J. 18(B), 59–60 (2012)
26.
Anonymous paper. Accelerating bioinformatics research with new software for big data to knowledge (BD2K), Paradigm4, April 2015. www.paradigm4.com
27.
Consortium 1000Genomes: A map of human genome variation from population-scale sequencing. Nature 467(7319), 1061–1073 (2010)
28.
Consortium ENCODE: An integrated encyclopedia of DNA elements in the human genome. Nature 489(7414), 57–74 (2012)
29.
Reyes Román, J.F., Pastor, Ó., Casamayor, J.C., Valverde, F.: Applying conceptual modeling to better understand the human genome. In: Comyn-Wattiau, I., Tanaka, K., Song, I.-Y., Yamamoto, S., Saeki, M. (eds.) ER 2016. LNCS, vol. 9974, pp. 404–412. Springer, Cham (2016). doi:10.1007/978-3-319-46397-1_31
30.
Roy, A., et al.: Massively parallel processing of whole genome sequence data: an in-depth performance study. In: Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD 2017, Chicago, Illinois, USA, 14–19 May 2017, pp. 187–202. ACM, New York (2017)
31.
Sarntivijai, S., et al.: CLO: the cell line ontology. J. Biomed. Semant. 5(1), 37 (2014)
32.
Schomburg, I., et al.: BRENDA in 2013: new options and contents in BRENDA. Nucleic Acids Res. 41(Database issue), D764–D772 (2013)
33.
Schriml, L.M., et al.: Disease Ontology: a backbone for disease semantic integration. Nucleic Acids Res. 40(Database issue), 940–946 (2012)
34.
Smedley, D., et al.: The BioMart community portal: an innovative alternative to large, centralized data repositories. Nucleic Acids Res. 43(W1), 589–598 (2015)
35.
Wang, L., et al.: BioStar models of clinical and genomic data for biomedical data warehouse design. Int. J. Bioinform. Res. Appl. 1(1), 63–80 (2005)
36.
Weinstein, J.N., et al.: The cancer genome atlas pan-cancer analysis project. Nat. Genet. 45(10), 1113–1120 (2013)
37.
Zhu, Y., et al.: GEOmetadb: powerful alternative search engine for the gene expression omnibus. Bioinformatics 24(23), 2798–2800 (2008)
Metadata
Title
Conceptual Modeling for Genomics: Building an Integrated Repository of Open Data
Written by
Anna Bernasconi
Stefano Ceri
Alessandro Campi
Marco Masseroli
Copyright Year
2017
DOI
https://doi.org/10.1007/978-3-319-69904-2_26