Published in: Datenbank-Spektrum 2/2017

16 June 2017 | Technical article

Efficiently Storing and Analyzing Genome Data in Database Systems

Authors: Sebastian Dorok, Sebastian Breß, Jens Teubner, Horstfried Läpple, Gunter Saake, Volker Markl

Abstract

Genome analysis enables researchers to detect mutations within genomes and deduce their consequences. Researchers need reliable analysis platforms to ensure reproducible and comprehensive analysis results. Database systems provide vital support for implementing the sustainable procedures this requires. Nevertheless, they are not used throughout the complete genome-analysis process, because (1) database systems suffer from high storage overhead for genome data and (2) they introduce overhead during domain-specific analysis. To overcome these limitations, we integrate genome-specific compression into database systems using a specialized database schema, which reduces the storage consumption of a database approach by up to 35%. Moreover, we exploit genome-data characteristics during query processing, allowing us to analyze real-world data sets up to five times faster than specialized analysis tools and eight times faster than a straightforward database approach.

Footnotes
1. For simplicity, we only consider mismatching bases and omit inserted or deleted bases.
2. Using the base-centric database schema, we already apply CIGAR operations to the base values of reads.
3. We have to subtract a possible offset if the index of interest is encoded within the fill word.
 
Metadata
Title: Efficiently Storing and Analyzing Genome Data in Database Systems
Authors: Sebastian Dorok, Sebastian Breß, Jens Teubner, Horstfried Läpple, Gunter Saake, Volker Markl
Publication date: 16 June 2017
Publisher: Springer Berlin Heidelberg
Published in: Datenbank-Spektrum, Issue 2/2017
Print ISSN: 1618-2162
Electronic ISSN: 1610-1995
DOI: https://doi.org/10.1007/s13222-017-0254-9
