Datenbank-Spektrum

16 June 2017 | Technical article

Efficiently Storing and Analyzing Genome Data in Database Systems

Authors: Sebastian Dorok, Sebastian Breß, Jens Teubner, Horstfried Läpple, Gunter Saake, Volker Markl

Published in: Datenbank-Spektrum | Issue 2/2017

Abstract

Genome analysis enables researchers to detect mutations within genomes and deduce their consequences. Researchers need reliable analysis platforms to ensure reproducible and comprehensive analysis results. Database systems provide vital support for implementing the required sustainable procedures. Nevertheless, they are not used throughout the complete genome-analysis process, because (1) database systems suffer from high storage overhead for genome data and (2) they introduce overhead during domain-specific analysis. To overcome these limitations, we integrate genome-specific compression into database systems using a specialized database schema. This reduces the storage consumption of a database approach by up to 35%. Moreover, we exploit genome-data characteristics during query processing, which allows us to analyze real-world data sets up to five times faster than specialized analysis tools and eight times faster than a straightforward database approach.

For simplicity, we only consider mismatching bases and omit inserted or deleted bases.

Using the base-centric database schema, we already apply CIGAR operations to the base values of reads.

We have to subtract a possible offset if the index of interest is encoded within the fill word.

The data is available at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/
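
The first note above restricts attention to mismatching bases. This page does not reproduce the passage that note belongs to, so the following is only a minimal Python sketch of the general idea of reference-based storage under that simplification, not the authors' implementation. The function names `encode_mismatches` and `decode_read`, the toy sequences, and the assumption that a read aligns to the reference without insertions or deletions are all hypothetical.

```python
from typing import List, Tuple

Mismatch = Tuple[int, str]  # (offset within the read, base observed in the read)


def encode_mismatches(reference: str, read: str, position: int) -> List[Mismatch]:
    """Keep only the bases of `read` that differ from the reference.

    Hypothetical helper: assumes the read aligns at `position` with no
    inserted or deleted bases, so read and reference window have equal length.
    """
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    return [(i, b) for i, (r, b) in enumerate(zip(window, read)) if r != b]


def decode_read(reference: str, position: int, length: int,
                mismatches: List[Mismatch]) -> str:
    """Rebuild the read from the reference window plus the stored mismatches."""
    bases = list(reference[position:position + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


# Toy data (made up for illustration only).
reference = "ACGTACGTAC"
read = "GTTCG"  # aligned at position 2, where the reference reads "GTACG"

mismatches = encode_mismatches(reference, read, 2)
restored = decode_read(reference, 2, len(read), mismatches)
```

In this toy case only one `(offset, base)` pair has to be stored instead of five bases, and the read can be rebuilt exactly from the reference. Handling inserted or deleted bases would need additional bookkeeping that this sketch deliberately leaves out.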

Title: Efficiently Storing and Analyzing Genome Data in Database Systems
Authors: Sebastian Dorok, Sebastian Breß, Jens Teubner, Horstfried Läpple, Gunter Saake, Volker Markl
Publication date: 16 June 2017
Publisher: Springer Berlin Heidelberg
Published in: Datenbank-Spektrum / Issue 2/2017
Print ISSN: 1618-2162
Electronic ISSN: 1610-1995
DOI: https://doi.org/10.1007/s13222-017-0254-9
