2018 | OriginalPaper | Chapter

Genome Compression: An Image-Based Approach

Authors: Kelvin Vieira Kredens, Juliano Vieira Martins, Osmar Betazzi Dordal, Edson Emilio Scalabrin, Roberto Hiroshi Herai, Bráulio Coelho Ávila

Published in: Artificial Intelligence and Soft Computing

Publisher: Springer International Publishing

Abstract

Next-generation sequencing technologies have reduced the cost and time of genome sequencing. As a result, the number of genomes assembled daily has increased significantly, and this growth requires more efficient techniques for storing and transmitting genomic data. In this research, we discuss lossless horizontal compression of genomic sequences using two image formats, WEBP and FLIF. The genomic sequence is transformed into a matrix of colored pixels, in which each symbol of the alphabet {A, T, C, G} is assigned an RGB color at a position (x, y). WEBP showed the best data-rate saving (76.15%, SD = 0.84) when compared to FLIF. In addition, we compared the data-rate saving of WEBP with that of two specialized genomic data compression tools, DELIMINATE and MFCompress. The results show that WEBP is close to DELIMINATE (76.03%, SD = 2.54%) and MFCompress (76.97%, SD = 1.36%). Finally, we suggest using WEBP for genomic data compression.
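The sequence-to-pixel transformation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the color palette, the padding color, the row width, and the function names are all assumptions chosen for illustration, and the data-rate saving formula is the usual definition (one minus the ratio of compressed to original size), which the abstract itself does not spell out.

```python
# Hypothetical palette: the abstract does not state which colors are used.
PALETTE = {
    "A": (255, 0, 0),
    "T": (0, 255, 0),
    "C": (0, 0, 255),
    "G": (255, 255, 0),
}
# Assumed filler color for the unused tail of the last row.
PAD = (0, 0, 0)
INVERSE = {rgb: base for base, rgb in PALETTE.items()}


def sequence_to_pixels(seq, width):
    """Lay the sequence out row by row as a matrix of RGB tuples."""
    if width <= 0:
        raise ValueError("width must be positive")
    pixels = [PALETTE[base] for base in seq.upper()]
    # Pad so that the final row has exactly `width` pixels.
    pixels += [PAD] * (-len(pixels) % width)
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


def pixels_to_sequence(matrix):
    """Invert the mapping, skipping padding pixels, to recover the sequence."""
    return "".join(INVERSE[p] for row in matrix for p in row if p != PAD)


def data_rate_saving(original_bytes, compressed_bytes):
    """Percentage of the original size removed by compression (assumed definition)."""
    return (1 - compressed_bytes / original_bytes) * 100


matrix = sequence_to_pixels("ACGTAC", width=4)
restored = pixels_to_sequence(matrix)
```

Because every symbol maps to a distinct color and the padding color is not in the palette, the round trip is lossless as long as the image format stores the pixels exactly. Writing the matrix to a lossless WEBP or FLIF file would require an image library and is not shown here.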

Metadata
Title
Genome Compression: An Image-Based Approach
Authors
Kelvin Vieira Kredens
Juliano Vieira Martins
Osmar Betazzi Dordal
Edson Emilio Scalabrin
Roberto Hiroshi Herai
Bráulio Coelho Ávila
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-91262-2_22
