Skip to main content
Erschienen in: Cluster Computing 3/2017

26.04.2017

Optimization of SAMtools sorting using OpenMP tasks

verfasst von: Nathan T. Weeks, Glenn R. Luecke

Erschienen in: Cluster Computing | Ausgabe 3/2017

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

SAMtools is a widely-used genomics application for post-processing high-throughput sequence alignment data. Such sequence alignment data are commonly sorted to make downstream analysis more efficient. However, this sorting process itself can be computationally- and I/O-intensive: high-throughput sequence alignment files in the de facto standard binary alignment/map (BAM) format can be many gigabytes in size, and may need to be decompressed before sorting and compressed afterwards. As a result, BAM-file sorting can be a bottleneck in genomics workflows. This paper describes a case study on the performance analysis and optimization of SAMtools for sorting large BAM files. OpenMP task parallelism and memory optimization techniques resulted in a speedup of 5.9X versus the upstream SAMtools 1.3.1 for an internal (in-memory) sort of 24.6 GiB of compressed BAM data (102.6 GiB uncompressed) with 32 processor cores, while a 1.98X speedup was achieved for an external (out-of-core) sort of a 271.4 GiB BAM file.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Fußnoten
1
Source code for SAMtools optimizations available at https://​doi.​org/​10.​5281/​zenodo.​262169, and HTSlib optimizations at https://​doi.​org/​10.​5281/​zenodo.​262161
 
6
Note that taskyield is a no-op as of gcc 6.2.0
 
7
This check could occur after task generation and before returning from the routine; however, the implementation did not consistently perform as well in practice, possibly due to an undetermined effect on task scheduling.
 
Literatur
1.
Zurück zum Zitat Adhianto, L., Banerjee, S., Fagan, M., Krentel, M., Marin, G., Mellor-Crummey, J., Tallent, N.R.: HPCTOOLKIT: tools for performance analysis of optimized parallel programs. Concurr. Comput. Pract. Exp. 22(6), 685–701 (2010). doi:10.1002/cpe.1553 Adhianto, L., Banerjee, S., Fagan, M., Krentel, M., Marin, G., Mellor-Crummey, J., Tallent, N.R.: HPCTOOLKIT: tools for performance analysis of optimized parallel programs. Concurr. Comput. Pract. Exp. 22(6), 685–701 (2010). doi:10.​1002/​cpe.​1553
6.
Zurück zum Zitat Diekmann, R., Gehring, J., Luling, R., Monien, B., Nubel, M., Wanka, R.: Sorting large data sets on a massively parallel system. In: Proceedings of 1994 6th IEEE Symposium on Parallel and Distributed Processing, pp. 2–9 (1994). 10.1109/SPDP.1994.346188 Diekmann, R., Gehring, J., Luling, R., Monien, B., Nubel, M., Wanka, R.: Sorting large data sets on a massively parallel system. In: Proceedings of 1994 6th IEEE Symposium on Parallel and Distributed Processing, pp. 2–9 (1994). 10.​1109/​SPDP.​1994.​346188
10.
Zurück zum Zitat Kelly, B.J., Fitch, J.R., Hu, Y., Corsmeier, D.J., Zhong, H., Wetzel, A.N., Nordquist, R.D., Newsom, D.L., White, P.: Churchill: an ultra-fast, deterministic, highly scalable and balanced parallelization strategy for the discovery of human genetic variation in clinical and population-scale genomics. Genome Biol. 16(1), 6 (2015). doi:10.1186/s13059-014-0577-x CrossRef Kelly, B.J., Fitch, J.R., Hu, Y., Corsmeier, D.J., Zhong, H., Wetzel, A.N., Nordquist, R.D., Newsom, D.L., White, P.: Churchill: an ultra-fast, deterministic, highly scalable and balanced parallelization strategy for the discovery of human genetic variation in clinical and population-scale genomics. Genome Biol. 16(1), 6 (2015). doi:10.​1186/​s13059-014-0577-x CrossRef
11.
Zurück zum Zitat Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., Subgroup, G.P.D.P.: The sequence alignment/map format and SAMtools. Bioinformatics 25(16), 2078–2079 (2009). doi:10.1093/bioinformatics/btp352 CrossRef Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., Subgroup, G.P.D.P.: The sequence alignment/map format and SAMtools. Bioinformatics 25(16), 2078–2079 (2009). doi:10.​1093/​bioinformatics/​btp352 CrossRef
13.
16.
Zurück zum Zitat Puckelwartz, M.J., Pesce, L.L., Nelakuditi, V., Dellefave-Castillo, L., Golbus, J.R., Day, S.M., Cappola, T.P., Dorn II, G.W., Foster, I.T., McNally, E.M.: Supercomputing for the parallelization of whole genome analysis. Bioinformatics 30(11), 1508 (2014). doi:10.1093/bioinformatics/btu071 CrossRef Puckelwartz, M.J., Pesce, L.L., Nelakuditi, V., Dellefave-Castillo, L., Golbus, J.R., Day, S.M., Cappola, T.P., Dorn II, G.W., Foster, I.T., McNally, E.M.: Supercomputing for the parallelization of whole genome analysis. Bioinformatics 30(11), 1508 (2014). doi:10.​1093/​bioinformatics/​btu071 CrossRef
17.
Zurück zum Zitat Raczy, C., Petrovski, R., Saunders, C.T., Chorny, I., Kruglyak, S., Margulies, E.H., Chuang, H.Y., Kllberg, M., Kumar, S.A., Liao, A., Little, K.M., Strmberg, M.P., Tanner, S.W.: Isaac: ultra-fast whole-genome secondary analysis on illumina sequencing platforms. Bioinformatics 29(16), 2041 (2013). doi:10.1093/bioinformatics/btt314 CrossRef Raczy, C., Petrovski, R., Saunders, C.T., Chorny, I., Kruglyak, S., Margulies, E.H., Chuang, H.Y., Kllberg, M., Kumar, S.A., Liao, A., Little, K.M., Strmberg, M.P., Tanner, S.W.: Isaac: ultra-fast whole-genome secondary analysis on illumina sequencing platforms. Bioinformatics 29(16), 2041 (2013). doi:10.​1093/​bioinformatics/​btt314 CrossRef
19.
22.
Zurück zum Zitat Weeks, N.T., Luecke, G.R.: Performance analysis and optimization of SAMtools sorting. In: 4th International Workshop on Parallelism in Bioinformatics (PBio2016) (in press) Weeks, N.T., Luecke, G.R.: Performance analysis and optimization of SAMtools sorting. In: 4th International Workshop on Parallelism in Bioinformatics (PBio2016) (in press)
Metadaten
Titel
Optimization of SAMtools sorting using OpenMP tasks
verfasst von
Nathan T. Weeks
Glenn R. Luecke
Publikationsdatum
26.04.2017
Verlag
Springer US
Erschienen in
Cluster Computing / Ausgabe 3/2017
Print ISSN: 1386-7857
Elektronische ISSN: 1573-7543
DOI
https://doi.org/10.1007/s10586-017-0874-8

Weitere Artikel der Ausgabe 3/2017

Cluster Computing 3/2017 Zur Ausgabe