nach oben

The Journal of Supercomputing

Erschienen in:

08.08.2016

An effective extension of the applicability of alignment-free biological sequence comparison algorithms with Hadoop

verfasst von: Giuseppe Cattaneo, Umberto Ferraro Petrillo, Raffaele Giancarlo, Gianluca Roscigno

Erschienen in: The Journal of Supercomputing | Ausgabe 4/2017

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Alignment-free methods are one of the mainstays of biological sequence comparison, i.e., the assessment of how similar two biological sequences are to each other, a fundamental and routine task in computational biology and bioinformatics. They have gained popularity since, even on standard desktop machines, they are faster than methods based on alignments. However, with the advent of Next-Generation Sequencing Technologies, datasets whose size, i.e., number of sequences and their total length, is a challenge to the execution of alignment-free methods on those standard machines are quite common. Here, we propose the first paradigm for the computation of k-mer-based alignment-free methods for Apache Hadoop that extends the problem sizes that can be processed with respect to a standard sequential machine while also granting a good time performance. Technically, as opposed to a standard Hadoop implementation, its effectiveness is achieved thanks to the incremental management of a persistent hash table during the map phase, a task not contemplated by the basic Hadoop functions and that can be useful also in other contexts.

Vorheriger Artikel Intelligent and independent processes for overcoming big graphs

Nächster Artikel A flexible and high-performance data center network topology

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Allen F, Almasi G, Andreoni W, Beece D, Berne BJ, Bright A, Brunheroto J, Cascaval C, Castanos J, Coteus P et al (2001) Blue Gene: a vision for protein science using a petaflop supercomputer. IBM Syst J 40(2):310–327CrossRef

Apostolico A, Giancarlo R (1998) Sequence alignment in molecular biology. J Comput Biol 5(2):173–196CrossRefMATH

Audano P, Vannberg F (2014) KAnalyze: a fast versatile pipelined k-mer toolkit. Bioinformatics 30(14):2070–2072CrossRef

Boden M, Schöneich M, Horwege S, Lindner S, Leimeister C, Morgenstern B (2013) Alignment-free sequence comparison with spaced k-mers. OASIcs-OpenAccess Series in Informatics, Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik 34:24–34MATH

Cattaneo G, Roscigno G, Ferraro Petrillo U (2014) A scalable approach to source camera identification over Hadoop. In: 28th IEEE International Conference on Advanced Information Networking and Applications (AINA), IEEE, pp 366–373

Cattaneo G, Ferraro Petrillo U, Giancarlo R, Roscigno G (2015) Alignment-free sequence comparison over Hadoop for computational biology. In: 44rd International Conference on Parallel Processing Workshops (ICCPW 2015), IEEE, pp 1–9

Chan CX, Bernard G, Poirion O, Hogan JM, Ragan MA (2014) Inferring phylogenies of evolving sequences without multiple sequence alignment. Sci Reports 4:6504CrossRef

Chor B, Horn D, Goldman N, Levy Y, Massingham T et al (2009) Genomic DNA k-mer spectra: models and modalities. Genome Biol 10(10):R108CrossRef

Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. Operating Systems Design and Implementation (OSDI) pp 137–150

10.

Deorowicz S, Kokot M, Grabowski S, Debudaj-Grabysz A (2015) KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31(10):1569–1576CrossRef

11.

Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, New YorkCrossRefMATH

12.

Ekanayake J, Pallickara S, Fox G (2008) MapReduce for data intensive scientific analyses. In: 2008 IEEE Fourth International Conference on eScience, pp 277–284

13.

Elsayed T, Lin J, Oard DW (2008) Pairwise document similarity in large collections with MapReduce. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies, pp 265–268

14.

Fan H, Ives AR, Surget-Groba Y, Cannon CH (2015) An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genom 16(1):1–18CrossRef

15.

Ferragina P, Giancarlo R, Greco V, Manzini G, Valiente G (2007) Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment. BMC Bioinform 8:252CrossRef

16.

Giancarlo R, Scaturro D, Utro F (2009) Textual data compression in computational biology: a synopsis. Bioinformatics 25(13):1575–1586CrossRefMATH

17.

Giancarlo R, Lo Bosco G, Pinello L, Utro F (2013) A methodology to assess the intrinsic discriminative ability of a distance function and its interplay with clustering algorithms for microarray data analysis. BMC Bioinform 14(1):1–14CrossRef

18.

Giancarlo R, Rombo SE, Utro F (2014) Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies. Briefings Bioinform 15(3):390–406CrossRef

19.

Greco V, Giancarlo R (2007) Grid-K: A cometa VO service for compression-based classification of biological sequences and structures. Symposium GRID Open Days at the University of Palermo, Italy pp 87–93

20.

Gunarathne T, Wu TL, Qiu J, Fox G (2010) MapReduce in the clouds for science. In: 2010 IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom), IEEE, pp 565–572

21.

Gusfield D (1997) Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, New YorkCrossRefMATH

22.

Haubold B (2014) Alignment-free phylogenetics and population genetics. Briefings Bioinform 15(3):407–418CrossRef

23.

Horwege S, Lindner S, Boden M, Hatje K, Kollmar M, Leimeister CA, Morgenstern B (2014) Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches. Nucleic Acids Res 42(W1):7–11CrossRef

24.

Huang K, Brady A, Mahurkar A, White O, Gevers D, Huttenhower C, Segata N (2013) MetaRef: a pan-genomic database for comparative and community microbial genomics

25.

Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86CrossRefMATHMathSciNet

26.

Leimeister CA, Boden M, Horwege S, Lindner S, Morgenstern B (2014) Fast alignment-free sequence comparison using spaced-word frequencies. Bioinformatics 30(14):1991–1999CrossRefMATH

27.

Li KB (2003) ClustalW-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics 19(12):1585–1586CrossRef

28.

Lloyd S, Snell Q (2011) Accelerated large-scale multiple sequence alignment. BMC Bioinform 12:466CrossRef

29.

Marçais G, Kingsford C (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27(6):764–770CrossRef

30.

Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KH, Remington KA et al (2000) A whole-genome assembly of drosophila. Science 287(5461):2196–2204CrossRef

31.

Nordberg H, Bhatia K, Wang K, Wang Z (2013) BioPig: a Hadoop-based analytic toolkit for large-scale sequence data. Bioinformatics 29(23):3014–3019CrossRef

32.

Schatz MC (2009) Cloudburst: highly sensitive read mapping with MapReduce. Bioinformatics 25(11):1363–1369CrossRef

33.

Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), IEEE, pp 1–10

34.

Sims GE, Kim SH (2011) Whole-genome phylogeny of Escherichia coli/Shigella group by feature frequency profiles (FFPs). Proceed Nat Acad Sci 108(20):8329–8334CrossRef

35.

Talia D, Trunfio P, Marozzo F (2015) Data analysis in the cloud: models, techniques and applications, 1st edn. Elsevier Science Publishers B. V, Amsterdam

36.

Taylor RC (2010) An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinform 11(Suppl 12):1–6CrossRef

37.

Torney DC, Burks C, Davison D, Sirotkin KM (1990) Computation of d2: a measure of sequence dissimilarity. In: Computers and DNA: the proceedings of the Interface between Computation Science and Nucleic Acid Sequencing Workshop, Redwood City, Calif.: Addison-Wesley Pub. Co

38.

Vinga S (2014) Editorial: alignment-free methods in computational biology. Brief Bioinform 15(3):341–342CrossRef

39.

Vinga S, Almeida J (2003) Alignment-free sequence comparison-a review. Bioinformatics 19:513–523CrossRef

40.

Vouzis PD, Sahinidis NV (2010) GPU-BLAST: Using graphics processors to accelerate protein sequence alignment. Bioinformatics

41.

Warnke J, Pawaskar S, Ali H (2012) An energy-aware Bioinformatics application for assembling short reads in high performance computing systems. In: 2012 International Conference onHigh Performance Computing and Simulation (HPCS), pp 154–160

42.

Wong AK, You M (1985) Entropy and distance of random graphs with application to structural pattern recognition. IEEE Trans Patt Anal Mach Intel 7(5):599–609CrossRefMATH

43.

Yang K, Zhang L (2008) Performance comparison between k-tuple distance and four model-based distances in phylogenetic tree reconstruction. Nucl Acids Res 36(5):1–9CrossRef

Titel: An effective extension of the applicability of alignment-free biological sequence comparison algorithms with Hadoop
verfasst von: Giuseppe Cattaneo
Umberto Ferraro Petrillo
Raffaele Giancarlo
Gianluca Roscigno
Publikationsdatum: 08.08.2016
Verlag: Springer US
Erschienen in: The Journal of Supercomputing / Ausgabe 4/2017
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-016-1835-3

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Springer Professional "Technik"

Springer Professional "Wirtschaft+Technik"

Weitere Artikel der Ausgabe 4/2017

Towards designing a green data center farm for Internet services: Iran’s case study

Embedding cycles and paths on solid grid graphs

Load dispersion-aware VM placement in favor of energy-performance tradeoff

Microring fault-resilient photonic network-on-chip for reliable high-performance many-core systems

Energy-efficient NoC with multi-granularity power optimization

A flexible and high-performance data center network topology