ABSTRACT
Executing a large number of independent tasks, or tasks that perform minimal inter-task communication, in parallel is a common requirement in many domains. In this paper, we present our experience in applying two new Microsoft technologies, Dryad and Azure, to three bioinformatics applications. We also compare them with traditional MPI and Apache Hadoop MapReduce implementations in one example. The applications are an EST (Expressed Sequence Tag) sequence assembly program, the PhyloD statistical package for identifying HLA-associated viral evolution, and a pairwise Alu gene alignment application. We give a detailed performance discussion on a 768-core Windows HPC Server cluster and an Azure cloud. All the applications start with a "doubly data parallel step" involving independent data chosen from two similar (EST, Alu) or two different (PhyloD) databases. The final stages are structured differently in each application.
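As a rough illustration of what a "doubly data parallel step" can look like, the sketch below scores every pair drawn from two input sets as an independent task. It is not the paper's implementation: `score_pair` and `doubly_data_parallel` are hypothetical names, the position-match count is a toy stand-in for a real alignment, and a thread pool is used only for brevity where the paper's setting would use a framework such as Dryad, Hadoop, or MPI.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def score_pair(pair):
    """Hypothetical stand-in for an expensive, independent pairwise task.

    Counts matching positions between two sequences; a real application
    would run a proper alignment here instead.
    """
    a, b = pair
    return sum(1 for x, y in zip(a, b) if x == y)


def doubly_data_parallel(set_a, set_b, workers=4):
    """Score every (a, b) pair independently; return a len(set_a) x len(set_b) grid."""
    pairs = list(product(set_a, set_b))  # one task per element of the cross product
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flat = list(pool.map(score_pair, pairs))  # results keep the order of `pairs`
    width = len(set_b)
    return [flat[i * width:(i + 1) * width] for i in range(len(set_a))]


# Same set on both sides, as in the EST and Alu cases described above.
sequences = ["ACGT", "ACGA", "TTGT"]
scores = doubly_data_parallel(sequences, sequences)
```

Passing two different collections instead corresponds to the case where the data come from two different databases. Because no task depends on another, the pairs can be distributed across workers in any order.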
Index Terms
- Cloud technologies for bioinformatics applications