Cloud technologies for bioinformatics applications

Authors:
Xiaohong Qiu, Indiana University, Bloomington, IN
Jaliya Ekanayake, Indiana University, Bloomington, IN
Scott Beason, Indiana University, Bloomington, IN
Thilina Gunarathne, Indiana University, Bloomington, IN
Geoffrey Fox, Indiana University, Bloomington, IN
Roger Barga, Microsoft Research, Microsoft Corporation, Redmond, WA
Dennis Gannon, Microsoft Research, Microsoft Corporation, Redmond, WA

MTAGS '09: Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers
November 2009, Article No.: 6, Pages 1–10
https://doi.org/10.1145/1646468.1646474

Published: 16 November 2009

ABSTRACT

Executing a large number of independent tasks, or tasks that perform minimal inter-task communication, in parallel is a common requirement in many domains. In this paper, we present our experience in applying two new Microsoft technologies, Dryad and Azure, to three bioinformatics applications. In one example, we also compare with traditional MPI and Apache Hadoop MapReduce implementations. The applications are an EST (Expressed Sequence Tag) sequence assembly program, the PhyloD statistical package for identifying HLA-associated viral evolution, and a pairwise Alu gene alignment application. We give a detailed performance discussion on a 768-core Windows HPC Server cluster and an Azure cloud. All the applications start with a "doubly data parallel step" involving independent data chosen from two similar databases (EST, Alu) or two different databases (PhyloD). The structure of the final stages differs in each application.
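
The "doubly data parallel step" mentioned in the abstract can be pictured as an all-pairs computation that is split into blocks, each of which runs as an independent task. The following is a minimal Python sketch of that idea, not the authors' implementation. It assumes a toy mismatch-fraction distance in place of a real alignment score, and a local thread pool in place of Dryad, Hadoop, or MPI workers. The names `toy_distance`, `block_ranges`, `compute_block`, and `pairwise_matrix` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def toy_distance(a: str, b: str) -> float:
    """Fraction of mismatched positions; an assumed stand-in for an alignment score."""
    length = max(len(a), len(b))
    if length == 0:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return mismatches / length


def block_ranges(n: int, block: int):
    """Split range(n) into contiguous (start, stop) chunks of at most `block` items."""
    return [(s, min(s + block, n)) for s in range(0, n, block)]


def compute_block(rows, cols, row_range, col_range):
    """One independent task: all distances between a row chunk and a column chunk."""
    (r0, r1), (c0, c1) = row_range, col_range
    return [(i, j, toy_distance(rows[i], cols[j]))
            for i in range(r0, r1) for j in range(c0, c1)]


def pairwise_matrix(rows, cols, block=2, workers=4):
    """Fill a len(rows) x len(cols) matrix by running the blocks as independent tasks."""
    matrix = [[0.0] * len(cols) for _ in rows]
    tasks = list(product(block_ranges(len(rows), block),
                         block_ranges(len(cols), block)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(compute_block, rows, cols, rr, cr) for rr, cr in tasks]
        for future in futures:
            # Blocks never overlap, so results can be merged in any order.
            for i, j, d in future.result():
                matrix[i][j] = d
    return matrix


if __name__ == "__main__":
    seqs = ["ACGT", "ACGA", "TCGA"]
    for row in pairwise_matrix(seqs, seqs):
        print(row)
```

Passing the same list for `rows` and `cols` loosely mirrors the case of two similar databases, while passing two different lists mirrors the case of two different databases. Because no block depends on another, the tasks need no inter-task communication until the results are merged.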


Published in
MTAGS '09: Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers
November 2009, 131 pages
ISBN: 9781605587141
DOI: 10.1145/1646468
Conference Chairs: Ioan Raicu (Northwestern University), Ian Foster (University of Chicago & Argonne National Laboratory), Yong Zhao (Microsoft)
Copyright © 2009 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags: Dryad, Hadoop, MPI, bioinformatics, cloud, multicore
Qualifiers: research-article