DOI: 10.1145/1646468.1646474
research-article

Cloud technologies for bioinformatics applications

Published: 16 November 2009

ABSTRACT

Executing a large number of independent tasks, or tasks that perform minimal inter-task communication, in parallel is a common requirement in many domains. In this paper, we present our experience in applying two new Microsoft technologies, Dryad and Azure, to three bioinformatics applications. We also compare them with traditional MPI and Apache Hadoop MapReduce implementations in one example. The applications are an EST (Expressed Sequence Tag) sequence assembly program, the PhyloD statistical package for identifying HLA-associated viral evolution, and a pairwise Alu gene alignment application. We give a detailed performance discussion on a 768-core Windows HPC Server cluster and an Azure cloud. All the applications start with a "doubly data parallel step" involving independent data chosen from two similar (EST, Alu) or two different (PhyloD) databases. The final stages are structured differently in each application.
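The "doubly data parallel step" can be pictured as running one independent task for every pair of items drawn from two collections. The following is a minimal Python sketch of that pattern only, not the paper's Dryad, Azure, Hadoop, or MPI implementation. The names `toy_distance` and `all_pairs` are hypothetical, and the scoring function is a simple stand-in for a real pairwise computation such as a sequence alignment.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def toy_distance(pair):
    """Hypothetical stand-in for a real pairwise task (e.g. an alignment score)."""
    a, b = pair
    # Count mismatched positions over the shared prefix, plus the length difference.
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


def all_pairs(left, right, task, workers=4):
    """Run `task` once per (left, right) pair; no pair depends on any other."""
    pairs = list(product(left, right))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with `pairs`.
        results = list(pool.map(task, pairs))
    return dict(zip(pairs, results))


# Pairing a collection with itself mirrors the "two similar databases" case;
# passing two different collections mirrors the "two different databases" case.
seqs = ["ACGT", "ACGA", "TCG"]
scores = all_pairs(seqs, seqs, toy_distance)
```

Because every pair is independent, the same structure maps onto any task-farming runtime; only the scheduling layer (here a local thread pool, chosen purely for illustration) would change.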


Published in

MTAGS '09: Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers
November 2009, 131 pages
ISBN: 9781605587141
DOI: 10.1145/1646468

Copyright © 2009 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
