ABSTRACT
Complex and large-scale applications in different scientific disciplines are often represented as a set of tasks, known as a workflow. Many scientific workflows have intensive resource requirements, so different distributed platforms, including campus clusters, grids, and clouds, are used for their efficient execution. In this paper we examine the performance and cost of running the Pegasus Workflow Management System (Pegasus WMS) implementation of blast2cap3, a protein-guided assembly approach, on three different execution platforms: Sandhills, the University of Nebraska Campus Cluster; the academic grid Open Science Grid (OSG); and the commercial cloud Amazon EC2. Furthermore, the behavior of the blast2cap3 workflow was tested with different numbers of tasks. For each workflow configuration and execution platform, we performed multiple runs in order to compare the total workflow running time, as well as resource availability over time. Additionally, for the most interesting runs, we analyzed the number of running versus idle jobs over time on each platform. The experiments show that splitting the Pegasus WMS implementation of blast2cap3 into more than 100 tasks significantly reduces the running time on all execution platforms. In general, for our workflow, better performance and resource usage were achieved when Amazon EC2 was used as the execution platform. However, given the cost of Amazon EC2, academic distributed systems can be a good alternative and deliver excellent performance, especially when plenty of resources are available.
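The key experimental variable above is task granularity: the input contig set is partitioned into N independent tasks that Pegasus WMS schedules across the platform. As a hypothetical sketch (not the authors' code, and with invented names such as `partition` and `contig_*`), the splitting step can be illustrated in Python:

```python
# Hypothetical illustration of task-granularity splitting in a
# blast2cap3-style workflow: the contig set is divided into N near-equal
# chunks, one chunk per workflow task. Varying N is what produced the
# running-time differences reported in the abstract.

def partition(contigs, n_tasks):
    """Split a list of contig IDs into n_tasks near-equal chunks."""
    if n_tasks < 1:
        raise ValueError("n_tasks must be >= 1")
    size, rem = divmod(len(contigs), n_tasks)
    chunks, start = [], 0
    for i in range(n_tasks):
        # The first `rem` chunks absorb one extra contig each.
        end = start + size + (1 if i < rem else 0)
        chunks.append(contigs[start:end])
        start = end
    return chunks

contigs = [f"contig_{i}" for i in range(1000)]
tasks = partition(contigs, 100)  # the paper observed gains above ~100 tasks
print(len(tasks), len(tasks[0]))  # prints: 100 10
```

In the actual system, each chunk would become one job in the Pegasus workflow description; the trade-off is that more tasks expose more parallelism but add per-job scheduling overhead, which is why the observed benefit appears only beyond a certain task count.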