ABSTRACT
With the rapid development of high-throughput technologies in the life sciences in recent years, huge amounts of data are being generated and stored in databases. Despite significant advances in computing capacity and performance, analysing these large-scale data in search of biomedically relevant patterns remains a challenging task. Scientific workflow applications support data mining in more complex scenarios that include many data sources and computational tools, as commonly found in bioinformatics. A scientific workflow application is a holistic unit that defines, executes, and manages scientific applications using different software tools. Existing workflow applications are process- or data-oriented rather than resource-oriented. Thus, they lack efficient computational resource management capabilities, such as those provided by Cloud computing environments. Insufficient computational resources disrupt the execution of workflow applications, wasting time and money. To address this issue, advanced resource monitoring and management strategies are required to determine the resource consumption behaviours of workflow applications so that resources can be allocated and deallocated dynamically. In this paper, we present a novel Cloud resource monitoring technique and a knowledge management strategy to manage computational resources for workflow applications in order to guarantee their performance goals and their successful completion. We describe the design of these techniques, demonstrate how they can be applied to scientific workflow applications, and present first evaluation results as a proof of concept.
Index Terms
- Optimizing bioinformatics workflows for data analysis using cloud management techniques