DOI: 10.1145/2110497.2110503
research-article

Optimizing bioinformatics workflows for data analysis using cloud management techniques

Published: 14 November 2011

ABSTRACT

With the rapid development of high-throughput technologies in the life sciences in recent years, huge amounts of data are being generated and stored in databases. Despite significant advances in computing capacity and performance, analyzing these large-scale data in search of biomedically relevant patterns remains a challenging task. Scientific workflow applications support data mining in more complex scenarios that include many data sources and computational tools, as commonly found in bioinformatics. A scientific workflow application is a holistic unit that defines, executes, and manages scientific applications using different software tools. Existing workflow applications are process- or data-oriented rather than resource-oriented. Thus, they lack efficient computational resource management capabilities, such as those provided by Cloud computing environments. Insufficient computational resources disrupt the execution of workflow applications, wasting time and money. To address this issue, advanced resource monitoring and management strategies are required to determine the resource consumption behaviours of workflow applications so that resources can be allocated and deallocated dynamically. In this paper, we present a novel Cloud resource monitoring technique and a knowledge management strategy that manage computational resources for workflow applications in order to guarantee their performance goals and successful completion. We present the design of these techniques, demonstrate how they can be applied to scientific workflow applications, and present first evaluation results as a proof of concept.


Published in

WORKS '11: Proceedings of the 6th workshop on Workflows in support of large-scale science
November 2011
154 pages
ISBN: 9781450311007
DOI: 10.1145/2110497

Copyright © 2011 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 30 of 54 submissions, 56%
