
Future Generation Computer Systems

Volume 75, October 2017, Pages 228-238

A characterization of workflow management systems for extreme-scale applications

https://doi.org/10.1016/j.future.2017.02.026

Highlights

  • Design requirements of workflow applications and systems at the extreme scale.

  • Survey and classification of 15 popular workflow engines.

  • Research gaps between existing WMSs and the desired extreme-scale WMSs.

Abstract

Automation of the execution of computational tasks is at the heart of improving scientific productivity. In recent years, scientific workflows have been established as an important abstraction that captures the data processing and computation of large and complex scientific applications. By allowing scientists to model and express entire data processing steps and their dependencies, workflow management systems relieve scientists of the details of an application and manage its execution on a computational infrastructure. As the resource requirements of today’s computational and data science applications that process vast amounts of data keep increasing, there is a compelling case for a new generation of advances in high-performance computing, commonly termed extreme-scale computing, which will bring forth multiple challenges for the design of workflow applications and management systems. This paper presents a novel characterization of workflow management systems using features commonly associated with extreme-scale computing applications. We classify 15 popular workflow management systems in terms of workflow execution models, heterogeneous computing environments, and data access methods. The paper also surveys workflow applications and identifies gaps for future research on the road to extreme-scale workflows and management systems.

Introduction

Scientific workflows are an important abstraction for the composition of complex applications in a broad range of domains, such as astronomy, bioinformatics, climate science, and others [1]. Workflows provide automation that increases the productivity of scientists when conducting computation-based studies. Automation enables adaptation to changing application needs and resource (compute, data, network) behavior. As workflows have been adopted by a number of scientific communities, they are becoming more complex and require more sophisticated workflow management capabilities. A workflow can now analyze terabyte-scale data sets, be composed of a million individual tasks, and process data streams, files, and data placed in object stores. The computations can be single-core workloads, loosely coupled computations (like MapReduce), or tightly coupled ones (like MPI-based parallel programs), all within a single workflow, and can run on dispersed cyberinfrastructures [1], [2].

In recent years, numerous workflow management systems (WMSs) have been developed to manage the execution of diverse workflows on heterogeneous computing resources [3], [4], [5], [6], [7], [8], [9]. As user communities adopt and evolve WMSs to fit their own needs, many of the features and capabilities that were once common to most WMSs have become too distinct to share across systems. For example, Taverna [8] and Galaxy [9] support advanced graphical user interfaces for workflow composition, making them suitable for bioinformatics researchers with little programming experience. Other systems, such as DAGMan [10] and Pegasus [3], offer scalability, robustness, and planning for heterogeneous high-throughput computation execution. For a new user, choosing the right WMS can be problematic simply because there are so many different WMSs and the selection criteria may not be obvious.

To address this problem, several recent surveys [11], [12], [13], [14], [15], [16], [17], [18] have been compiled to help users compare and contrast different WMSs based on certain key properties and capabilities of WMSs. These surveys focused mostly on the characterization of the following properties: support for conditional structures (e.g., if and switch statements, while loops, etc.) [12], workflow composition (e.g., graphical user interface, command line, or web portals) [13], [14], [15], [16], workflow design (DAG or Non-DAG) [16], [17], types of parallelism (e.g., task, data, pipeline, or hybrid parallelism) [14], [16], [17], computational infrastructure (e.g., cluster, grid, and clouds) [12], [14], [15], [16], workflow scheduling (e.g., status, job queue, adaptive) [14], [15], [16], [17], [18], workflow QoS constraints (e.g., time, cost, reliability, security, etc.) [17], and fault-tolerance and workflow optimizations (e.g., task-level, workflow-level, etc.) [15], [16], [17].

Unfortunately, the above characterization properties do not sufficiently address the following question that is on the mind of many computational scientists: “Are WMSs ready to support extreme-scale applications?” We define extreme-scale applications as scientific applications that will utilize extreme-scale computing to solve vastly more accurate predictive models than before and enable the analysis of massive quantities of data [19], [20]. It is expected that the requirements of such applications will exceed the capabilities of current leading-edge high-performance computing (HPC) systems. Examples of extreme-scale applications include: first-principles understanding of the properties of fission and fusion reactions; adaptation to regional climate changes such as sea-level rise, drought and flooding, and severe weather patterns; and innovative designs for cost-effective renewable energy resources such as batteries, catalysts, and biofuels [19].

Extreme-scale computing, which includes planned US Department of Energy exascale systems [21], will bring forth multiple challenges for the design of workflow applications and management systems. The next generation of HPC architectures is shifting away from traditional homogeneous systems to much more heterogeneous ones. Due to severe energy constraints, data movement will be constrained, both internode and on/off the system, and users will be required to manage deep memory hierarchies and multi-stage storage systems [20], [22]. There will be an increased reliance on in situ data management, analysis, and visualization, occurring in parallel with the simulation [23], [24]. These in situ processing steps need to be captured to provide context and increase reproducibility.

In addition, as the scientific community prepares for extreme-scale computing, big data analytics is becoming an essential part of the scientific process for insights and discoveries [25]. As big data applications have become mainstream in recent years, new systems have been developed to handle their data processing. These include Hadoop [26], a MapReduce-based system for parallel data processing; Apache Spark [27], a system for concurrent processing of heterogeneous data streams; and Apache Storm [28], for real-time streaming data processing. Integrating big data analytics with HPC simulations is a major challenge that requires new workflow management capabilities at the extreme scale.
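
As a brief, hedged illustration of the data-parallel programming model exposed by such systems, the following PySpark sketch expresses a MapReduce-style word count; the input path is hypothetical and the example is not drawn from any of the surveyed workflows:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session and get its low-level context.
spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

# Map each line to (word, 1) pairs and reduce by key to count occurrences.
counts = (
    sc.textFile("hdfs:///data/input.txt")      # hypothetical input path
      .flatMap(lambda line: line.split())
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)
)
print(counts.take(10))
spark.stop()
```

Coupling this style of analytics with HPC simulation steps inside a single workflow is precisely the integration challenge noted above.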

In this paper we present a novel characterization of WMSs focused specifically on extreme-scale workflows, using the following properties: (1) workflow execution models, (2) heterogeneous computing environments, and (3) data access methods. Associated with each property is a set of features that can be used to classify a WMS. To evaluate these properties, we select 15 state-of-the-art WMSs based on their broad and active usage in the scientific community, as well as the fact that they have been part of previous surveys. Through a detailed analysis using available publications and other documents, such as project webpages and code manuals, we derive the classification of these WMSs using our features for extreme-scale applications. Our primary contribution in this work is the distillation of all the available information into an easy-to-use lookup table that contains a feature checklist for each WMS. This table represents a snapshot of the state-of-the-art, and we envision it to evolve and grow based on future research in WMSs.
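
To illustrate how such a feature checklist might be consulted programmatically, the sketch below uses a nested Python dictionary; the feature keys and boolean values shown are placeholders for illustration only, not the actual entries of Table 1:

```python
# Hypothetical excerpt of a feature-checklist lookup; the feature names and
# values are placeholders, not the classification reported in Table 1.
wms_features = {
    "Pegasus": {"in_situ": False, "cloud": True, "data_streams": False},
    "Taverna": {"in_situ": False, "cloud": True, "data_streams": True},
}

def supports(wms: str, feature: str) -> bool:
    """Return True if the given WMS is marked as supporting the feature."""
    return wms_features.get(wms, {}).get(feature, False)

print(supports("Pegasus", "cloud"))   # True in this placeholder table
```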

The remainder of this paper is structured as follows. Section 2 presents a background overview of WMSs in general and previous work on characterizing workflows and WMSs. Section 3 describes the two types of extreme-scale workflows that motivate this work. Section 4 presents the three properties for characterizing WMSs for extreme-scale applications, along with their associated features. Section 5 classifies 15 popular WMSs based on these features, and Section 6 describes their current usage in the scientific community. Section 7 identifies gaps and directions for future research on the road to extreme-scale workflows. Finally, Section 8 concludes the paper.

Section snippets

Scientific workflows

The term workflow refers to the automation of a process, during which data is processed by different tasks. A WMS aids in the automation of these processes by managing the data and the execution of the application on a computational infrastructure. Scientific workflows allow scientists to easily model and express all the data processing tasks and their dependencies, typically as a directed acyclic graph (DAG) whose nodes represent workflow tasks linked via dataflow edges, thus capturing the order in which tasks must execute.
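
As a minimal illustration of this abstraction (not drawn from any particular WMS; the task names and the run_task placeholder are hypothetical), a DAG-structured workflow can be represented as a mapping from each task to the tasks it depends on and executed in a dependency-respecting order:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Toy workflow: each task maps to the set of tasks whose outputs it consumes.
# The edges encode dataflow, so a task becomes runnable only when all of its
# predecessors have completed.
workflow = {
    "preprocess": set(),
    "simulate": {"preprocess"},
    "analyze": {"simulate"},
    "visualize": {"simulate", "analyze"},
}

def run_task(name):
    """Hypothetical stand-in for invoking the task's executable or function."""
    print(f"running {name}")

# One valid serial schedule; tasks whose predecessors are all done could
# equally be dispatched in parallel by a WMS.
for task in TopologicalSorter(workflow).static_order():
    run_task(task)
```

A WMS generalizes this idea by also staging the data associated with each edge and dispatching ready tasks to the underlying computational infrastructure, possibly in parallel.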

Extreme-scale workflows

As mentioned earlier, extreme-scale computing is on the horizon and will bring forth multiple challenges for the design of workflow applications and management systems. To illustrate these challenges, we present two types of extreme-scale workflows that are being actively developed in the computational science community to meet these needs.

Characterization of workflow management systems

As described in Section 2, there are a number of different ways to characterize WMSs, from the interfaces they present to users down to the types of provenance records they provide. With the growing complexity of extreme-scale applications and the ever-increasing heterogeneity of computing capabilities, the list of WMS features for computational science keeps getting longer and requires a fresh perspective to help users choose. In this paper, we focus on the features of WMSs that are most relevant to extreme-scale applications.

Classification of workflow management systems

In the past two decades, a number of WMSs have been developed to automate the computational and data management tasks and the provisioning of the needed resources, as well as to support the different execution models required by today’s scientific applications. Based on the characterization of WMSs presented in the previous section, Table 1 summarizes the features supported by the most popular WMSs, described below in alphabetical order. We chose these WMSs for classification because they are widely and actively used in the scientific community.

Applications of workflow management systems

Given that many of the anticipated extreme-scale applications will have their roots in today’s applications, it is instructive to note how WMSs are currently being used. After a thorough examination of the literature, we identified six categories of applications with notable usage of WMSs. In this section, we describe these application categories, along with specific applications from various scientific domains and the WMSs that are used to support them. We also point out the issues that may arise as these applications scale toward extreme-scale computing.

Future research challenges

Based on the characterization and classification of the state-of-the-art WMSs presented in this paper, we identify several challenges that must be addressed in order to meet the needs of extreme-scale applications. These research challenges are:

  • Data sharing for in situ workflows: An important aspect of in situ integration is the exchange of data between the simulation and the analytics. In distributed workflows, data is communicated primarily via named files. In situ data, by contrast, is often communicated via memory rather than the file system, as illustrated in the sketch below.
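
The following minimal sketch is purely illustrative and not the design of any existing in situ framework; the simulate and analyze functions are hypothetical placeholders. It contrasts in-memory data exchange with file-based exchange by handing each simulation step to an analysis process through a queue instead of writing intermediate files:

```python
import multiprocessing as mp

def simulate(queue, steps=5):
    # Hypothetical simulation loop: each step hands its output directly to the
    # analysis process via an in-memory queue, avoiding one file per step.
    for step in range(steps):
        field = [step * 0.1] * 8  # stand-in for a simulation field
        queue.put((step, field))
    queue.put(None)  # sentinel: simulation finished

def analyze(queue):
    # Hypothetical in situ analysis: consume each step as soon as it arrives.
    while (item := queue.get()) is not None:
        step, field = item
        print(f"step {step}: mean = {sum(field) / len(field):.3f}")

if __name__ == "__main__":
    q = mp.Queue()
    producer = mp.Process(target=simulate, args=(q,))
    consumer = mp.Process(target=analyze, args=(q,))
    producer.start(); consumer.start()
    producer.join(); consumer.join()
```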

Conclusion

In this paper, we presented a novel characterization of state-of-the-art workflow management systems (WMSs) and the applications they support, focused specifically on extreme-scale workflows. As scientific applications scale up in complexity and size, and as HPC architectures become more heterogeneous and energy constrained, WMSs need to evolve and become more sophisticated. To understand this better, we surveyed and classified workflow properties and management systems in terms of workflow execution models, heterogeneous computing environments, and data access methods.

Acknowledgments

This work was performed under the auspices of the US Department of Energy (DOE) by Lawrence Livermore National Laboratory (LLNL) under Contract DE-AC52-07NA27344 (LLNL-JRNL-706700). This work was partially funded by the Laboratory Directed Research and Development Program at LLNL under project 16-ERD-036; by the Scottish Informatics and Computer Science Alliance (SICSA) with the Postdoctoral and Early Career Researcher Exchanges (PECE) fellowship; and by DOE under Contract #DESC0012636.


References (85)

  • R. Filgueira et al., dispel4py: A Python framework for data-intensive scientific computing, Int. J. High Perform. Comput. Appl. (2016)
  • M. Albrecht, P. Donnelly, P. Bui, D. Thain, Makeflow: A portable abstraction for data intensive computing on clusters, ...
  • A. Jain et al., FireWorks: a dynamic workflow system designed for high-throughput applications, Concurr. Comput. Pract. Exp. (2015)
  • T. Fahringer et al., ASKALON: A development and grid computing environment for scientific workflows
  • K. Wolstencroft, R. Haines, D. Fellows, A. Williams, D. Withers, S. Owen, S. Soiland-Reyes, I. Dunlop, A. Nenadic, P. ...
  • D. Blankenberg, G.V. Kuster, N. Coraor, G. Ananda, R. Lazarus, M. Mangan, A. Nekrutenko, J. Taylor, Galaxy: a web-based ...
  • J. Frey, Condor DAGMan: Handling Inter-Job Dependencies, University of Wisconsin, Dept. of Computer Science, Tech. ...
  • C.S. Liew et al., Scientific workflows: Moving across paradigms, ACM Comput. Surv. (CSUR) (2016)
  • L. Adhianto et al., HPCToolkit: Tools for performance analysis of optimized parallel programs, Concurr. Comput. Pract. Exp. (2010)
  • E.M. Bahsi et al., Conditional workflow management: A survey and analysis, Sci. Program. (2007)
  • M. Bux, U. Leser, Parallelization in scientific workflow management systems, arXiv preprint arXiv:1303.7195 ...
  • J. Liu et al., A survey of data-intensive scientific workflow management, J. Grid Comput. (2015)
  • J. Yu et al., A taxonomy of workflow management systems for grid computing, J. Grid Comput. (2005)
  • A. Barker et al., Scientific workflow: a survey and research directions
  • The Opportunities and Challenges of Exascale Computing, ASCAC Subcommittee Report, 2010 ...
  • J. Dongarra, With extreme scale computing the rules have changed
  • Report on the ASCR Workshop on Architectures I: Exascale and Beyond: Gaps in Research, Gaps in our Thinking, 2011 ...
  • Report out from the Exascale Research Planning Workshop Working Session on Data Management, Visualization, IO and ...
  • Scientific Discovery at the Exascale: Report from the DOE ASCR 2011 Workshop on Exascale Data Management, Analysis and ...
  • K.-L. Ma, In-situ visualization at extreme scale: Challenges and opportunities, IEEE Comput. Graph. Appl. (2009)
  • D.A. Reed et al., Exascale computing and big data, Commun. ACM (2015)
  • K. Shvachko, H. Kuang, S. Radia, R. Chansler, The Hadoop distributed file system, in: Mass Storage Systems and ...
  • M. Zaharia et al., Spark: Cluster computing with working sets, HotCloud (2010)
  • Apache Storm, https://storm.incubator.apache.org ...
  • A. Spinuso, R. Filgueira, M. Atkinson, A. Gemuend, Visualisation methods for large provenance collections in ...
  • G. Juve, B. Tovar, R. Ferreira da Silva, D. Król, D. Thain, E. Deelman, W. Allcock, M. Livny, Practical resource ...
  • I. Santana-Perez, M.S. Pérez-Hernández, Towards reproducibility in scientific workflows: An infrastructure-based ...
  • D. De Roure et al., The design and realisation of the myExperiment virtual research environment for social sharing of workflows, Future Gener. Comput. Syst. (2008)
  • K. Belhajjame, J. Zhao, D. Garijo, M. Gamble, K. Hettne, R. Palma, E. Mina, O. Corcho, J.M. Gómez-Pérez, S. Bechhofer, ...
  • M.W. Berry, Scientific workload characterization by loop-based analyses, ACM SIGMETRICS Perform. Eval. Rev. (1992)
  • L. Ramakrishnan, D. Gannon, A Survey of Distributed Workflow Characteristics and Resource Requirements, Tech. Rep. ...
  • S. Ostermann et al., On the characteristics of grid workflows


Rafael Ferreira da Silva is a Research Assistant Professor in the Department of Computer Science at the University of Southern California, and a Computer Scientist in the Science Automation Technologies group at the USC Information Sciences Institute. His research focuses on the efficient execution of scientific workflows on heterogeneous distributed systems (e.g., clouds, grids, and supercomputers), computational reproducibility, and Data Science – workflow performance analysis, user behavior in HPC/HTC, and citation analysis (for publications). Dr. Ferreira da Silva received his Ph.D. in Computer Science from INSA-Lyon, France, in 2013. For more information, please visit http://www.rafaelsilva.com.

Rosa Filgueira, Ph.D., has recently joined the British Geological Survey (BGS) as a Senior Data Scientist. Previously, she worked as a Research Associate in the Data Intensive Research Group at the University of Edinburgh and as a Research and Teaching Assistant in the Computer Architecture Group at University Carlos III of Madrid. Her research expertise is in improving the scalability and performance of HPC applications, and she has contributed to several European and national projects in hazard forecasting and parallel processing. During the VERCE project she contributed to the design and optimization of dispel4py and pioneered several dispel4py applications. Currently, she is leading requirements capture for the ENVRIplus project (funded by EU Horizon 2020), which delivers common data functionality for 22 pan-European Research Infrastructures.

Ilia Pietri completed her Ph.D. degree in Computer Science at the University of Manchester, UK, in 2015. She received her B.Sc. degree in Informatics and Telecommunications and M.Sc. degree in Economics and Administration of Telecommunication Networks from the National and Kapodistrian University of Athens, Greece, in 2008 and 2010, respectively. Currently, she is a postdoctoral researcher at the University of Athens, Greece. Her research interests include resource management and cost efficiency in distributed systems, such as clouds.

Ming Jiang is a computer scientist in the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory (LLNL). His current research focuses on integrating HPC simulations with Big Data analytics. He is the principal investigator (PI) for Alkemi: A Data Analytics Approach to Improving Simulation Workflows. His research interests include data-intensive computing, flow visualization, feature detection, image processing, out-of-core techniques, and multiresolution analysis. He received his Ph.D. degree in Computer Science and Engineering from The Ohio State University (OSU) in 2005. His dissertation focused on developing a feature-based approach to visualizing and mining large-scale simulation data.

Rizos Sakellariou is a Senior Lecturer (Associate Professor) in Computer Science in the School of Computer Science, University of Manchester. His research focuses on the development of techniques for producing efficient software for large-scale computing systems that use some form of concurrency to handle large-scale data and/or computation, with efficient resource allocation and sharing being a focal point. He has published over 100 research papers, which have attracted more than 4000 Google Scholar citations, and has been involved with more than 120 international conferences and workshops. Dr. Sakellariou received his Ph.D. in Computer Science from the University of Manchester in 1997 for a thesis on loop parallelization.

Ewa Deelman is a Research Professor at the USC Computer Science Department and a Research Director at the USC Information Sciences Institute. Dr. Deelman’s research interests include the design and exploration of collaborative, distributed scientific environments, with particular emphasis on workflow management as well as the management of large amounts of data and metadata. At ISI, Dr. Deelman is leading the Pegasus project, which designs and implements workflow mapping techniques for large-scale applications running in distributed environments. Pegasus is being used today in a number of scientific disciplines, enabling researchers to formulate complex computations in a declarative way. Dr. Deelman received her Ph.D. in Computer Science from the Rensselaer Polytechnic Institute in 1997. Her thesis topic was in the area of parallel discrete event simulation, where she applied parallel programming techniques to the simulation of the spread of Lyme disease in nature.
