A novel general framework for automatic and cost-effective handling of recoverable temporal violations in scientific workflow systems

https://doi.org/10.1016/j.jss.2010.10.027

Abstract

Due to the complex nature of scientific workflow environments, temporal violations often take place and may severely reduce the timeliness of the execution's results. To handle temporal violations in an automatic and cost-effective fashion, two interdependent fundamental issues viz. the definition of fine-grained recoverable temporal violations and the design of light-weight effective exception handling strategies need to be resolved. However, most existing works study them separately without defining a comprehensive framework. To address such a problem, with the probability based temporal consistency model which defines the range of recoverable temporal violations, a novel general automatic and cost-effective exception handling framework is proposed in this paper where fine-grained temporal violations are defined based on the empirical function for the capability lower bounds of the exception handling strategies. To serve as a representative case study, a concrete example exception handling framework which consists of three levels of fine-grained temporal violations and their corresponding exception handling strategies is presented. The effectiveness of the example framework is evaluated by large scale simulation experiments conducted in the SwinDeW-G scientific grid workflow system. The experimental results demonstrate that the example framework can significantly reduce the overall average violation rates of local temporal constraints and global temporal constraints to 0.127% and 0.167% respectively.

Introduction

Scientific workflow systems are a type of workflow management system that supports complex scientific processes in many e-science applications such as climate modelling, disaster recovery simulation, astrophysics and high energy physics (Deelman et al., 2008, Taylor et al., 2007). Scientific workflow systems can also be seen as a type of high-level middleware service for high performance computing infrastructures such as cluster, grid, peer-to-peer (p2p) or cloud computing (Buyya et al., 2009, Foster and Kesselman, 2004, Kim et al., 2007, Yang et al., 2007). In recent years, due to the growing demand for high performance computing infrastructures and large scale distributed and collaborative e-science applications, scientific workflow systems have been attracting increasing interest from distributed and parallel system researchers in the area of High Performance Computing (HPGC, 2009, PDSEC, 2009) and software engineering researchers in the area of Software Engineering for Computational Science and Engineering (Chen and Yang, in press, SECES, 2008). One of the common research issues is how to deliver satisfactory workflow QoS (quality of service), i.e. how to satisfy workflow QoS constraints such as the constraints on time, cost, fidelity, reliability and security (Son and Kim, 2001, Yu and Buyya, 2005). Among them, time is one of the basic measurements for system and software performance and hence attracts many researchers in the workflow area (van der Aalst et al., 2000, Chen and Yang, 2008, Duan et al., 2009, Eder et al., 1999, Li et al., 2004, Yu and Buyya, 2005, Zhuge et al., 2001).

In reality, a scientific workflow and its workflow segments are normally subject to specific temporal constraints such as global temporal constraints (deadlines) for workflow instances, and local temporal constraints (milestones) for workflow segments, in order to achieve predefined scientific goals on schedule (Li et al., 2004, Zeng et al., 2008). Otherwise, the timeliness of its execution results will deteriorate significantly. For example, a daily weather forecast scientific workflow has to be finished before the broadcasting of the weather forecast programme every day at, for instance, 6:00 pm. Meanwhile, given the large number of data and computation intensive activities for scientific investigation purposes, scientific workflows are usually deployed on distributed high performance infrastructures such as grid and cloud. Therefore, to deliver satisfactory temporal QoS, the violations of both local temporal constraints (or local violations for short) and global temporal constraints (or global violations for short) need to be proactively detected and handled (Zhuge et al., 2001). Recent studies on temporal verification in scientific workflows mainly focus on runtime checkpoint selection (Chen and Yang, in press) and multiple-state based temporal verification (Chen and Yang, 2007), which deal with the monitoring of temporal consistency states and the detection of potential temporal violations. However, a significant follow-up issue is how to handle those temporal violations. To date, work on this issue is still in its infancy, yet it must be properly addressed so as to guarantee high success rates for on-time completion of scientific workflows. Specifically, two fundamental requirements for handling temporal violations, automation and cost-effectiveness, need to be considered.

  • (1) Automation. Due to the complex nature of scientific applications and their distributed running environments such as grid and cloud, a large number of temporal violations may often be expected in scientific workflows. Besides, since scientific workflow systems are designed to be highly automatic in order to conduct large scale scientific processes, human interventions, which are normally of low efficiency, should be avoided as much as possible, especially during workflow runtime (Deelman et al., 2008). Therefore, similar to dynamic checkpoint selection and temporal verification strategies (Chen and Yang, in press), handling strategies are required to automatically tackle a large number of temporal violations and relieve users from the heavy workload of handling those exceptions.

  • (2) Cost-effectiveness. The purpose of handling temporal violations is to reduce, or ideally remove, the delays of workflow execution by exception handling strategies at the expense of additional cost, which consists of both monetary cost and time overheads. Conventional exception handling strategies for temporal violations, such as resource recruitment and workflow restructure, are usually very expensive (Buhr and Mok, 2000, Hagen and Alonso, 2000, Prodan and Fahringer, 2008, Russell et al., 2006a). The cost for recruiting new resources (e.g. the cost for service discovery and deployment, the cost for data storage and transfer) is normally very large during workflow runtime in distributed computing environments (Prodan and Fahringer, 2008). As for workflow restructure, it is usually realised by the amendment of local workflow segments or temporal QoS contracts, i.e. modifying scientific workflow specifications by human decision makers (Liu et al., 2008b). However, due to budget (i.e. monetary cost) limits and temporal constraints, these heavy-weight strategies (with large monetary cost and/or time overheads) are usually too costly to be practical. To avoid these heavy-weight strategies, recoverable violations (in comparison to severe temporal violations which can be regarded as non-recoverable in practice) need to be identified first and then handled by light-weight strategies (with small monetary cost and/or time overheads) in a cost-effective fashion.

Given the requirement of Automation, exception handling strategies need to be designed to handle temporal violations in an automatic fashion without human interventions. Meanwhile, since most strategies have their limits in the capability of recovering temporal violations, different handling strategies are normally only effective for a range of temporal violations with a limited amount of time deficits (the time delays given specific temporal constraints). Given the requirement of Cost-effectiveness, for all the candidate strategies which are capable of handling the current temporal violation, ideally, only the one with the lowest cost should be applied. Therefore, the definition of fine-grained temporal violations and the design of exception handling strategies should be investigated as two interdependent tasks within the same exception handling framework. However, since recent studies in temporal verification mainly focus on the detection of temporal violations, fine-grained temporal violations are usually defined for general application purposes, ignoring the performance of exception handling strategies in the specific workflow systems. For example, the work in Chen and Yang (2007) proposes a multiple-states based temporal consistency model. Besides SC (strong consistency) which requires no action, three types of fine-grained temporal inconsistency states including WC (weak consistency), WI (weak inconsistency) and SI (strong inconsistency) are defined based on the minimum, mean and maximum workflow execution time. However, without investigating the performance of different exception handling strategies, it is difficult to determine which strategy should be applied to handle the detected temporal violations. Therefore, it is more reasonable to define fine-grained temporal violations according to the selection of exception handling strategies with different capabilities, rather than, as in most previous studies, defining fine-grained temporal violations first and then looking for available exception handling strategies. To the best of our knowledge, this is the first work to systematically investigate a general exception handling framework for automatic and cost-effective handling of temporal violations in scientific workflow systems.
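The selection rule described above (among the strategies capable of recovering the current time deficit, apply the one with the lowest cost) can be illustrated with a minimal Python sketch. The strategy names, capability values and cost values below are hypothetical placeholders for illustration only and are not taken from the paper's experiments; the function name `select_strategy` is likewise an assumption.

```python
from typing import NamedTuple, Optional, Sequence


class Strategy(NamedTuple):
    name: str
    capability: float  # assumed largest time deficit (seconds) the strategy can recover
    cost: float        # assumed combined monetary cost and time overhead (arbitrary units)


def select_strategy(time_deficit: float,
                    candidates: Sequence[Strategy]) -> Optional[Strategy]:
    """Return the cheapest candidate able to recover the deficit, or None."""
    capable = [s for s in candidates if s.capability >= time_deficit]
    if not capable:
        # No light-weight strategy suffices; the violation is treated
        # as non-recoverable in this sketch.
        return None
    return min(capable, key=lambda s: s.cost)


# Hypothetical candidates, for illustration only.
CANDIDATES = [
    Strategy("light", capability=30.0, cost=1.0),
    Strategy("medium", capability=120.0, cost=5.0),
    Strategy("heavy", capability=300.0, cost=12.0),
]

chosen = select_strategy(100.0, CANDIDATES)
print(chosen.name if chosen else "non-recoverable")
```

With these placeholder values, a deficit of 100 seconds rules out the cheapest candidate, so the next cheapest capable one is chosen. This is why the capability of each strategy has to be known before the violation levels are defined.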

In this paper, along with the probability based temporal consistency model which defines the range of recoverable temporal violations, a novel general automatic and cost-effective exception handling framework is proposed. Specifically, fine-grained temporal violations are first defined based on the empirical function for the capability lower bounds of the exception handling strategies. Afterwards, to serve as a case study, a concrete example framework is presented which consists of three levels of fine-grained temporal violations, viz., level I, level II and level III temporal violations defined within the recoverable probability range, and three light-weight automatic exception handling strategies, viz., TDA (Time Deficit Allocation), ACOWR (Ant Colony Optimisation based two-stage Workflow local Rescheduling) and TDA + ACOWR (the combined strategy of TDA and ACOWR). Large scale simulation experiments are conducted in the SwinDeW-G scientific grid workflow system (Yang et al., 2007) to evaluate the effectiveness of the example framework.
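As a rough illustration of how violation levels could follow from strategy capabilities, the Python sketch below partitions time deficits into three levels using one capability lower bound per strategy. It rests on several assumptions that are not stated in this excerpt: the bounds are hypothetical numbers expressed directly as time deficits (the paper defines the levels within a probability range), and the pairing of level I, II and III with TDA, ACOWR and TDA + ACOWR simply follows the order in which they are listed.

```python
from bisect import bisect_left
from typing import Optional, Tuple

# Hypothetical capability lower bounds (seconds of recoverable time deficit),
# in ascending order. These are NOT values from the paper.
LEVEL_BOUNDS = [30.0, 120.0, 300.0]
LEVEL_NAMES = ["level I", "level II", "level III"]
# Assumed pairing, following the order in which the strategies are listed.
LEVEL_STRATEGIES = ["TDA", "ACOWR", "TDA + ACOWR"]


def classify_violation(time_deficit: float) -> Optional[Tuple[str, Optional[str]]]:
    """Map a time deficit to (level, strategy); None means no violation."""
    if time_deficit <= 0:
        return None  # no delay against the temporal constraint
    # Index of the first bound that is >= the deficit.
    idx = bisect_left(LEVEL_BOUNDS, time_deficit)
    if idx == len(LEVEL_BOUNDS):
        # Beyond every bound: outside the recoverable range in this sketch.
        return ("non-recoverable", None)
    return (LEVEL_NAMES[idx], LEVEL_STRATEGIES[idx])


for deficit in (10.0, 90.0, 200.0, 400.0):
    print(deficit, classify_violation(deficit))
```

Because each level boundary is a capability bound, any deficit classified into a level can, under these assumptions, be recovered by the strategy paired with that level.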

The remainder of the paper is organised as follows. Section 2 presents a motivating example and the problem analysis. Section 3 proposes a general exception handling framework for temporal violations. Section 4 presents a case study with a concrete exception handling framework with three levels of temporal violations and their corresponding handling strategies. Section 5 demonstrates comprehensive simulation results. Section 6 reviews the related work. Finally, Section 7 concludes the paper and points out the future work.

Section snippets

Motivating example

In this section, we present an example scientific workflow in Astrophysics. Parkes Radio Telescope (http://www.parkes.atnf.csiro.au/, located 380 km west of Sydney, Australia), one of the most famous radio telescopes, is serving institutions around the world. Swinburne Astrophysics group has been conducting a pulsar searching survey based on the observation data from Parkes Radio Telescope (http://astronomy.swin.edu.au/pulsar/). The pulsar searching process is a typical scientific workflow which

A general exception handling framework for temporal violations

In this section, an overview of a probability based temporal consistency model is presented and the range of recoverable temporal violations is defined. Afterwards, a general exception handling framework is proposed where fine-grained temporal violations are defined based on the empirical function for the capability lower bounds of exception handling strategies.

An example implementation of the framework

Based on the general exception handling framework defined in Section 3, this section presents an automatic and cost-effective exception handling framework which serves as a representative example.

Evaluations on example framework

In this section, we evaluate the performance of the example framework to demonstrate the effectiveness of our general exception handling framework. Qualitatively, we can claim that our example framework satisfies the two basic requirements of Automation and Cost-effectiveness.

Automation: Based on our previous work on checkpoint selection and temporal verification (Chen and Yang, in press), different levels of temporal violations can be automatically detected in an efficient fashion.

Related work

Temporal constraint is one of the most important workflow QoS constraints besides cost, fidelity, reliability and security as discussed in Yu and Buyya (2005). In practice, a set of temporal constraints can be deemed as a QoS contract between clients and service providers. In order to successfully fulfil these contracts, efficient monitoring mechanisms such as checkpoint selection (Chen and Yang, in press) and temporal verification (Chen and Yang, 2007) are implemented to dynamically detect

Conclusions and future work

Latest studies in checkpoint selection and temporal verification can only detect temporal violations but cannot handle them. In this paper, the issue of handling temporal violations in scientific workflows has been systematically investigated and addressed by our proposed exception handling framework. Given the two fundamental requirements of Automation and Cost-effectiveness, a novel general exception handling framework has been proposed where fine-grained temporal violations are defined based

Acknowledgments

This work is partially supported by the Australian Research Council under Linkage Project LP0990393 and by the National Natural Science Foundation of China under Grant No. 70871033. Part of this work, particularly the example framework, has been accepted by ICPADS’2010. We are also grateful for the discussions with Dr. W. van Straten and Ms. L. Levin from Swinburne Centre for Astrophysics and Supercomputing.

References (40)

  • J. Chen et al. A taxonomy of grid workflow verification and validation. Concurrency and Computation: Practice and Experience (2008)
  • Chen, J., Yang, Y., in press. Temporal dependency based checkpoint selection for dynamic verification of temporal...
  • W. Chen et al. An ant colony optimization approach to a grid workflow scheduling problem with various QoS requirements. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews (2009)
  • W. Chen et al. Workflow scheduling in grids: an ant colony optimization approach
  • P. Choudhury et al. Hybrid scheduling of dynamic task graphs with selective duplication for multiprocessors under memory and time constraints. IEEE Transactions on Parallel and Distributed Systems (2008)
  • K. Cooper et al. New grid scheduling and rescheduling methods in the GrADS project
  • E. Deelman et al. Workflows and e-science: an overview of workflow system features and capabilities. Future Generation Computer Systems (2008)
  • J. Eder et al. Time constraints in workflow systems
  • I. Foster et al. The Grid: Blueprint for a New Computing Infrastructure (2004)
  • C. Hagen et al. Exception handling in workflow management systems. IEEE Transactions on Software Engineering (2000)