
2007 | Book

Workflows for e-Science

Scientific Workflows for Grids

Edited by: Ian J. Taylor, PhD, Ewa Deelman, PhD, Dennis B. Gannon, PhD, Matthew Shields, PhD

Publisher: Springer London


About this Book

This collection of articles on ‘Workflows for e-Science’ is very timely and important. Increasingly, to attack the next generation of scientific problems, multidisciplinary and distributed teams of scientists need to collaborate to make progress on these new ‘Grand Challenges’. Scientists now need to access and exploit computational resources and databases that are geographically distributed through the use of high-speed networks. ‘Virtual Organizations’ or ‘VOs’ must be established that span multiple administrative domains and/or institutions and which can provide appropriate authentication and authorization services and access controls to collaborating members. Some of these VOs may only have a fleeting existence, but the lifetime of others may run into many years. The Grid community is attempting to develop both standards and middleware to enable both scientists and industry to build such VOs routinely and robustly. This, of course, has been the goal of research in distributed computing for many years; but now these technologies come with a new twist: service orientation. By specifying resources in terms of a service description, rather than allowing direct access to the resources, the IT industry believes that such an approach results in the construction of more robust distributed systems. The industry has therefore united around Web services as the standard technology to implement such service-oriented architectures and to ensure interoperability between different vendor systems.

Table of Contents

Frontmatter

Introduction

1. Introduction
Abstract
Workflows for e-Science is divided into four parts, which represent four broad but distinct areas of scientific workflows. In the first part, Background, we introduce the concept of scientific workflows and set the scene by describing how they differ from their business workflow counterpart. In Part II, Application and User Perspective, we provide a number of scientific examples that currently use workflows for their e-Science experiments. In Workflow Representation and Common Structure (Part III), we describe core workflow themes, such as control flow or dataflow and the use of components or services. In this part, we also provide overviews for a number of common workflow languages, such as Petri Nets, the Business Process Execution Language (BPEL), and the Virtual Data Language (VDL), along with service interfaces. In Part IV, Frameworks and Tools, we take a look at many of the popular environments that are currently being used for e-Science applications by paying particular attention to their workflow capabilities. The following four sections describe the chapters in each part and therefore provide a comprehensive summary of the book as a whole.
Dennis Gannon, Ewa Deelman, Matthew Shields, Ian Taylor

Scientific versus Business Workflows

2. Scientific versus Business Workflows
Abstract
The formal concept of a workflow has existed in the business world for a long time. An entire industry of tools and technology devoted to workflow management has been developed and marketed to meet the needs of commercial enterprises. The Workflow Management Coalition (WfMC) has existed for over ten years and has developed a large set of reference models, documents, and standards. Why has the scientific community not adopted these existing standards? While it is not uncommon for the scientific community to reinvent technology rather than purchase existing solutions, there are issues involved in the technical applications that are unique to science, and we will attempt to characterize some of these here. There are, however, many core concepts that have been developed in the business workflow community that directly relate to science, and we will outline them below.
Roger Barga, Dennis Gannon

Application and User Perspective

Frontmatter
3. Generating Complex Astronomy Workflows
Abstract
Astronomy has a rich heritage of discovery using image data sets that cover the full range of the electromagnetic spectrum. Image data sets in one frequency range have often been studied in isolation from those in other frequency ranges. This is mostly a consequence of the diverse properties of the data collections themselves. Images are delivered in different coordinate systems, map projections, spatial samplings, and image sizes, and the pixels themselves are rarely co-registered on the sky. Moreover, the spatial extent of many astronomically important structures, such as clusters of galaxies and star formation regions, is often substantially greater than that of individual images.
G. Bruce Berriman, Ewa Deelman, John Good, Joseph C. Jacob, Daniel S. Katz, Anastasia C. Laity, Thomas A. Prince, Gurmeet Singh, Mei-Hui Su
4. A Case Study on the Use of Workflow Technologies for Scientific Analysis: Gravitational Wave Data Analysis
Abstract
Modern scientific experiments acquire large amounts of data that must be analyzed in subtle and complicated ways to extract the best results. The Laser Interferometer Gravitational Wave Observatory (LIGO) is an ambitious effort to detect gravitational waves produced by violent events in the universe, such as the collision of two black holes or the explosion of supernovae [37,258]. The experiment records approximately 1 TB of data per day, which is analyzed by scientists in a collaboration that spans four continents. LIGO and distributed computing have grown up side by side over the past decade, and the analysis strategies adopted by LIGO scientists have been strongly influenced by the increasing power of tools to manage distributed computing resources and the workflows to run on them. In this chapter, we use LIGO as an application case study in workflow design and implementation. The software architecture outlined here has been used with great efficacy to analyze LIGO data [2–5] using dedicated computing facilities operated by the LIGO Scientific Collaboration, the LIGO Data Grid. It is just the first step, however. Workflow design and implementation lies at the interface between computing and traditional scientific activities. In the conclusion, we outline a few directions for future development and provide some long-term vision for applications related to gravitational wave data analysis.
Duncan A. Brown, Patrick R. Brady, Alexander Dietz, Junwei Cao, Ben Johnson, John McNabb
5. Workflows in Pulsar Astronomy
Abstract
In this chapter, we describe the development of methods that operate on the output signal of a radio telescope to detect the characteristic signals of pulsars. These signals are much weaker than the noise in the signal at any given wavelength, and therefore algorithms for combining the signals in different wavelength bands must be applied. This is extremely expensive in terms of CPU power. Early versions of distributed algorithms ran on a distributed network of supercomputers connected by Internet-aware Message Passing Interface (MPI) during the period 1999–2001. Today such techniques are being integrated into workflows that automate the search process and enable sophisticated astronomical knowledge to be captured via the construction of the workflow. In particular, we address issues of parallelism within components of the workflow. Parallelism is necessary due to two constraints on workflow performance. One is the application of the workflow in real time as the signal is being processed, to enable very precise measurements to be carried out on known pulsars. The other is the use of the workflow to explore large regions of parameter space in search of previously undetected pulsars. There are very severe constraints on the degree of abstraction that can currently be applied in this work, since details of the architecture of the computing resource (parallel cluster or computational Grid) on which the workflows are to be run cannot be ignored in the construction of the workflow.
John Brooke, Stephen Pickles, Paul Carr, Michael Kramer
6. Workflow and Biodiversity e-Science
Abstract
Biodiversity e-Science is characterized by the use of a wide range of different kinds of data and by performing complex analyses on these data. In this chapter, we discuss the use of workflow systems to assist biodiversity researchers and consider how such systems can provide repeatability of experiments and other benefits. We argue that there are, nevertheless, limitations to this kind of approach, and we discuss how more flexibility could be achieved in a more exploratory environment.
Andrew C. Jones
7. Ecological Niche Modeling Using the Kepler Workflow System
Abstract
Changes in biodiversity have been linked to variations in climate and human activities [295]. These changes have implications for a wide range of socially relevant processes, including the spread of infectious disease, invasive species dynamics, and vegetation productivity [27, 70, 203, 291, 294, 376, 426]. Our understanding of biodiversity patterns and processes through space and time, scaling from genes to continents, is limited by our ability to analyze and synthesize multidimensional data effectively from sources as wide-ranging as field and laboratory experiments, satellite imagery, and simulation models.
Deana D. Pennington, Dan Higgins, A. Townsend Peterson, Matthew B. Jones, Bertram Ludäscher, Shawn Bowers
8. Case Studies on the Use of Workflow Technologies for Scientific Analysis: The Biomedical Informatics Research Network and the Telescience Project
Abstract
The advent of “Grids,” or Grid computing, has led to a fundamental shift in the development of applications for managing and performing computational or data-intensive analyses. A current challenge faced by the Grid community entails modeling the work patterns of domain or bench scientists and providing robust solutions utilizing distributed infrastructures. These challenges spawned efforts to develop “workflows” to manage programs and data on behalf of the end user. The technologies come from multiple scientific fields, often with disparate definitions, and have unique advantages and disadvantages, depending on the nature of the scientific process in which they are used. In this chapter, we argue that to maximize the impact of these efforts, there is value in promoting the use of workflows within a tiered, hierarchical structure where each of these emerging workflow pieces is interoperable. We present workflow models of the Telescience™ Project and BIRN architectures as frameworks that manage multiple tiers of workflows to provide tailored solutions for end-to-end scientific processes.
Abel W. Lin, Steven T. Peltier, Jeffrey S. Grethe, Mark H. Ellisman
9. Dynamic, Adaptive Workflows for Mesoscale Meteorology
Abstract
The Linked Environments for Atmospheric Discovery (LEAD) [122] is a National Science Foundation-funded project to change the paradigm for mesoscale weather prediction from one of static, fixed-schedule computational forecasts to one that is adaptive and driven by weather events. It is a collaboration of eight institutions, led by Kelvin Droegemeier of the University of Oklahoma, with the goal of enabling far more accurate and timely predictions of tornadoes and hurricanes than previously considered possible. The traditional approach to weather prediction is a four-phase activity. In the first phase, data from sensors are collected. The sensors include ground instruments, such as humidity, temperature, and lightning-strike detectors, as well as atmospheric measurements taken from balloons, commercial aircraft, radars, and satellites. The second phase is data assimilation, in which the gathered data are merged into a set of consistent initial and boundary conditions for a large simulation. The third phase is the weather prediction itself, which applies numerical equations to the measured conditions in order to project future weather conditions. The final phase is the generation of visual images of the processed data products, which are analyzed to make predictions. Each phase of activity is performed by one or more application components.
Dennis Gannon, Beth Plale, Suresh Marru, Gopi Kandaswamy, Yogesh Simmhan, Satoshi Shirasuna
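The four-phase structure described in the LEAD abstract above (data collection, assimilation, prediction, visualization) maps naturally onto a linear workflow of application components. The following is a minimal illustrative sketch, not LEAD code; all function and data names are hypothetical, and in the real system each stage is a distributed component that LEAD makes adaptive and event-driven rather than hard-wired.

```python
# Hypothetical sketch of a four-phase forecast pipeline (not LEAD code).
# Each function stands in for a distributed application component.

def collect_observations():
    """Phase 1: gather (mock) sensor readings from ground stations, radar, etc."""
    return [{"station": "KOUN", "temp_c": 29.5, "humidity": 0.71},
            {"station": "KTLX", "temp_c": 31.2, "humidity": 0.65}]

def assimilate(observations):
    """Phase 2: merge observations into consistent initial conditions."""
    n = len(observations)
    return {"mean_temp_c": sum(o["temp_c"] for o in observations) / n,
            "mean_humidity": sum(o["humidity"] for o in observations) / n}

def forecast(initial_conditions, hours=6):
    """Phase 3: stand-in for the numerical weather prediction model."""
    return [{"hour": h, "temp_c": initial_conditions["mean_temp_c"] - 0.3 * h}
            for h in range(hours)]

def visualize(products):
    """Phase 4: render the processed data products (here, just print them)."""
    for p in products:
        print(f"+{p['hour']:2d}h  {p['temp_c']:.1f} C")

if __name__ == "__main__":
    # A static, fixed-schedule run of the traditional pipeline.
    visualize(forecast(assimilate(collect_observations())))
```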
10. SCEC CyberShake Workflows—Automating Probabilistic Seismic Hazard Analysis Calculations
Abstract
The Southern California Earthquake Center (SCEC) is a community of more than 400 scientists from over 54 research organizations that conducts geophysical research in order to develop a physics-based understanding of earthquake processes and to reduce the hazard from earthquakes in the Southern California region [377].
Philip Maechling, Ewa Deelman, Li Zhao, Robert Graves, Gaurang Mehta, Nitin Gupta, John Mehringer, Carl Kesselman, Scott Callaghan, David Okaya, Hunter Francoeur, Vipin Gupta, Yifeng Cui, Karan Vahi, Thomas Jordan, Edward Field

Workflow Representation and Common Structure

Frontmatter
11. Control- Versus Data-Driven Workflows
Abstract
Workflow is typically defined as a sequence of operations or tasks needed to manage a business process or computational activity (Chapter 1). The representation of the sequence of operations or tasks is handled in many different ways by different people and varies from simple scripting languages, through graphs represented in textual or graphical form, to mathematical representations such as Petri Nets (Chapter 13) or π-calculus (Chapter 15).
Matthew Shields
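To make the distinction concrete, here is a small illustrative sketch (not taken from the chapter) contrasting the two styles: a control-driven description fixes the order of tasks explicitly, while a data-driven description lets execution order follow the availability of each task's inputs. All names are hypothetical.

```python
# Hypothetical sketch contrasting control-driven and data-driven execution.

def stage_a(): return 1
def stage_b(x): return x + 1
def stage_c(x): return x * 2

# Control-driven: the workflow is an explicit sequence of steps.
def run_control_driven():
    a = stage_a()
    b = stage_b(a)
    return stage_c(b)

# Data-driven: tasks declare their inputs; a task fires once its inputs exist.
def run_data_driven():
    tasks = {"a": (stage_a, []), "b": (stage_b, ["a"]), "c": (stage_c, ["b"])}
    values = {}
    while len(values) < len(tasks):
        for name, (fn, deps) in tasks.items():
            if name not in values and all(d in values for d in deps):
                values[name] = fn(*[values[d] for d in deps])
    return values["c"]

assert run_control_driven() == run_data_driven() == 4
```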
12. Component Architectures and Services: From Application Construction to Scientific Workflows
Abstract
The idea of building computer applications by composing them out of reusable software components is a concept that emerged in the 1970s and 1980s as developers began to realize that the complexity of software was evolving so rapidly that a different approach was needed if actual software development was going to keep pace with the demands placed upon it. This fact had already been realized by hardware designers. By the mid-1970s, it was standard practice to build digital systems by composing them from standard, well-tested integrated circuits that encapsulated sophisticated, powerful subsystems that were easily reused in thousands of applications. By the 1990s, even the designers of integrated circuits such as microprocessors were building them by composing them from standard cell libraries that provided components such as registers and floating-point units that could be arranged on the chip and easily integrated to form a full processor. Now, multiple processor cores can be assembled on a single chip as components of larger systems.
Dennis Gannon
13. Petri Nets
Abstract
In 1962, C.A. Petri introduced in his Ph.D. thesis [351] a formalism for describing distributed processes by extending state machines with a notion of concurrency. Due to the simple and intuitive, but at the same time formal and expressive, nature of his formalism, Petri Nets became an established tool for modelling and analyzing distributed processes in business as well as the IT sector. This chapter gives a brief introduction to the theory of Petri Nets and shows how Petri Nets can be applied for effective workflow management with regard to the choreography, orchestration, and enactment of e-Science applications. While choreography deals with the abstract modelling of applications, orchestration deals with the mapping onto concrete software components and the infrastructure. During the enactment of e-Science applications, runtime issues, such as synchronization, persistence, transaction safety, and fault management, are examined within the workflow formalism.
Andreas Hoheisel, Martin Alt
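As a concrete illustration of the formalism (a minimal sketch, not taken from the chapter), a Petri Net can be represented by places holding tokens and transitions that may fire only when all of their input places are marked; firing consumes input tokens and produces output tokens. The place and transition names below are hypothetical.

```python
# Minimal Petri Net sketch: places hold token counts, and a transition fires
# only when every one of its input places has at least one token.

class PetriNet:
    def __init__(self, marking):
        self.marking = dict(marking)      # place -> token count
        self.transitions = {}             # name -> (input places, output places)

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (inputs, outputs)

    def enabled(self, name):
        inputs, _ = self.transitions[name]
        return all(self.marking.get(p, 0) > 0 for p in inputs)

    def fire(self, name):
        if not self.enabled(name):
            raise ValueError(f"transition {name} is not enabled")
        inputs, outputs = self.transitions[name]
        for p in inputs:
            self.marking[p] -= 1
        for p in outputs:
            self.marking[p] = self.marking.get(p, 0) + 1

# Two concurrent tasks followed by a synchronizing join transition.
net = PetriNet({"start": 1})
net.add_transition("split", ["start"], ["ready_a", "ready_b"])
net.add_transition("task_a", ["ready_a"], ["done_a"])
net.add_transition("task_b", ["ready_b"], ["done_b"])
net.add_transition("join", ["done_a", "done_b"], ["finished"])

for t in ["split", "task_a", "task_b", "join"]:
    net.fire(t)
print(net.marking)   # the single token has moved to 'finished'
```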
14. Adapting BPEL to Scientific Workflows
Abstract
In this chapter, we examine the degree to which a de facto standard business Web services workflow language, Business Process Execution Language for Web Services (BPEL4WS), can be used to compose Grid and scientific workflows. As the Grid application models, such as Open Grid Services Architecture (OGSA) [146], move toward Web services and service-oriented architecture (SOA) [135], supporting Web services is becoming a requirement for a Grid workflow language.
Aleksander Slominski
15. Protocol-Based Integration Using SSDL and π-Calculus
Abstract
A “service” has become the contemporary abstraction around which modern distributed applications are designed and built. A service represents a piece of functionality that is exposed on the network. The “message” abstraction is used to create interaction patterns or protocols to represent the messaging behavior of a service. In the Web services domain, SOAP is the preferred model for encoding, transferring, and processing such messages.
Simon Woodman, Savas Parastatidis, Jim Webber
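The message-oriented view described above can be illustrated with a toy sketch (hypothetical, not SSDL or its tooling): a service is modeled purely by the messages it accepts and emits, and a protocol is a constraint on the order in which those messages may be exchanged. All message and function names are invented for illustration.

```python
# Toy sketch of protocol-based interaction: a service is described only by
# the messages it consumes and produces, and the observed exchange is
# checked against a simple request/response protocol. Hypothetical names.

from collections import deque

PROTOCOL = [("client", "GetQuoteRequest"), ("service", "GetQuoteResponse")]

def quote_service(inbox, outbox):
    """Consume one request message and emit one response message."""
    msg = inbox.popleft()
    assert msg["type"] == "GetQuoteRequest"
    outbox.append({"type": "GetQuoteResponse",
                   "symbol": msg["symbol"], "price": 42.0})

def conforms(trace):
    """Check that the observed message ordering matches the protocol."""
    return [(who, m["type"]) for who, m in trace] == PROTOCOL

inbox, outbox = deque(), deque()
request = {"type": "GetQuoteRequest", "symbol": "IBM"}
inbox.append(request)
quote_service(inbox, outbox)
response = outbox.popleft()
print(conforms([("client", request), ("service", response)]))  # True
```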
16. Workflow Composition: Semantic Representations for Flexible Automation
Abstract
Many different kinds of users may need to compose scientific workflows for different purposes. This chapter focuses on the requirements and challenges of scientific workflow composition. They are motivated by our work with two particular application domains: physics-based seismic hazard analysis (Chapter 10) and data-intensive natural language processing [238]. Our research on workflow creation spans from fully automated workflow generation (Chapter 23) using artificial intelligence planning techniques to assisted workflow composition [237, 276], combining semantic representations of workflow components with formal properties of correct workflows. Other projects have used similar techniques in different domains to support workflow composition through planning and automated reasoning [286, 289, 415] and semantic representations (Chapter 19). As workflow representations become more declarative and expressive, they enable significant improvements in automation and assistance for workflow composition and, more generally, in managing and automating complex scientific processes. The chapter starts off by motivating and describing important requirements to support the creation of workflows. Based on these requirements, we outline the approaches that we have found effective, including separating levels of abstraction in workflow descriptions, using semantic representations of workflows and their components, and supporting flexible automation through reuse and automatic completion of user specifications of partial workflows.
Yolanda Gil
17. Virtual Data Language: A Typed Workflow Notation for Diversely Structured Scientific Data
Abstract
When constructing workflows that operate on large and complex data sets, the ability to describe the types of both data sets and workflow procedures can be invaluable, enabling discovery of data sets and procedures, type checking and composition of procedure calls, and iteration over composite data sets.
Yong Zhao, Michael Wilde, Ian Foster
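A small hypothetical sketch (not VDL syntax) of the idea described above: when workflow procedures declare the types of the data sets they consume and produce, composition of procedure calls can be type-checked and iteration over composite data sets becomes mechanical. The dataset and procedure names are invented for illustration.

```python
# Hypothetical sketch of typed workflow procedures (not VDL itself):
# declaring dataset types lets a workflow system type-check composition
# and iterate over composite datasets automatically.

from dataclasses import dataclass
from typing import List

@dataclass
class Image:          # a single sky image
    name: str

@dataclass
class Catalog:        # a source catalog extracted from an image
    source: str
    n_objects: int

def extract_sources(img: Image) -> Catalog:
    """A typed procedure: Image -> Catalog."""
    return Catalog(source=img.name, n_objects=len(img.name) * 10)

def extract_all(images: List[Image]) -> List[Catalog]:
    """Iteration over a composite dataset follows from the element type."""
    return [extract_sources(img) for img in images]

survey = [Image("field_001"), Image("field_002")]
print(extract_all(survey))
```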

Frameworks and Tools: Workflow Generation, Refinement, and Execution

Frontmatter
18. Workflow-Level Parametric Study Support by MOTEUR and the P-GRADE Portal
Abstract
Many large-scale scientific applications require the processing of complete data sets made of individual data segments that can be manipulated independently following a single analysis procedure. Workflow managers have been designed for describing and controlling such complex application control flows. However, when considering very data-intensive applications, there is a large potential parallelism that should be properly exploited to ensure efficient processing. Distributed systems such as Grid infrastructures are promising for handling the load resulting from parallel data analysis and manipulation. Workflow managers can help in exploiting the infrastructure parallelism, given that they are able to handle the data flow resulting from the application’s execution.
Tristan Glatard, Gergely Sipos, Johan Montagnat, Zoltan Farkas, Peter Kacsuk
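The data parallelism described above, applying one analysis procedure independently to many data segments, can be sketched as follows (a hypothetical illustration, not MOTEUR or P-GRADE code); a workflow manager performs the same fan-out across Grid resources rather than local processes.

```python
# Hypothetical sketch of workflow-level parametric parallelism: the same
# analysis procedure is applied independently to every data segment, so
# the segments can be processed concurrently.

from concurrent.futures import ProcessPoolExecutor

def analyze(segment: int) -> float:
    """Stand-in for one analysis task applied to one data segment."""
    return sum(i * i for i in range(segment)) / 1e6

if __name__ == "__main__":
    segments = list(range(1, 9))              # independent data segments
    with ProcessPoolExecutor() as pool:       # a Grid broker plays this role
        results = list(pool.map(analyze, segments))
    print(results)
```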
19. Taverna/myGrid: Aligning a Workflow System with the Life Sciences Community
Abstract
Bioinformatics is a discipline that uses computational and mathematical techniques to store, manage, and analyze biological data in order to answer biological questions. Bioinformatics has over 850 databases [154] and numerous tools that work over those databases and local data to produce even more data themselves. In order to perform an analysis, a bioinformatician uses one or more of these resources to gather, filter, and transform data to answer a question. Thus, bioinformatics is an in silico science.
Tom Oinn, Peter Li, Douglas B. Kell, Carole Goble, Antoon Goderis, Mark Greenwood, Duncan Hull, Robert Stevens, Daniele Turi, Jun Zhao
20. The Triana Workflow Environment: Architecture and Applications
Abstract
In this chapter, the Triana workflow environment is described. Triana focuses on supporting services within multiple environments, such as peer-to-peer (P2P) and the Grid, by integrating with various types of middleware toolkits. This approach differs from that of the previous chapter, which gave an overview of Taverna, a system designed to support scientists using Grid technology to conduct in silico experiments in biology. Taverna focuses its workflows at the Web services level and addresses the question of how such services should be presented to its users.
Ian Taylor, Matthew Shields, Ian Wang, Andrew Harrison
21. Java CoG Kit Workflow
Abstract
In order to satisfy the need for sophisticated experiment and simulation management solutions for the scientific user community, various frameworks must be provided. Such frameworks include APIs, services, templates, patterns, GUIs, command-line tools, and workflow systems that specifically address the goal of assisting in the complex process of experiment and simulation management. Workflow by itself is just one of the ingredients for a successful experiment and simulation management tool.
Gregor von Laszewski, Mihael Hategan, Deepti Kodeboyina
22. Workflow Management in Condor
Abstract
The Condor project began in 1988 and has evolved into a feature-rich batch system that targets high-throughput computing; that is, Condor ([262], [414]) focuses on providing reliable access to computing over long periods of time instead of highly tuned, high-performance computing for short periods of time or a small number of applications.
Peter Couvares, Tevfik Kosar, Alain Roy, Jeff Weber, Kent Wenger
23. Pegasus: Mapping Large-Scale Workflows to Distributed Resources
Abstract
Many scientific advances today are derived from analyzing large amounts of data. The computations themselves can be very complex and consume significant resources. Scientific efforts are also not conducted by individual scientists; rather, they rely on collaborations that encompass many researchers from various organizations. The analysis is often composed of several individual application components designed by different scientists. To describe the desired analysis, the components are assembled in a workflow where the dependencies between them are defined and the data needed for the analysis are identified. To support the scale of the applications, many resources are needed in order to provide adequate performance. These resources are often drawn from a heterogeneous pool of geographically distributed compute and data resources. Running large-scale, collaborative applications in such environments has many challenges. Among them are systematic management of the applications, their components, and the data, as well as successful and efficient execution on the distributed resources.
Ewa Deelman, Gaurang Mehta, Gurmeet Singh, Mei-Hui Su, Karan Vahi
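A toy sketch of the mapping problem described above (hypothetical, not the Pegasus implementation): an abstract workflow names the application components and their data dependencies, and a mapper assigns each task to one of the available resources before executing the tasks in dependency order. All task and resource names are invented.

```python
# Hypothetical sketch of mapping an abstract workflow onto resources:
# tasks and their dependencies are declared first, then each task is
# assigned to a resource and executed in topological (dependency) order.

abstract_workflow = {
    "extract":   [],                  # task -> list of tasks it depends on
    "transform": ["extract"],
    "combine":   ["transform"],
}
resources = ["cluster.siteA", "cluster.siteB"]

def map_to_resources(workflow, resources):
    """Round-robin placement; a real planner weighs data location and load."""
    return {task: resources[i % len(resources)]
            for i, task in enumerate(workflow)}

def topological_order(workflow):
    order, done = [], set()
    while len(order) < len(workflow):
        for task, deps in workflow.items():
            if task not in done and all(d in done for d in deps):
                order.append(task)
                done.add(task)
    return order

plan = map_to_resources(abstract_workflow, resources)
for task in topological_order(abstract_workflow):
    print(f"run {task} on {plan[task]}")
```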
24. ICENI
Abstract
Performing large-scale science is becoming increasingly complex. Scientists have resorted to the use of computing tools to enable and automate their experimental process. As acceptance of the technology grows, it will become commonplace that computational experiments will involve larger data sets, more computational resources, and scientists (often referred to as e-Scientists) distributed across geographical and organizational boundaries. We see the Grid paradigm as an abstraction to a large collection of distributed heterogeneous resources, including computational, storage, and instrument elements, controlled and shared by different organizations. Grid computing should facilitate the e-Scientist’s ability to run applications in a transparent manner.
A. Stephen McGough, William Lee, Jeremy Cohen, Eleftheria Katsiri, John Darlington
25. Expressing Workflow in the Cactus Framework
Abstract
The Cactus Framework [15, 73, 167] is an open-source, modular, portable programming environment for collaborative high-performance computing (HPC). It was designed and written specifically to enable scientists and engineers to perform the large-scale simulations needed for their science. From the outset, Cactus has followed two fundamental tenets: respecting user needs and embracing new technologies. The framework and its associated components must be driven from the beginning by user requirements. This has been achieved by developing, supporting, and listening to a large user base. Among these needs are ease of use, portability, the ability to support large and geographically diverse collaborations and to handle enormous computing resources, visualization, file I/O, and data management. It must also support the inclusion of legacy code, as well as a range of programming languages. It is essential that any living framework be able to incorporate new and developing cutting-edge computation technologies and infrastructure, with minimal or no disruption to its user base. Cactus is now associated with many computational science research projects, particularly in visualization, data management, and Grid computing [14].
Tom Goodale
26. Sedna: A BPEL-Based Environment for Visual Scientific Workflow Modeling
Abstract
Scientific Grid computing environments are increasingly adopting the Open Grid Services Architecture (OGSA), which is a service-oriented architecture for Grids. With the proliferation of OGSA, Grids effectively consist of a collection of Grid services, Web services with certain extensions providing additional support for state and life cycle management. Hence, the need arises for some means of composing these basic services into larger workflows in order to, for example, express a scientific experiment.
Bruno Wassermann, Wolfgang Emmerich, Ben Butchart, Nick Cameron, Liang Chen, Jignesh Patel
27. ASKALON: A Development and Grid Computing Environment for Scientific Workflows
Abstract
Most existing Grid application development environments provide the application developer with a nontransparent Grid. Commonly, application developers are explicitly involved in tedious tasks such as selecting software components deployed on specific sites, mapping applications onto the Grid, or selecting appropriate computers for their applications. Moreover, many programming interfaces are either implementation-technology-specific (e.g., based on Web services [24]) or force the application developer to program at a low-level middleware abstraction (e.g., start task, transfer data [22, 153]). While a variety of graphical workflow composition tools are currently being proposed, none of them is based on standard modeling techniques such as Unified Modeling Language (UML).
Thomas Fahringer, Radu Prodan, Rubing Duan, Jürgen Hofer, Farrukh Nadeem, Francesco Nerieri, Stefan Podlipnig, Jun Qin, Mumtaz Siddiqui, Hong-Linh Truong, Alex Villazon, Marek Wieczorek

Future Requirements

Frontmatter
Looking into the Future of Workflows: The Challenges Ahead
Abstract
In this chapter, we take a step back from the individual applications and software systems and attempt to categorize the types of issues that we are facing today and the challenges we see ahead. This is by no means a complete picture of the challenges but rather a set of observations about the various aspects of workflow management. In a broad sense, we are organizing our thoughts in terms of the different workflow systems discussed in this book, from the user interface down to the execution environment.
Ewa Deelman
Backmatter
Metadata
Title
Workflows for e-Science
Edited by
Ian J. Taylor, PhD
Ewa Deelman, PhD
Dennis B. Gannon, PhD
Matthew Shields, PhD
Copyright Year
2007
Publisher
Springer London
Electronic ISBN
978-1-84628-757-2
Print ISBN
978-1-84628-519-6
DOI
https://doi.org/10.1007/978-1-84628-757-2