
2013 | Book

Data Provenance and Data Management in eScience

Edited by: Qing Liu, Quan Bai, Stephen Giugni, Darrell Williamson, John Taylor

Publisher: Springer Berlin Heidelberg

Book series: Studies in Computational Intelligence


About this book

eScience allows scientific research to be carried out in highly distributed environments. The complex nature of the interactions in an eScience infrastructure, which often involves a range of instruments, data, models, applications, people and computational facilities, suggests there is a need for data provenance and data management (DPDM). The W3C Provenance Working Group defines the provenance of a resource as a “record that describes entities and processes involved in producing and delivering or otherwise influencing that resource”. It has been widely recognised that provenance is critical to enabling sharing, trust, authentication and reproducibility of eScience processes.

Data Provenance and Data Management in eScience identifies the gaps between DPDM foundations and their practice within eScience domains, including clinical trials, bioinformatics and radio astronomy. The book covers important aspects of fundamental research in DPDM, including provenance representation and querying, and also explores topics beyond the fundamentals, such as applications. This book is a unique reference for DPDM, with broad appeal to anyone interested in the practical issues of DPDM in eScience domains.

Table of Contents

Frontmatter

Provenance in eScience: Representation and Use

Frontmatter
Provenance Model for Randomized Controlled Trials
Abstract
This chapter proposes a provenance model for the clinical research domain, focusing on the planning and conduct of randomized controlled trials, and the subsequent analysis and reporting of results from those trials. We look at the provenance requirements for clinical research and trial management of different stakeholders (researchers, clinicians, participants, IT staff) to identify elements needed at multiple levels and stages of the process. To address these challenges, a provenance model is defined by extending the Open Provenance Model with domain-specific additions that tie the representation closer to the expertise of medical users, with the ultimate aim of creating the first OPM profile for randomized controlled clinical trials. As a starting point, we used the domain information model developed at the University of Düsseldorf, which conforms to the ICH Guideline for Good Clinical Practice (GCP) standard, thereby ensuring the wider applicability of our work. The application of the model is demonstrated on several examples and queries based on the integrated trial data being captured as part of the TRANSFoRm EU FP7 project.
Vasa Curcin, Roxana Danger, Wolfgang Kuchinke, Simon Miles, Adel Taweel, Christian Ohmann
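The Open Provenance Model named above is built from three node types (artifacts, processes, agents) connected by causal edges such as "used", "wasGeneratedBy" and "wasControlledBy". A minimal sketch of that structure, with a hypothetical trial fragment (all identifiers illustrative, not the chapter's actual OPM profile):

```python
from dataclasses import dataclass, field

# Minimal sketch of OPM's core vocabulary: artifacts (immutable data
# states), processes (actions) and agents (controllers of processes),
# connected by causal edges.

@dataclass(frozen=True)
class Artifact:
    id: str          # e.g. a snapshot of trial data

@dataclass(frozen=True)
class Process:
    id: str          # e.g. a randomisation step

@dataclass(frozen=True)
class Agent:
    id: str          # e.g. a clinician or trial system

@dataclass
class OPMGraph:
    used: list = field(default_factory=list)              # (process, artifact)
    was_generated_by: list = field(default_factory=list)  # (artifact, process)
    was_controlled_by: list = field(default_factory=list) # (process, agent)

    def ancestors(self, artifact):
        """Artifacts that (transitively) influenced `artifact`."""
        result, frontier = set(), {artifact}
        while frontier:
            art = frontier.pop()
            for a, p in self.was_generated_by:
                if a == art:
                    for q, inp in self.used:
                        if q == p and inp not in result:
                            result.add(inp)
                            frontier.add(inp)
        return result

# Hypothetical randomized-trial fragment:
protocol = Artifact("trial-protocol-v2")
allocation = Artifact("allocation-list")
randomise = Process("randomise-participants")
statistician = Agent("trial-statistician")

g = OPMGraph()
g.used.append((randomise, protocol))
g.was_generated_by.append((allocation, randomise))
g.was_controlled_by.append((randomise, statistician))

print(g.ancestors(allocation))  # {Artifact(id='trial-protocol-v2')}
```

Queries such as "which protocol version influenced this allocation list?" reduce to graph traversals like `ancestors` above.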
Evaluating Workflow Trust Using Hidden Markov Modeling and Provenance Data
Abstract
In service-oriented environments, services with different functionalities are combined in a specific order to provide higher-level functionality. Keeping track of the composition process, along with the data transformations and services, provides a rich amount of information for later reasoning. This information, referred to as provenance, is of great importance and has found its way into areas of computer science such as bioinformatics, databases, and social and sensor networks. Current exploitation and application of provenance data is limited, as provenance systems have been developed mainly for specific applications. Therefore, there is a need for a multi-functional architecture that is application-independent and can be deployed in any area. In this chapter we describe such a multi-functional architecture as well as one of its components, which we call workflow evaluation. Assessing the trust value of a workflow helps to determine its reliability. The trustworthiness of a workflow's results can then be inferred, and a decision made on whether the workflow's trust rating should be improved, for example by replacing services with low trust levels with services with higher trust levels. We provide a new approach for evaluating workflow trust based on the Hidden Markov Model (HMM). We first present how workflow trust evaluation can be modeled as an HMM and explain how the model and its associated probabilities can be assessed. Then, we investigate the behavior of our model by relaxing the stationary assumption of the HMM and present another model based on non-stationary hidden Markov models. We compare the results of the two models and present our conclusions.
Mahsa Naseri, Simone A. Ludwig
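As an illustration of the general idea (not the chapter's actual model or parameters), workflow trust can be framed as an HMM whose hidden states are trust levels of successive services and whose observations are per-service outcomes; the standard forward algorithm then yields both the sequence likelihood and a posterior over the final trust state:

```python
import numpy as np

# Illustrative trust HMM: two hidden states, two observation symbols.
# All probabilities below are made-up values for demonstration.
states = ["trusted", "untrusted"]
obs_symbols = {"success": 0, "failure": 1}

pi = np.array([0.6, 0.4])        # initial trust-state distribution
A = np.array([[0.8, 0.2],        # transitions: trust tends to persist
              [0.3, 0.7]])
B = np.array([[0.9, 0.1],        # trusted services mostly succeed
              [0.4, 0.6]])       # untrusted services often fail

def forward(observations):
    """Forward algorithm: sequence likelihood and final-state posterior."""
    o = [obs_symbols[x] for x in observations]
    alpha = pi * B[:, o[0]]              # initialise with first emission
    for t in o[1:]:
        alpha = (alpha @ A) * B[:, t]    # propagate, then weight by emission
    return alpha.sum(), alpha / alpha.sum()

likelihood, posterior = forward(["success", "success", "failure"])
print(f"P(trusted) after this run: {posterior[0]:.2f}")
```

A failure observation late in the sequence shifts posterior mass toward the untrusted state, which is the signal the chapter's evaluation component would act on. The non-stationary variant discussed in the chapter would let `A` vary with the step index.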
Unmanaged Workflows: Their Provenance and Use
Abstract
Provenance of scientific data will play an increasingly critical role as scientists are encouraged by funding agencies and grand challenge problems to share and preserve scientific data. But it is foolhardy to believe that all human processes, particularly as varied as the scientific discovery process, will be fully automated by a workflow system. Consequently, provenance capture has to be thought of as a problem applied to both human and automated processes. The unmanaged workflow is the full human-driven activity, encompassing tasks whose execution is automated by an orchestration tool, and tasks that are done outside an orchestration tool. In this chapter we discuss the implications of the unmanaged workflow as it affects provenance capture, representation, and use. Illustrations of capture include multiple experiences with unmanaged capture using the Karma tool. Illustrations of use include defining workflows by suggesting additions to workflow designs under construction, reconstructing process traces, and using analysis tools to assess provenance quality.
Mehmet S. Aktas, Beth Plale, David Leake, Nirmal K. Mukhi

Data Provenance and Data Management Systems

Frontmatter
Sketching Distributed Data Provenance
Abstract
Users can determine the precise origins of their data by collecting detailed provenance records. However, auditing at a finer grain produces large amounts of metadata. To efficiently manage the collected provenance, several provenance management systems, including SPADE, record provenance on the hosts where it is generated. Distributed provenance raises the issue of efficient reconstruction during the query phase. Recursively querying provenance metadata or computing its transitive closure is known to have limited scalability and cannot be used for large provenance graphs. We present matrix filters, which are novel data structures for representing graph information, and demonstrate their utility for improving query efficiency with experiments on provenance metadata gathered while executing distributed workflow applications.
Tanu Malik, Ashish Gehani, Dawood Tariq, Fareed Zaffar
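For context on the scalability problem the chapter targets, here is a sketch of the baseline it improves upon (this is not SPADE's matrix filters): answering ancestry queries via the transitive closure of a boolean adjacency matrix. Closure by repeated boolean squaring is cubic per multiplication, which is exactly why compact sketch structures become attractive for large, distributed provenance graphs.

```python
import numpy as np

def transitive_closure(adj):
    """Reachability matrix of a DAG via repeated boolean squaring."""
    n = len(adj)
    reach = adj | np.eye(n, dtype=bool)   # every node reaches itself
    while True:
        # boolean matrix "square": path of length <= 2 over current paths
        nxt = reach | ((reach.astype(int) @ reach.astype(int)) > 0)
        if np.array_equal(nxt, reach):
            return reach
        reach = nxt

# Tiny provenance chain: 0 -> 1 -> 2, plus an unrelated node 3.
adj = np.zeros((4, 4), dtype=bool)
adj[0, 1] = adj[1, 2] = True

reach = transitive_closure(adj)
print(reach[0, 2], reach[3, 0])  # True False
```

For a graph with millions of provenance records, materialising (or recursively recomputing) this matrix is impractical, motivating the sketch-based representation the chapter proposes.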
A Mobile Cloud with Trusted Data Provenance Services for Bioinformatics Research
Abstract
Cloud computing provides a cheap yet reliable outsourcing model for anyone who needs large computing resources. Together with the Cloud, Service Oriented Architecture (SOA) allows the construction of scientific workflows that bring together various scientific computing tools, offered as services in the Cloud, to answer complex research questions. Certain critical steps in those scientific workflows need the participation of research personnel or experts, so it is highly desirable that scientists have easy access, for example via mobile devices, to the workflows running in the Cloud. Furthermore, since the participants in this cross-domain collaboration barely trust each other, achieving reliable data provenance becomes a challenging task. This chapter discusses these issues and possible solutions. We describe a Mobile Cloud system with a trusted provenance mechanism. The Mobile Cloud system facilitates the use of mobile devices to manipulate and interact with the scientific workflows running in the Cloud. Moreover, it provides trusted data provenance by acting as a trusted third party that records provenance data submitted by the participating services during workflow execution. We have implemented a prototype that allows bioinformatics workflows to be designed and joined using mobile devices. We demonstrate the Mobile Cloud concept with this prototype and report a performance evaluation of the critical stages of the bioinformatics workflow platform.
Jinhui Yao, Jingyu Zhang, Shiping Chen, Chen Wang, David Levy, Qing Liu
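One common building block for the kind of trusted third-party recording described above (an illustration, not the chapter's actual protocol) is a hash-chained provenance log: each record submitted by a workflow service is chained to its predecessor's digest, so any later modification of a record is detectable on verification:

```python
import hashlib
import json

class ProvenanceLog:
    """Tamper-evident append-only log kept by a trusted third party."""

    def __init__(self):
        self.entries = []          # list of (record, chained digest)

    def append(self, record: dict):
        prev = self.entries[-1][1] if self.entries else "0" * 64
        payload = prev + json.dumps(record, sort_keys=True)
        digest = hashlib.sha256(payload.encode()).hexdigest()
        self.entries.append((record, digest))

    def verify(self) -> bool:
        prev = "0" * 64
        for record, digest in self.entries:
            payload = prev + json.dumps(record, sort_keys=True)
            if hashlib.sha256(payload.encode()).hexdigest() != digest:
                return False       # some earlier record was altered
            prev = digest
        return True

# Hypothetical bioinformatics workflow steps submitting provenance:
log = ProvenanceLog()
log.append({"service": "blast-search", "input": "seq-42", "status": "ok"})
log.append({"service": "alignment", "input": "blast-out", "status": "ok"})
print(log.verify())                      # True

log.entries[0][0]["status"] = "failed"   # tampering by a participant...
print(log.verify())                      # False: ...is detected
```

Because participants barely trust each other, the value of such a log lies in the third party holding the chain: no single service can rewrite its own history without breaking every subsequent digest.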
Data Provenance and Management in Radio Astronomy: A Stream Computing Approach
Abstract
New approaches for data provenance and data management (DPDM) are required for mega-science projects like the Square Kilometre Array, characterized by extremely large data volumes and intense data rates, and therefore demanding innovative and highly efficient computational paradigms. In this context, we explore a stream-computing approach with an emphasis on the use of accelerators. In particular, we make use of a new generation of high-performance stream-based parallelization middleware known as InfoSphere Streams. Its viability for managing and ensuring interoperability and integrity of signal processing data pipelines is demonstrated in radio astronomy.
IBM InfoSphere Streams embraces the stream-computing paradigm: a shift from conventional data mining techniques (involving analysis of existing data from databases) towards real-time analytic processing. We discuss using InfoSphere Streams for effective DPDM in radio astronomy and propose a way in which InfoSphere Streams can be utilized for large antenna arrays. We present a case study, the InfoSphere Streams implementation of an autocorrelating spectrometer, and use this example to discuss the advantages of the stream-computing approach and the utilization of hardware accelerators.
Mahmoud S. Mahmoud, Andrew Ensor, Alain Biem, Bruce Elmegreen, Sergei Gulyaev
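The core computation of an autocorrelating spectrometer can be sketched independently of the chapter's InfoSphere Streams implementation: by the Wiener–Khinchin theorem, the power spectrum of a stationary signal is the Fourier transform of its autocorrelation, so a streaming pipeline can accumulate per-chunk spectra as samples arrive. The chunk size and channel count below are arbitrary illustrative choices.

```python
import numpy as np

def accumulate_spectrum(chunks, n_channels=64):
    """Accumulate an averaged power spectrum over a stream of sample chunks."""
    spectrum = np.zeros(n_channels)
    for chunk in chunks:                  # one chunk per stream tuple
        windowed = chunk * np.hanning(len(chunk))      # reduce leakage
        # |FFT|^2 of the windowed chunk == FT of its autocorrelation
        spectrum += np.abs(np.fft.rfft(windowed, 2 * (n_channels - 1))) ** 2
    return spectrum / len(chunks)

# Simulated antenna stream: a tone at 0.5 rad/sample buried in noise.
rng = np.random.default_rng(0)
t = np.arange(126)
chunks = [np.sin(0.5 * t) + 0.1 * rng.standard_normal(126) for _ in range(16)]

spec = accumulate_spectrum(chunks)
print(int(np.argmax(spec)))  # the channel containing the injected tone
```

In a stream-computing deployment each chunk would arrive as a tuple from an upstream operator, and the per-chunk FFT (the dominant cost) is the natural candidate for offloading to hardware accelerators.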
Using Provenance to Support Good Laboratory Practice in Grid Environments
Abstract
Conducting experiments and documenting results is the daily business of scientists. Good, traceable documentation enables other scientists to confirm procedures and results, increasing credibility. Documentation and scientific conduct are regulated under the term “good laboratory practice”. Laboratory notebooks are used to record each step in conducting an experiment and processing its data. Originally, these notebooks were paper based. With computerised research systems, acquired data became more elaborate, increasing the need for electronic notebooks with data storage, computational features and reliable electronic documentation. As a new approach, a scientific data management system (DataFinder) is enhanced with features for traceable documentation. Provenance recording is used to meet the requirements of traceability, and the recorded information can later be queried for further analysis. DataFinder has further important features for scientific documentation: it employs a heterogeneous and distributed data storage concept, enabling access to different types of data storage systems (e.g. Grid data infrastructures, file servers). In this chapter we describe a number of building blocks that are available or close to finished development. These components are intended for assembling an electronic laboratory notebook for use in Grid environments, while retaining maximal flexibility in usage scenarios and maximal compatibility with one another. Through such a system, provenance can be used to trace the scientific workflow of preparing, executing, evaluating, interpreting and archiving research data. The reliability of research results increases, and the research process remains transparent to remote research partners.
Miriam Ney, Guy K. Kloss, Andreas Schreiber
Backmatter
Metadata
Title
Data Provenance and Data Management in eScience
Edited by
Qing Liu
Quan Bai
Stephen Giugni
Darrell Williamson
John Taylor
Copyright year
2013
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-29931-5
Print ISBN
978-3-642-29930-8
DOI
https://doi.org/10.1007/978-3-642-29931-5
