
2010 | Book

Provenance and Annotation of Data and Processes

Third International Provenance and Annotation Workshop, IPAW 2010, Troy, NY, USA, June 15-16, 2010. Revised Selected Papers

Edited by: Deborah L. McGuinness, James R. Michaelis, Luc Moreau

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


About this Book

The 7 revised full papers, 11 revised medium-length papers, 6 revised short papers, and 7 demo papers presented together with 10 poster/abstract papers describing late-breaking work were carefully reviewed and selected from numerous submissions. Provenance has been recognized to be important in a wide range of areas including databases, workflows, knowledge representation and reasoning, and digital libraries. Thus, many disciplines have proposed a wide range of provenance models, techniques, and infrastructure for encoding and using provenance. The papers investigate many facets of data provenance, process documentation, data derivation, and data annotation.

Table of Contents

Frontmatter

Keynotes

On Provenance and Privacy

Provenance is a double-edged sword. On the one hand, it enables transparency, understanding the "why" and "where" of data, and reproducibility of results. On the other hand, it potentially exposes intermediate data and the functionality of modules within the workflow. However, a scientific workflow often deals with proprietary modules as well as private or confidential data, such as genomic or medical information. Hence providing exact answers to provenance queries over all executions of the workflow may reveal private information. In this talk we discuss potential privacy issues in a scientific workflow (module privacy, data privacy, and provenance privacy) and frame several natural questions: (i) Can we formally analyze module, data, or provenance privacy, giving provable privacy guarantees for an unlimited/bounded number of provenance queries? (ii) How can we answer provenance queries, providing as much information as possible to the user while still guaranteeing the required privacy? We then look at module privacy in detail and propose a formal model. Finally we point to several directions for future work.

Susan B. Davidson

Papers

The Provenance of Workflow Upgrades

Provenance has become an increasingly important part of documenting, verifying, and reproducing scientific research, but as users seek to extend or share results, it may be impractical to start from the exact original steps due to system configuration differences, library updates, or new algorithms. Although there have been several approaches for capturing workflow provenance, the problem of managing upgrades of the underlying tools and libraries orchestrated by workflows has been largely overlooked. In this paper we consider the problem of maintaining and re-using the provenance of workflow upgrades. We propose different kinds of upgrades that can be applied, including automatic mechanisms, developer-specified, and user-defined. We show how to capture provenance from such upgrades and suggest how this provenance might be used to influence future upgrades. We also describe our implementation of these upgrade techniques.

David Koop, Carlos E. Scheidegger, Juliana Freire, Cláudio T. Silva
Approaches for Exploring and Querying Scientific Workflow Provenance Graphs

While many scientific workflow systems track and record data provenance, few tools have been developed that provide convenient and effective ways to access and explore this information. Two important ways for provenance information to be accessed and explored are browsing (i.e., visualizing and navigating data and process dependencies) and querying (e.g., to select certain portions of provenance graphs or to determine whether certain paths exist between items within a graph). We extend our prior work on representing and querying data provenance by showing how these can be effectively and efficiently combined into an interactive provenance browser. The browser allows different views of provenance to be explored and queried, where queries are expressed in a declarative graph-based provenance query language. Query results are expressed as provenance subgraphs, which can be further visualized and navigated through the browser. The browser supports a generic model of provenance that can be used with various workflow computation models, and has a direct translation to the Open Provenance Model. We present the provenance model, the query language, and describe the overall browser architecture and implementation.
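To make the two query styles concrete, here is a minimal sketch over an invented toy provenance graph, using Python with networkx rather than the authors' declarative query language: a lineage-subgraph selection and a path-existence test.

```python
# A minimal sketch (not the paper's query language) of the two access styles
# the abstract describes: selecting the lineage subgraph of a data item, and
# testing whether a derivation path exists between two items.
import networkx as nx

# Edges point from an input artifact to the artifact derived from it.
prov = nx.DiGraph()
prov.add_edges_from([
    ("raw_reads", "aligned"), ("reference", "aligned"),
    ("aligned", "variants"), ("variants", "report"),
])

# Query 1: the provenance subgraph (all upstream dependencies) of "report".
lineage = prov.subgraph(nx.ancestors(prov, "report") | {"report"})
print(sorted(lineage.nodes))
# ['aligned', 'raw_reads', 'reference', 'report', 'variants']

# Query 2: does any derivation path connect "reference" to "report"?
print(nx.has_path(prov, "reference", "report"))   # True
```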

Manish Kumar Anand, Shawn Bowers, Ilkay Altintas, Bertram Ludäscher
Automatic Provenance Collection and Publishing in a Science Data Production Environment—Early Results

The Earth System Science Server (ES3) system transparently collects provenance information from executing code. Provenance information (ancestors or descendants) for any process or data granule may then be retrieved from a web service, in both textual and graphical formats. We have installed ES3 in a quasi-production environment, wherein multiple Earth satellite data streams are synthesized into daily grids of global ocean color parameters, and the resulting data granules published online. ES3’s non-intrusive nature makes its insertion into such an environment fairly straightforward, but considerations such as collating distributed provenance (from processes spread across computing clusters) and sharing unique identifiers (to link programs and data granules with their separately-maintained provenance) must still be addressed. We present for discussion our preliminary results from assembling such an environment.
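As a rough sketch of the identifier-sharing issue (the namespace URL and fields below are hypothetical, not ES3's actual scheme), deterministic UUIDs let the production pipeline and a separately maintained provenance store derive the same identifier for a data granule without coordination.

```python
# Hypothetical sketch: deterministic granule identifiers via name-based UUIDs.
import uuid

# Assumed namespace; any stable URL works as long as both sides agree on it.
GRANULE_NS = uuid.uuid5(uuid.NAMESPACE_URL, "http://example.org/granules")

def granule_id(product: str, date: str) -> uuid.UUID:
    """Same (product, date) always yields the same UUID, so the producer and
    the provenance service can compute it independently and still link up."""
    return uuid.uuid5(GRANULE_NS, f"{product}/{date}")

print(granule_id("ocean_color_daily", "2010-06-15"))
```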

James Frew, Greg Janée, Peter Slaughter
Leveraging the Open Provenance Model as a Multi-tier Model for Global Climate Research

Global climate researchers rely upon many forms of sensor data and analytical methods to help profile subtle changes in climate conditions. The U.S. Department of Energy’s Atmospheric Radiation Measurement (ARM) program provides researchers with a collection of curated Value Added Products (VAPs) resulting from continuous sensor data streams, data fusion, and modeling. We are leveraging the Open Provenance Model as a foundational construct that serves the needs of both the VAP producers and consumers. We are organizing the provenance in different tiers of granularity to model VAP lineage, causality at the component level within a VAP, and the causality for each time step as samples are being assembled within the VAP. This paper shares our implementation strategy and how the ARM operations staff and the climate research community can greatly benefit from this approach to more effectively assess and quantify VAP provenance.

Eric G. Stephan, Todd D. Halter, Brian D. Ermold
Understanding Collaborative Studies through Interoperable Workflow Provenance

The provenance of a data product contains information about how the product was derived, and is crucial for enabling scientists to easily understand, reproduce, and verify scientific results. Currently, most provenance models are designed to capture the provenance related to a single run, mostly executed by a single user. However, a scientific discovery is often the result of methodical execution of many scientific workflows, with many datasets produced at different times by one or more users. Further, to promote and facilitate the exchange of information between multiple workflow systems supporting provenance, the Open Provenance Model (OPM) has been proposed by the scientific workflow community. In this paper, we describe a new query model that captures implicit user collaborations. We show how this model maps to OPM and helps to answer collaborative queries, e.g., identifying combined workflows and contributions of users collaborating on a project based on the records of previous workflow executions. We also adopt and extend the high-level Query Language for Provenance (QLP) with additional constructs, and show how these extensions allow non-expert users to express collaborative provenance queries against this model easily and concisely. Furthermore, we adopt the Provenance Challenge 3 (PC3) workflows as a collaborative and interoperable use case scenario, in which different stages of the workflow are executed in three different workflow environments: Kepler, Taverna, and WSVLAM. Through this use case, we demonstrate how we can establish and understand collaborative studies through interoperable workflow provenance.

Ilkay Altintas, Manish Kumar Anand, Daniel Crawl, Shawn Bowers, Adam Belloum, Paolo Missier, Bertram Ludäscher, Carole A. Goble, Peter M. A. Sloot
Provenance of Software Development Processes

"Why does the build fail currently?" This and similar questions arise on a daily basis in software development processes (SDPs). There is no easy way to answer them: the required information is scattered across different tools, in this example the version control and continuous integration systems. These tools mostly live in isolated worlds, and no direct connection between their data exists. This paper proposes a solution to such problems based on provenance technologies. After outlining the complexity of an SDP, the questions arising on a daily basis are categorized. Finally, an approach to make the SDP provenance-aware is proposed, based on PRiME, the Open Provenance Model, and an SOA architecture that uses Neo4j to store the data, Gremlin to query it, and REST web services to connect to the tools.
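A toy illustration of the underlying idea, in plain Python rather than the paper's Neo4j/Gremlin stack: once version-control and continuous-integration records are connected, a "why does the build fail?" query becomes a walk from the failed build to the commits it incorporates.

```python
# Invented data; this sketches only the cross-tool linkage, not the paper's
# PRiME/OPM model.
builds = {
    "build-102": {"status": "failed", "commits": ["c7", "c8"]},
    "build-101": {"status": "ok",     "commits": ["c6"]},
}
commits = {
    "c6": {"author": "ana", "files": ["core.py"]},
    "c7": {"author": "ben", "files": ["parser.py"]},
    "c8": {"author": "ana", "files": ["tests/test_parser.py"]},
}

def suspects(build_id: str):
    """Commits feeding a failed build, with author and touched files."""
    return [(c, commits[c]["author"], commits[c]["files"])
            for c in builds[build_id]["commits"]]

print(suspects("build-102"))
# [('c7', 'ben', ['parser.py']), ('c8', 'ana', ['tests/test_parser.py'])]
```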

Heinrich Wendel, Markus Kunde, Andreas Schreiber
Provenance-Awareness in R

It is generally acknowledged that when, in 1988, John Chambers and Richard Becker incorporated the S AUDIT facility into their S statistical programming language and environment, they created one of the first provenance-aware applications. Since then, S has been spiritually succeeded by the open-source R project; however, R has no such facility for tracking provenance. This paper looks at how provenance-awareness is being introduced to CXXR (http://www.cs.kent.ac.uk/projects/cxxr), a variant of the R interpreter designed to allow creation of experimental R versions. We explore the issues surrounding recording, representing, and interrogating provenance information in a command-line driven interactive environment that utilises a lazy functional programming language. We also characterise provenance information in this domain and evaluate the impact of adding facilities for provenance tracking.

Chris A. Silles, Andrew R. Runnalls
SAF: A Provenance-Tracking Framework for Interoperable Semantic Applications

This paper describes the foundations of a framework for constructing interoperable semantic applications that support the recording of provenance information. The framework uses a client-server infrastructure to control the encoding of applications. Provenance records for application components, settings, and data sources are stored as part of the final application file using the Open Provenance Model (OPM) [1]. The application can render events such as setting changes to users so that they can identify when collaborators make changes to the application. We demonstrate how the system can be used to collaborate on a project, identify errors in data sources, and extrapolate insights to other data sets by making changes to the application. Lastly, we outline some key issues related to using asymmetric key encryption for tracking changes in semantic content and how we address them (or not) within this framework.

Evan W. Patton, Dominic Difranzo, Deborah L. McGuinness
Publishing and Consuming Provenance Metadata on the Web of Linked Data

The World Wide Web is evolving into a Web of Data: a huge, globally distributed dataspace that contains a rich body of machine-processable information from a virtually unbounded set of providers covering a wide range of topics. However, due to the openness of the Web, little is known about who created the data and how. The fact that a large amount of the data on the Web is derived by replication, query processing, modification, or merging raises concerns about information quality. Poor-quality data may propagate quickly and contaminate the Web of Data. Provenance information about who created and published the data, and how, provides the means for quality assessment. This paper takes a first step towards creating a quality-aware Web of Data: we present approaches to integrate provenance information into the Web of Data and illustrate how this information can be consumed. In particular, we introduce a vocabulary to describe the provenance of Web data as metadata, and we discuss possibilities for making such provenance metadata accessible as part of the Web of Data. Furthermore, we describe how this metadata can be queried and consumed to identify outdated information.
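As a minimal sketch of the publish-then-consume pattern described here, the following uses rdflib; the vocabulary terms (prv:createdBy, prv:retrievedAt) are illustrative stand-ins, not the exact vocabulary the paper introduces.

```python
# Attach provenance metadata to a Web resource, then query it for staleness.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

PRV = Namespace("http://example.org/provenance#")   # assumed vocabulary
g = Graph()
g.bind("prv", PRV)

doc = URIRef("http://example.org/data/dataset1")
g.add((doc, PRV.createdBy, URIRef("http://example.org/agents/alice")))
g.add((doc, PRV.retrievedAt,
       Literal("2010-05-20T09:30:00", datatype=XSD.dateTime)))

# Consuming: flag data retrieved before a cutoff as potentially outdated.
query = """
PREFIX prv: <http://example.org/provenance#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?doc WHERE {
  ?doc prv:retrievedAt ?t .
  FILTER (?t < "2010-06-01T00:00:00"^^xsd:dateTime)
}
"""
for row in g.query(query):
    print(row.doc, "may be outdated")
```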

Olaf Hartig, Jun Zhao
POMELo: A PML Online Editor

This paper introduces POMELo, a simple, web-based PML (Proof Markup Language) editor. The objective of POMELo is to allow users to create, edit, validate and export provenance information in the form of PML documents. This application was developed with provenance novices in mind, making it usable in various settings, from educational to scientific. Since this is a web-based application, users do not need to install or run any software aside from a normal web browser, which simplifies its adoption and makes it more attractive for inexperienced users.

Alvaro Graves
Capturing Provenance in the Wild

All current provenance systems are “closed world” systems; provenance is collected within the confines of a well understood, pre-planned system. However, when users compose services from heterogeneous systems and organizations to form a new application, it is impossible to track the provenance in the new system using currently available work. In this work, we describe the ability to compose multiple provenance-unaware services in an “open world” system and still collect provenance information about their execution. Our approach is implemented using the PLUS provenance system and the open source MULE Enterprise Service Bus. Our evaluations show that this approach is scalable and has minimal overhead.

M. David Allen, Adriane Chapman, Barbara Blaustein, Len Seligman
Automatically Adapting Source Code to Document Provenance

Being able to ask questions about the provenance of some data requires documentation on each influence on that data’s existence and content. Much software exists, and is being developed, for which there is no provenance-awareness, i.e. at best, the data it outputs can be connected to its inputs, but with no record of intermediate processing. Further, where some record of processing does exist, e.g. as logs, it is not in a form easily connected with that of other processes. We would like to enable compiled software to record useful documentation without requiring prior manual adaptation. In this paper, we present an approach to adapting source code from its original form without manual manipulation, to record information on data provenance during execution.
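A simplified, Python-only sketch of the general idea (the paper targets compiled software, and this is not its mechanism): rewrite source automatically so that every function records its inputs when called, with no manual edits to the original code.

```python
# Assumed toy instrumentation via AST rewriting; names like _prov are invented.
import ast

class Instrument(ast.NodeTransformer):
    """Insert a provenance-recording statement at the top of each function."""
    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        stmt = ast.parse(
            f"_prov.append(({node.name!r}, dict(locals())))"
        ).body[0]
        node.body.insert(0, stmt)
        return node

source = "def scale(x, factor):\n    return x * factor\n"
tree = ast.fix_missing_locations(Instrument().visit(ast.parse(source)))

env = {"_prov": []}
exec(compile(tree, "<instrumented>", "exec"), env)
env["scale"](3, factor=2.5)
print(env["_prov"])   # [('scale', {'x': 3, 'factor': 2.5})]
```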

Simon Miles
Using Data Provenance to Measure Information Assurance Attributes

Data Provenance is multi-dimensional metadata that specifies Information Assurance attributes like Confidentiality, Authenticity, Integrity, and Non-Repudiation. It may also include ownership, processing details, and other attributes. Further, each Information Assurance attribute may itself have sub-components, such as objective and subjective values, or application security versus transport security. Traditionally, Information Assurance attributes have been specified probabilistically as a belief value (or corresponding disbelief value) in that attribute. In this paper we introduce a framework based on Subjective Logic that incorporates uncertainty by representing values as a triple of <belief, disbelief, uncertainty>. This framework also allows us to work with conflicting Information Assurance attribute values that may arise from multiple views of an object. We also introduce a formal semantic model for specifying and reasoning over Information Assurance properties in a workflow. Data Provenance information can grow substantially as the amount of information kept for each object increases and as the complexity of a workflow increases. In such situations, it may be necessary to summarize the Data Provenance information; the summarization may depend on the Information Assurance attributes as well as the type of analysis used for Data Provenance. We show how such summarization can be done and how it can be used to generate trust values in the data. We also discuss how the Information Assurance values can be visualized.
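A minimal sketch of the <belief, disbelief, uncertainty> triples described above, using Jøsang's standard consensus operator to fuse two independent opinions about the same attribute; the paper's own framework may differ in its details.

```python
# Subjective Logic opinions; the three components sum to 1.
from dataclasses import dataclass

@dataclass
class Opinion:
    belief: float
    disbelief: float
    uncertainty: float

    def fuse(self, other: "Opinion") -> "Opinion":
        """Consensus of two opinions; uncertainty shrinks as evidence combines."""
        k = (self.uncertainty + other.uncertainty
             - self.uncertainty * other.uncertainty)
        return Opinion(
            (self.belief * other.uncertainty + other.belief * self.uncertainty) / k,
            (self.disbelief * other.uncertainty + other.disbelief * self.uncertainty) / k,
            (self.uncertainty * other.uncertainty) / k,
        )

integrity_a = Opinion(0.7, 0.1, 0.2)   # e.g., view from transport security
integrity_b = Opinion(0.6, 0.0, 0.4)   # e.g., view from application security
print(integrity_a.fuse(integrity_b))
# Opinion(belief=0.769..., disbelief=0.0769..., uncertainty=0.1538...)
```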

Abha Moitra, Bruce Barnett, Andrew Crapo, Stephen J. Dil
Explorations into the Provenance of High Throughput Biomedical Experiments

The field of translational biomedical informatics seeks to integrate knowledge from basic science, directed research into diseases, and clinical insights into a form that can be used to discover effective treatments of diseases. We demonstrate methods and tools to generate RDF representations of a commonly used experimental description format, MAGE-TAB, and mappings of MAGE documents to two general-purpose provenance representations: OPM (Open Provenance Model) and PML (Proof Markup Language). We show, through a use case simulation and round-trip analysis of selected examples, that the data represented in MAGE documents can be completely represented in OPM and PML. The success in mapping MAGE documents into general-purpose provenance models shows promise for the implementation of the translational research provenance vision.

Jamie P. McCusker, Deborah L. McGuinness
Janus: From Workflows to Semantic Provenance and Linked Open Data

Data provenance graphs are a form of metadata that can be used to establish a variety of properties of data products that undergo sequences of transformations, typically specified as workflows. Their usefulness for answering user provenance queries is limited, however, unless the graphs are enhanced with domain-specific annotations. In this paper we propose a model and architecture for semantic, domain-aware provenance, and demonstrate its usefulness in answering typical user queries. Furthermore, we discuss the additional benefits and the technical implications of publishing provenance graphs as a form of Linked Data. A prototype implementation of the model is available for data produced by the Taverna workflow system.

Paolo Missier, Satya S. Sahoo, Jun Zhao, Carole Goble, Amit Sheth
Provenance-Aware Faceted Search in Drupal

As web content is increasingly generated in more diverse situations, provenance is becoming more and more critical. While a variety of approaches have been investigated for capturing and making use of provenance metadata, arguably no single best-practice approach has emerged. In this paper, we investigate an approach that leverages one of the most popular content management systems, Drupal. More specifically, we study how provenance metadata can be captured and later published as structured data on the Web using Drupal. We also demonstrate how provenance metadata can be used to facilitate faceted search in Drupal.

Zhenning Shangguan, Jinguang Zheng, Deborah L. McGuinness
Securing Provenance-Based Audits

Given the significant increase of on-line services that require personal information from users, the risk that such information is misused has become an important concern. In such a context, information accountability is desirable since it allows users (and society in general) to decide, by means of audits, whether information is used appropriately. To ensure information accountability, information flow should be made transparent. It has been argued that data provenance can be used as the mechanism to underpin such transparency. Under these conditions, an audit's quality depends on the quality of the captured provenance information. Thus, the integrity of provenance information emerges as a decisive issue in the quality of a provenance-based audit. The aim of this paper is to secure provenance-based audits by the inclusion of cryptographic elements in the communication between the involved entities as well as in the provenance representation. This paper also presents a formalisation and an automatic verification of a set of security properties that increase the level of trust in provenance-based audit results.
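A minimal sketch, assumed rather than taken from the paper's formalised protocol, of adding cryptographic elements to provenance records: each record is hash-chained to its predecessor and signed, so tampering or reordering becomes detectable at audit time.

```python
# Hash-chained, signed provenance records using the `cryptography` package.
import hashlib
import json

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

class SignedAuditLog:
    def __init__(self):
        self._key = ed25519.Ed25519PrivateKey.generate()
        self.public_key = self._key.public_key()
        self.records = []                # (payload, digest, signature)
        self._prev = b"\x00" * 32        # genesis hash

    def append(self, record: dict) -> None:
        payload = json.dumps(record, sort_keys=True).encode()
        digest = hashlib.sha256(self._prev + payload).digest()
        self.records.append((payload, digest, self._key.sign(digest)))
        self._prev = digest

    def verify(self) -> bool:
        prev = b"\x00" * 32
        for payload, digest, signature in self.records:
            if hashlib.sha256(prev + payload).digest() != digest:
                return False             # chain broken: altered or reordered
            try:
                self.public_key.verify(signature, digest)
            except InvalidSignature:
                return False
            prev = digest
        return True

log = SignedAuditLog()
log.append({"entity": "web-service", "used": "email", "purpose": "delivery"})
print(log.verify())   # True; flipping any byte of a record makes this False
```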

Rocío Aldeco-Pérez, Luc Moreau
System Transparency, or How I Learned to Worry about Meaning and Love Provenance!

Web-based science analysis and processing tools allow users to access, analyze, and generate visualizations of data without requiring the user to be an expert in data processing. These tools simplify science analysis for all science users by reducing the data processing overhead for the user. The benefits of these tools come with a cost: an increased need for transparency in data processing. By providing a clear explanation of the science concepts and processing performed by the science analysis tool, we can increase user trust, understanding, and accountability, and reduce misinterpretation or the generation of inconsistent results.

We will demonstrate knowledge provenance (processing lineage and related domain information) presentation capabilities applied to an existing web-based Earth science data analysis tool (e.g. Giovanni from NASA/GSFC). Our conclusion is that user accessible visual presentations of knowledge provenance are key to building meaningful user understanding of analysis and processing decisions and should be a key component of data analysis tools.

Stephan Zednik, Peter Fox, Deborah L. McGuinness
Pedigree Management and Assessment Framework (PMAF)

The Pedigree Management and Assessment Framework (PMAF) is a customizable framework for writing, retrieving and assessing provenance and other metadata that reflects the quality of an information object (such as a document), the relationships between information objects and resources (such as people and organizations), etc. PMAF stores metadata in a volume-efficient format using RDF (Resource Description Framework), and can write and query metadata at a fine-grained level. Once metadata has been stored in PMAF, the user can run a variety of assessments (predefined queries) to reveal particular aspects of the metadata graph. We will demonstrate the PMAF browser interface, which can be used to view the existing metadata graph for an information object; the PMAF assessment interface, which allows the user to select and run predefined queries on the metadata; and the integration of PMAF with a standard document editor and content management system.

Kenneth A. McVearry
Provenance-Based Strategies to Develop Trust in Semantic Web Applications

Linked data and Semantic Web technologies enable people to navigate across heterogeneous sources of data thus making it easier for them to explore and develop multiple perspectives for use in making decisions and solving problems. While the Semantic Web offers benefits for developers and users, several new challenges are emerging that may negatively impact users’ trust in Web-based collaborative systems.

This paper describes several use cases to illustrate potential trust issues faced by Semantic Web applications, and provides a concrete example for each using a specific system we built to investigate United States Supreme Court decision making. Provenance-based solutions are proposed to develop trust and/or minimize the distrust that is provoked by the situation. While these use cases address distinct situations, they are all described in terms of how a contradiction can arise between the user’s mental model and the statements presented in the display. This commonality may be used to develop additional classes of trust-threatening use cases, and the proposed provenance-based solutions can be applied to many other Semantic Web Applications.

Xian Li, Timothy Lebo, Deborah L. McGuinness
Reflections on Provenance Ontology Encodings

As more data (especially scientific data) is digitized and put on the Web, it is desirable to make provenance metadata easy to access, reuse, integrate and reason over. Ontologies can be used to encode expectations and agreements concerning provenance metadata representation and computation. This paper analyzes a selection of popular Semantic Web provenance ontologies such as the Open Provenance Model (OPM), Dublin Core (DC) and the Proof Markup Language (PML). Selected initial findings are reported in this paper: (i) concept coverage analysis – we analyze the coverage, similarities and differences among primitive concepts from different provenance ontologies, based on identified themes; and (ii) concept modeling analysis – we analyze how Semantic Web language features were used to support computational provenance semantics. We expect the outcome of this work to provide guidance for understanding, aligning and evolving existing provenance ontologies.

Li Ding, Jie Bao, James R. Michaelis, Jun Zhao, Deborah L. McGuinness
Abstract Provenance Graphs: Anticipating and Exploiting Schema-Level Data Provenance

Provenance graphs capture flow and dependency information recorded during scientific workflow runs, which can be used subsequently to interpret, validate, and debug workflow results. In this paper, we propose the new concept of Abstract Provenance Graphs (APGs). APGs are created via static analysis of a configured workflow $W$ and its input data schema, i.e., before $W$ is actually executed. They summarize all possible provenance graphs the workflow $W$ can create with input data of type $\tau$; that is, for each input $v \in \tau$ there exists a graph homomorphism $\mathcal{H}_v$ between the concrete and abstract provenance graph. APGs are helpful during workflow construction since (1) they make certain workflow design bugs (e.g., selecting no or the wrong input data for the actors) easy to spot; and (2) they show the evolution of the overall data organization of a workflow. Moreover, after workflows have been run, APGs can be used to validate concrete provenance graphs. A more detailed version of this work is available as [14].
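To make the validation step concrete, here is a small assumed sketch (not the authors' implementation) of checking that a node mapping $h$ from a concrete provenance graph to an APG is a graph homomorphism, i.e., that every concrete edge lands on an abstract edge.

```python
# Toy graphs; node names are invented for illustration.
def is_homomorphism(h, concrete_edges, abstract_edges):
    """True iff h maps every concrete edge onto an abstract edge."""
    return all((h[u], h[v]) in abstract_edges for u, v in concrete_edges)

# Abstract graph: any CSV input feeds a Filter step, which feeds a Plot step.
abstract_edges = {("csv", "filter"), ("filter", "plot")}

# One concrete run, with two input files mapped onto the same abstract node.
concrete_edges = {("a.csv", "filter#1"), ("b.csv", "filter#1"),
                  ("filter#1", "plot#1")}
h = {"a.csv": "csv", "b.csv": "csv", "filter#1": "filter", "plot#1": "plot"}

print(is_homomorphism(h, concrete_edges, abstract_edges))   # True
```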

Daniel Zinn, Bertram Ludäscher
On the Use of Semantic Abstract Workflows Rooted on Provenance Concepts

Two challenges related to capturing provenance about scientific data are: 1) determining an adequate level of granularity to encode provenance, and 2) encoding provenance in a way that facilitates end-user interpretation and analysis. A solution to address these challenges consists in integrating two technologies: Semantic Abstract Workflows (SAWs), which are used to capture a domain expert’s understanding of a scientific process, and PML, an extensible language used to encode provenance. This paper describes relevant features of these technologies for addressing the granularity and interpretation challenges of provenance encoding and presents a discussion about their integration.

Leonardo Salayandia, Paulo Pinheiro da Silva
Provenance of Decisions in Emergency Response Environments

Mitigating the devastating ramifications of major disasters requires emergency workers to respond in a maximally efficient way. Information systems can improve their efficiency by organizing their efforts and automating many of their decisions. However, if a system does not document how its decisions were made, those decisions cannot be reviewed to check the reasoning behind them or their compliance with policies. We apply the concept of provenance to decision making in emergency response situations and use the Open Provenance Model to express provenance produced in the RoboCup Rescue Simulation. We produce provenance DAGs using a novel OPM profile that conceptualizes decisions in the context of emergency response. Finally, we traverse the OPM DAGs to answer provenance questions about those decisions.

Iman Naja, Luc Moreau, Alex Rogers
An Approach to Enhancing Workflows Provenance by Leveraging Web 2.0 to Increase Information Sharing, Collaboration and Reuse

Web 2.0 promises users a more enjoyable content-creation experience by providing easy-to-use information sharing and collaboration tools and by focusing on user-centered design. Provenance in scientific workflow management is one kind of user-generated data that can benefit from Web 2.0. We propose a set of Web 2.0 technologies that are simple to implement and can be immediately leveraged by scientific users. By using the Atom Syndication Protocol to represent workflow state and its provenance, users can easily disseminate their scientific results. Collaboration and authoring can be facilitated by using the Atom Publishing Protocol and standard Web 2.0 blogging tools to publish and annotate provenance. Users can search stored provenance by using search engines, and if search results are returned in standard Atom form, for example when search engines support the OpenSearch standard, Atom feeds can be used to monitor provenance changes, increasing the likelihood of discoveries. By adopting these Web 2.0 standards, the value of scientific provenance data increases as it becomes a natural part of the growing variety of user-generated scientific (and non-scientific) content.
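A sketch of publishing a workflow-provenance event as an Atom entry, using only the standard library; the element names follow the Atom format (RFC 4287), but the payload fields below are hypothetical, not a schema from the paper.

```python
# Build a minimal Atom entry describing one workflow run.
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)

entry = ET.Element(f"{{{ATOM}}}entry")
ET.SubElement(entry, f"{{{ATOM}}}id").text = (
    "urn:uuid:6f1c0e0a-7a2b-4d7e-9b1a-2f3c4d5e6f70")
ET.SubElement(entry, f"{{{ATOM}}}title").text = "Workflow run 42 completed"
ET.SubElement(entry, f"{{{ATOM}}}updated").text = "2010-06-15T12:00:00Z"
summary = ET.SubElement(entry, f"{{{ATOM}}}summary")
summary.text = "inputs: reads.fastq; outputs: variants.vcf; status: ok"

# The serialized entry could be POSTed to any Atom Publishing Protocol
# endpoint (e.g. a blog), making the run's provenance searchable and
# subscribable via ordinary feed readers.
print(ET.tostring(entry, encoding="unicode"))
```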

Aleksander Slominski
StarFlow: A Script-Centric Data Analysis Environment

We introduce StarFlow, a script-centric environment for data analysis. StarFlow has four main features: (1) extraction of control and data-flow dependencies through a novel combination of static analysis, dynamic runtime analysis, and user annotations, (2) command-line tools for exploring and propagating changes through the resulting dependency network, (3) support for workflow abstractions enabling robust parallel executions of complex analysis pipelines, and (4) a seamless interface with the Python scripting language. We describe real applications of StarFlow, including automatic parallelization of complex workflows in the cloud.
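A hypothetical annotation-based sketch in the spirit of StarFlow's user annotations (the real StarFlow API may differ): declaring each step's file inputs and outputs lets a tool rebuild the dependency network and compute which steps a changed file invalidates.

```python
# Invented decorator and registry; illustrative only.
DEPENDS = {}   # step name -> (input files, output files)

def depends(inputs, outputs):
    def wrap(fn):
        DEPENDS[fn.__name__] = (set(inputs), set(outputs))
        return fn
    return wrap

@depends(inputs={"raw.csv"}, outputs={"clean.csv"})
def clean(): ...

@depends(inputs={"clean.csv"}, outputs={"model.pkl"})
def train(): ...

def stale(changed):
    """Steps to rerun; iterate to a fixpoint to propagate changes downstream."""
    out, grew = set(), True
    while grew:
        grew = False
        for name, (ins, outs) in DEPENDS.items():
            if name not in out and ins & changed:
                out.add(name)
                changed = changed | outs
                grew = True
    return out

print(stale({"raw.csv"}))   # {'clean', 'train'}
```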

Elaine Angelino, Daniel Yamins, Margo Seltzer
GExpLine: A Tool for Supporting Experiment Composition

Scientific experiments present several advantages when modeled at a high abstraction level, independent of Scientific Workflow Management System (SWfMS) specification languages. For example, the scientist can define the scientific hypothesis in terms of algorithms and methods. This high-level experiment can then be mapped into different scientific workflow instances, which can be executed by a SWfMS and take advantage of its provenance records. However, each workflow execution is often treated by the SWfMS as an independent instance. There are no tools that allow modeling the conceptual experiment and linking it to the diverse workflow execution instances. This work presents GExpLine, a tool for supporting experiment composition through provenance. In an analogy to software development, it can be seen as a CASE tool, while a SWfMS can be seen as an IDE. It provides a conceptual representation of the scientific experiment and automatically associates workflow executions with the concept of an experiment. By using prospective provenance from the experiment, GExpLine generates corresponding workflows that can be executed by a SWfMS. This paper also presents a real experiment use case that reinforces the importance of GExpLine and its prospective provenance support.

Daniel de Oliveira, Eduardo Ogasawara, Fernando Seabra, Vítor Silva, Leonardo Murta, Marta Mattoso
Data Provenance in Distributed Propagator Networks

The heterogeneous and unreliable nature of distributed systems has created a distinct need for the inclusion of provenance within their design to allow for error correction and redundancy. Many traditional distributed systems have limited provenance tracing abilities, usually included in generic workflow generation or in an application-specific way. The novel programming paradigm of distributed propagator networks allows for the inclusion of provenance from the ground up.

In this paper, I present the concept of propagator networks and demonstrate how provenance may be easily integrated into programs built using them. I also demonstrate the possibility of converting non-provenance-aware applications built using propagator networks into provenance-aware applications by simply performing a transformation of the existing program structure.

Ian Jacobi
Towards Provenance Aware Comment Tracking for Web Applications

Provenance has been demonstrated as an important component in web applications such as mashups, as a means of resolving user questions. However, such provenance may not be usable by all members of a given application's user base. In this paper, we discuss how crowdsourcing could be employed to allow individual users to get questions answered by the greater user base. We begin by discussing a technology-agnostic model for incorporating Provenance Aware Comment Trackers (PACTs) into web applications. Following this, we present an example of a PACT-extended application with two accompanying use cases.

James R. Michaelis, Deborah L. McGuinness
Browsing Proof Markup Language Provenance: Enhancing the Experience

Probe-It! is a browser that allows users to navigate through Proof Markup Language (PML) based provenance traces by interacting with a number of different perspectives or views [1]. These views provide specific renderings or presentations for the different kinds of provenance information defined in the PML ontology [2]. Throughout our three year experience with Probe-It! we have gathered requirements from users who have a need for browsing PML captured from theorem provers in the Thousands of Problems for Theorem Provers (TPTP) and Homeland Security domains as well as from scientific processes in areas such as solar astronomy, seismology, and environmental science. This paper briefly describes the enhancements made to Probe-It! to improve usability and performance with regards to visualization.

Nicholas Del Rio, Paulo Pinheiro da Silva, Hugo Porras
Towards a Threat Model for Provenance in e-Science

Scientists increasingly rely on workflow management systems to perform large-scale computational scientific experiments. These systems often collect provenance information that is useful in the analysis and reproduction of such experiments. On the other hand, this provenance data may be exposed to security threats which can result, for instance, in compromising the analysis of these experiments, or in illegitimate claims of attribution. In this work, we describe our ongoing work to trace security requirements for provenance systems in the context of e-Science, and propose some security controls to fulfill them.

Luiz M. R. Gadelha Jr., Marta Mattoso, Michael Wilde, Ian Foster
Provenance Support for Content Management Systems: A Drupal Example

Provenance helps with understanding data, but without proper tools to share and access content, its reusability is limited. This paper describes the CI-Server framework currently being used to help scientific teams seamlessly share data and provenance about scientific research. CI-Server has been built using Drupal, a content management server workbench, with a focus on publishing and understanding the semantic content that is now available over the Web. By focusing on an open framework, scientists publish provenance related to their scientific research and then leverage the semantic knowledge to understand and visualize the information.

Aída Gándara, Paulo Pinheiro da Silva
ProvenanceJS: Revealing the Provenance of Web Pages

Web pages are regularly constructed through combining content from multiple providers (e.g. photos from Flickr, quotes from the New York Times). As a result, it is often difficult for users and programmers to retrieve the provenance of a web page. Here, we present a JavaScript library, ProvenanceJS, that allows for the retrieval and visualization of the provenance information within a Web page and its embedded content. A key contribution is to demonstrate that provenance can be supported using widely deployed browser-based technologies.

Paul Groth
Integrating Provenance Data from Distributed Workflow Systems with ProvManager

Running scientific workflows in distributed environments is motivating the definition of provenance gathering approaches that are loosely coupled to the workflow execution engine. This kind of approach is interesting because it allows both storage of and access to provenance data in an integrated way, even in an environment where different workflow systems work together. Therefore, we have proposed a provenance gathering strategy that is independent of the workflow system technology. This strategy has evolved into a provenance management system named ProvManager. In this paper we show how provenance data is captured in a distributed execution environment with ProvManager, and we present its web interface, in which scientists can register experiments, monitor workflow execution, and query provenance data.

Anderson Marinho, Leonardo Murta, Cláudia Werner, Vanessa Braganholo, Eduardo Ogasawara, Sérgio Manuel Serra da Cruz, Marta Mattoso
Using Data Lineage for Sub-image Processing

In this paper, we show that lineage data collected during the processing and analysis of datasets can be reused to perform selective reprocessing (at the sub-image level) while the remainder of the dataset is left untouched, a process that is rather difficult to automate without lineage.
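A schematic sketch of the selective-reprocessing idea, with invented names and structures (not the paper's system): lineage recorded per tile tells us exactly which sub-images a changed input touched, so only those are recomputed.

```python
# Per-tile lineage: which inputs each sub-image was derived from.
lineage = {
    "tile_0_0": {"frame_a.fits", "flatfield_v1"},
    "tile_0_1": {"frame_b.fits", "flatfield_v1"},
    "tile_1_0": {"frame_c.fits", "flatfield_v2"},
}

def tiles_to_reprocess(changed_inputs):
    """Only tiles whose lineage intersects the changed inputs are redone;
    the rest of the mosaic is untouched."""
    return {t for t, inputs in lineage.items() if inputs & changed_inputs}

print(tiles_to_reprocess({"flatfield_v1"}))   # {'tile_0_0', 'tile_0_1'}
```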

Johnson Mwebaze, John McFarland, Danny Boxhoorn, Hugo Buddelmeijer, Edwin Valentijn
I Think Therefore I Am Someone Else: Understanding the Confusion of Granularity with Continuant/Occurrent and Related Perspective Shifts

Managing multiscale and multi-witness provenance is often assumed to involve relatively straightforward matters of matching identifiers and recognizing composite processes and aggregate artifacts. However, the issue is much more complex and relates to millennia of debate over the nature of objects and processes in the world. This work develops a set of concrete examples where such issues arise in provenance, discusses the core conceptual distinctions involved, and postulates a basic mechanism for extending provenance models to enable integration across granularities and process types, recognizing the OPM 'agent' concept as a special case.

James D. Myers
A Multi-faceted Provenance Solution for Science on the Web

To support the interface between scientific research and the wider public policy agenda it is essential to make the provenance of research processes and artefacts more transparent and subject to scrutiny. We outline the requirements for a multi-faceted approach to provenance and present a Web-based virtual research environment (ourSpaces) to demonstrate how research artefacts, projects, geographical locations and online communications can be linked in order to facilitate collaborative research.

Edoardo Pignotti, Peter Edwards, Richard Reid
Social Web-Scale Provenance in the Cloud

The lower barrier to entry for users to create and share resources through applications like Facebook and Twitter, and the commoditization of social Web data has heightened issues of privacy, attribution, and copyright. These make it important to track the provenance of social Web data. We outline and discuss key engineering, privacy, and monetization challenges in collecting and analyzing provenance of social Web resources.

Yogesh Simmhan, Karthik Gomadam
Using Domain Requirements to Achieve Science-Oriented Provenance

The US Department of Energy (DOE) Atmospheric Radiation Measurement (ARM) Program is adopting the use of formalized provenance to support observational data products produced by ARM operations and relied upon by researchers. Because of the diversity of needs in the climate community, provenance will need to be conveyed in a domain-oriented context. This paper explores a use case where semantic abstract workflows (SAWs) are employed as a means to filter, aggregate, and contextually describe the historical events responsible for the ARM data product the scientist relies upon.

Eric Stephan, Todd Halter, Terence Critchlow, Paulo Pinheiro da Silva, Leonardo Salayandia
Backmatter
Metadata
Title
Provenance and Annotation of Data and Processes
Edited by
Deborah L. McGuinness
James R. Michaelis
Luc Moreau
Copyright Year
2010
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-17819-1
Print ISBN
978-3-642-17818-4
DOI
https://doi.org/10.1007/978-3-642-17819-1
