
About this Book

This book constitutes the refereed proceedings of the 6th International Provenance and Annotation Workshop, IPAW 2016, held in McLean, VA, USA, in June 2016.
The 12 revised full papers, 14 poster papers, and 2 demonstration papers presented were carefully reviewed and selected from 54 submissions. The papers feature state-of-the-art research and practice around the automatic capture, representation, and use of provenance. They are organized in topical sections on provenance capture, provenance analysis and visualization, and provenance models and applications.



Erratum to: Trade-Offs in Automatic Provenance Capture

Without Abstract
Manolis Stamatogiannakis, Hasanat Kazmi, Hashim Sharif, Remco Vermeulen, Ashish Gehani, Herbert Bos, Paul Groth

Provenance Capture


RecProv: Towards Provenance-Aware User Space Record and Replay

Deterministic record and replay systems have been widely used in software debugging, failure diagnosis, and intrusion detection. In order to detect Advanced Persistent Threats (APTs), online execution needs to be recorded with acceptable runtime overhead; investigators can then analyze the replayed execution with heavy dynamic instrumentation. While most record and replay systems rely on kernel modules or OS virtualization, those running in user space are favoured for being lighter weight and more portable, requiring none of the changes needed for OS/kernel virtualization. At the same time, higher-level provenance data supplies dynamic analysis with system causalities and greatly increases its efficiency. Combining both benefits, we propose a provenance-aware user space record and replay system, called RecProv. RecProv is designed to provide high provenance fidelity, specifically by versioning files from the recorded trace logs and protecting the integrity of provenance data through real-time trace isolation. The collected provenance captures the high-level system dependencies that help pinpoint suspicious activities, to which further analysis can be applied. We show that RecProv is able to output accurate provenance in both visualized graph and W3C-standardized PROV-JSON formats.
Yang Ji, Sangho Lee, Wenke Lee
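As a rough illustration of the W3C PROV-JSON output format the RecProv abstract mentions, the sketch below builds a minimal PROV-JSON-like document in Python: a process (activity) reads one file version and writes another. All identifiers, file names, and attribute keys here are invented for illustration; they are not taken from the paper.

```python
import json

# Hypothetical sketch of a PROV-JSON document such as a record-and-replay
# tool might emit. Entity/activity identifiers are invented.
prov_doc = {
    "entity": {
        "ex:config.txt_v1": {"prov:type": "file", "ex:version": 1},
        "ex:config.txt_v2": {"prov:type": "file", "ex:version": 2},
    },
    "activity": {
        "ex:pid4242": {"prov:type": "process", "ex:exe": "/usr/bin/editor"},
    },
    # The process used version 1 of the file...
    "used": {
        "_:u1": {"prov:activity": "ex:pid4242", "prov:entity": "ex:config.txt_v1"},
    },
    # ...and generated version 2.
    "wasGeneratedBy": {
        "_:g1": {"prov:entity": "ex:config.txt_v2", "prov:activity": "ex:pid4242"},
    },
}

serialized = json.dumps(prov_doc, indent=2)
print(len(serialized) > 0)
```

The point of the format is that plain JSON tooling suffices: the document round-trips through any JSON library without PROV-specific code.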

Tracking and Analyzing the Evolution of Provenance from Scripts

Script languages are powerful tools for scientists. Scientists use them to process data, invoke programs, and link program outputs/inputs. During the life cycle of scientific experiments, scientists compose scripts, execute them, and perform analysis on the results. Depending on the results, they modify their script to get more data to confirm the original hypothesis or to test a new hypothesis, evolving the experiment. While some tools capture provenance from the execution of scripts, most approaches focus on a single execution, leaving out the possibility to analyze the provenance evolution of the experiment as a whole. This work enables tracking and analyzing the provenance evolution gathered from scripts. Tracking the provenance evolution also helps to reconstruct the environment of previous executions for reproduction. Provenance evolution analysis allows comparison of executions to understand what has changed and supports the decision of which execution provides better results.
João Felipe Pimentel, Juliana Freire, Vanessa Braganholo, Leonardo Murta

Trade-Offs in Automatic Provenance Capture

Automatic provenance capture from arbitrary applications is a challenging problem. Different approaches to tackle this problem have evolved, most notably (a) system-event trace analysis, (b) compile-time static instrumentation, and (c) taint flow analysis using dynamic binary instrumentation. Each of these approaches offers different trade-offs in terms of the granularity of captured provenance, integration requirements, and runtime overhead. While these aspects have been discussed separately, a systematic and detailed study, quantifying and elucidating them, is still lacking. To fill this gap, we begin to explore these trade-offs for representative examples of these approaches for automatic provenance capture by means of evaluation and measurement. We base our evaluation on UnixBench—a widely used benchmark suite within systems research. We believe this approach will make our results easier to compare with future studies.
Manolis Stamatogiannakis, Hasanat Kazmi, Hashim Sharif, Remco Vermeulen, Ashish Gehani, Herbert Bos, Paul Groth

Analysis of Memory Constrained Live Provenance

We conjecture that meaningful analysis of large-scale provenance can be preserved by analyzing provenance data in limited memory while the data is still in motion; the provenance need not be fully resident before analysis can occur. As a proof of concept, this paper defines a stream model for reasoning about provenance data in motion for Big Data provenance. We propose a novel streaming algorithm for the backward provenance query and apply it to live provenance captured from agent-based simulations. The performance test demonstrates high throughput, low latency, and good scalability in a distributed stream processing framework built on Apache Kafka and Spark Streaming.
Peng Chen, Tom Evans, Beth Plale
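To make the idea of a backward provenance query over data in motion concrete, here is a minimal in-memory sketch (not the paper's Kafka/Spark algorithm): derivation edges arrive one at a time, and the backward query can be answered incrementally at any point from the state accumulated so far. Edge and entity names are invented.

```python
from collections import defaultdict

# Streaming state: for each derived entity, the sources seen so far.
parents = defaultdict(set)

def consume(edge):
    """Ingest one (derived, source) edge from the live stream."""
    derived, source = edge
    parents[derived].add(source)

def backward(entity):
    """Entities reachable backward from `entity`, given edges seen so far."""
    seen, stack = set(), [entity]
    while stack:
        for src in parents[stack.pop()]:
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen

for e in [("out", "mid"), ("mid", "in1"), ("mid", "in2")]:
    consume(e)
print(sorted(backward("out")))  # ['in1', 'in2', 'mid']
```

A real streaming implementation must additionally bound this state (the paper's memory-constrained setting), e.g. by windowing or evicting old edges.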

Provenance Analysis and Visualization


Analyzing Provenance Across Heterogeneous Provenance Graphs

Provenance generated by different workflow systems is generally expressed using different formats. This is not an issue when scientists analyze provenance graphs in isolation, or when they use the same workflow system. However, analyzing heterogeneous provenance graphs from multiple systems poses a challenge. To address this problem, we adopt ProvONE as an integration model, and show how different provenance databases can be converted to a global ProvONE schema. Scientists can then query this integrated database, exploring and linking provenance across several different workflows that may represent different implementations of the same experiment. To illustrate the feasibility of our approach, we developed conceptual mappings between the provenance databases of two workflow systems (e-Science Central and SciCumulus). We provide cartridges that implement these mappings and generate an integrated provenance database expressed as Prolog facts. To demonstrate its usage, we have developed Prolog rules that enable scientists to query the integrated database.
Wellington Oliveira, Paolo Missier, Kary Ocaña, Daniel de Oliveira, Vanessa Braganholo
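The cartridge idea above can be sketched very simply: each cartridge is a small adapter that normalizes a system-specific provenance record into one common shape, after which the records can be queried together. The field names, record contents, and common triple shape below are invented for illustration; the paper's actual target is the ProvONE schema materialized as Prolog facts.

```python
# Hypothetical cartridges: normalize records from two workflow systems
# into a common (activity, used, generated) triple. Field names invented.
def escience_cartridge(rec):
    return (rec["task"], rec["input"], rec["output"])

def scicumulus_cartridge(rec):
    return (rec["activation"], rec["consumes"], rec["produces"])

records = [
    (escience_cartridge, {"task": "align", "input": "reads.fq", "output": "aln.bam"}),
    (scicumulus_cartridge, {"activation": "sort", "consumes": "aln.bam", "produces": "sorted.bam"}),
]

# The integrated database: one uniform relation, regardless of origin.
integrated = [cartridge(rec) for cartridge, rec in records]
print(integrated)
# [('align', 'reads.fq', 'aln.bam'), ('sort', 'aln.bam', 'sorted.bam')]
```

Once normalized, cross-system links appear for free: the output of the e-Science Central step is the input of the SciCumulus step.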

Prov Viewer: A Graph-Based Visualization Tool for Interactive Exploration of Provenance Data

The analysis of provenance data for an experiment is often crucial to understand the achieved results. For long-running experiments or when provenance is captured at a low granularity, this analysis process can be overwhelming to the user due to the large volume of provenance data. In this paper we introduce Prov Viewer, a provenance visualization tool that enables users to interactively explore provenance data. Among the visualization and exploratory features, we can cite zooming, filtering, and coloring. Moreover, we use other properties, such as shape and size, to distinguish visual elements. These exploratory features are linked to the provenance semantics to ease the comprehension process. We also introduce collapsing and filtering strategies, allowing exploration and analysis at different levels of granularity. We describe case studies that show how Prov Viewer has been successfully used to explore provenance in different domains, including games and urban data.
Troy Kohwalter, Thiago Oliveira, Juliana Freire, Esteban Clua, Leonardo Murta

Intermediate Notation for Provenance and Workflow Reproducibility

We present a technique to capture retrospective provenance across a number of tools in a statistical software suite. Our goal is to facilitate portability of processes between the tools to enhance usability and to support reproducibility. We describe an intermediate notation to aid runtime capture of provenance and demonstrate conversion to an executable and editable workflow. The notation is amenable to conversion to PROV via a template expansion mechanism. We discuss the impact on our system of recording this intermediate notation in terms of runtime performance and also the benefits it brings.
Danius T. Michaelides, Richard Parker, Chris Charlton, William J. Browne, Luc Moreau

Towards the Domain Agnostic Generation of Natural Language Explanations from Provenance Graphs for Casual Users

As more systems become PROV-enabled, there will be a corresponding increase in the need to communicate provenance data directly to users. Whilst there are a number of existing methods for doing this — formally, diagrammatically, and textually — there are currently no application-generic techniques for generating linguistic explanations of provenance. The principal reason for this is that a certain amount of linguistic information is required to transform a provenance graph — such as in PROV — into a textual explanation, and if this information is not available as an annotation, this transformation is presently not possible.
In this paper, we describe how we have adapted the common ‘consensus’ architecture from the field of natural language generation to achieve this graph transformation, resulting in the novel PROVglish architecture. We then present an approach to garnering the necessary linguistic information from a PROV dataset, which involves exploiting the linguistic information informally encoded in the URIs denoting provenance resources. We finish by detailing an evaluation undertaken to assess the effectiveness of this approach to lexicalisation, demonstrating a significant improvement in terms of fluency, comprehensibility, and grammatical correctness.
Darren P. Richardson, Luc Moreau

Provenance Models and Applications


Versioning Version Trees: The Provenance of Actions that Affect Multiple Versions

Change-based provenance captures how an entity is constructed; it can be used not only as a record of the steps taken but also as a guide during the development of derivative or new analyses. This provenance is captured as a version tree which stores a set of related entities and the exact changes made in deriving one from another. Version trees are generally viewed as monotonic: new nodes may be added, but none are modified or deleted. However, there are a number of operations (e.g., upgrades) where this constraint leads to inefficient and unintuitive new versions. To address this, we propose a version tree without monotonicity where nodes may be modified and new actions inserted. We also propose to track the provenance of these tree changes to ensure that past version trees are not lost. This provenance is change-based; it links versions of version trees by the actions which transform the trees. Thus, we continue to track every change that impacts the evolution of an entity, but the actions are split between direct edits and changes to the version tree that affect multiple entity definitions. We show how this provenance leads to more intuitive and efficient operations on workflows and how this hybrid provenance may be understood.
David Koop

Enabling Web Service Request Citation by Provenance Information

Geoscience Australia (GA) is a government agency that delivers much of its scientific data via web services for government and research use. As a science agency, the expectation is that GA will allow users of its data to cite it as one would cite academic papers, allowing authors of derived works to accurately represent their sources.
We present a methodology for assisting with the citation of web service requests via provenance information recording and delivery. We decompose the representation of a web service request into endurant and occurrent components, attempting to source as much information as possible about the endurant parts as organisations find these easiest to manage. We then collect references to those parts in an endurant ‘bundle’, which we make available for citation.
Our methodology is demonstrated in action within the context of an operational government science agency, GA, that publishes many thousands of datasets with persistent identifiers and many hundreds of web services but has not, until now, provided citable identifiers for web service-generated dynamic data.
Nicholas John Car, Laura S. Stanford, Aaron Sedgmen

Modelling Provenance of Sensor Data for Food Safety Compliance Checking

The Internet of Things (IoT) is resulting in ever greater volumes of low level sensor data. However, such data is meaningless without higher level context that describes why such data is needed and what useful information can be derived from it. Provenance records should play a pivotal role in supporting a range of automated processes acting on the data streams emerging from an IoT-enabled infrastructure. In this paper we discuss how such provenance can be modelled by extending an existing suite of provenance ontologies. Furthermore, we demonstrate how provenance abstractions can be inferred from sensor data annotated using the SSN ontology. A real-world application from food-safety compliance monitoring will be used throughout to illustrate our achievements to date, and the challenges that remain.
Milan Markovic, Peter Edwards, Martin Kollingbaum, Alan Rowe

Modelling Provenance Collection Points and Their Impact on Provenance Graphs

As many domains employ ever more complex systems-of-systems, capturing provenance among component systems is increasingly important. Applications such as intrusion detection, load balancing, traffic routing, and insider threat detection all involve monitoring and analyzing the data provenance. Implicit in these applications is the assumption that “good” provenance is captured (e.g. complete provenance graphs, or one full path). When attempting to provide “good” provenance for a complex system-of-systems, it is necessary to know “how hard” the provenance-enabling will be and the likely quality of the provenance to be produced. In this work, we provide analytical results and simulation tools to assist in the scoping of the provenance-enabling process. We provide use cases of complex systems-of-systems within which users wish to capture provenance. We describe the parameters that must be taken into account when undertaking the provenance-enabling of a system-of-systems. We provide a tool that models the interactions and types of capture agents involved in a complex system-of-systems, including the set of known and unknown systems in the environment. The tool provides an estimate of the quantity and type of capture agents that will need to be deployed for provenance-enablement in a complex system that is not completely known.
David Gammack, Steve Scott, Adriane P. Chapman

System Demonstrations


Yin & Yang: Demonstrating Complementary Provenance from noWorkflow & YesWorkflow

The noWorkflow and YesWorkflow toolkits both enable researchers to capture, store, query, and visualize the provenance of results produced by scripts that process scientific data. noWorkflow captures prospective provenance representing the program structure of Python scripts, and retrospective provenance representing key events observed during script execution. YesWorkflow captures prospective provenance declared through annotations in the comments of scripts, and supports key retrospective provenance queries by observing what files were used or produced by the script. We demonstrate how combining complementary information gathered by noWorkflow and YesWorkflow enables provenance queries and data lineage visualizations neither tool can provide on its own.
João Felipe Pimentel, Saumen Dey, Timothy McPhillips, Khalid Belhajjame, David Koop, Leonardo Murta, Vanessa Braganholo, Bertram Ludäscher

MPO: A System to Document and Analyze Distributed Heterogeneous Workflows

Large scientific experiments and simulations produce vast quantities of data. Though smaller in volume, the corresponding metadata describing the production, pedigree, and ontology is just as important as the raw data to the scientific discovery process. Driven by the application needs of a number of large-scale distributed workflows, we develop a metadata capturing and analysis system called MPO (short for Metadata, Provenance, Ontology). It seamlessly integrates with most data analysis environments and requires a minimal amount of changes to users’ existing analysis programs. Users have full control over how to instrument their programs to capture as much or as little information as they desire. Once captured in a database system, the workflows can be visualized and studied through a set of web-based tools. In large scientific collaborations where the workflows have been built up over decades, this ability to instrument complex existing workflows and visualize the key interactions among the software components is tremendously useful.
Kesheng Wu, Elizabeth N. Coviello, S. M. Flanagan, Martin Greenwald, Xia Lee, Alex Romosan, David P. Schissel, Arie Shoshani, Josh Stillerman, John Wright

Joint IPAW/TaPP Poster Session


PROV-JSONLD: A JSON and Linked Data Representation for Provenance

In this paper, we propose a representation for PROV in JSON-LD, the JSON format for Linked Data, called PROV-JSONLD. As a JSON-based format, this provenance representation can be readily consumed by Web applications that currently support JSON. As a Linked Data format, at the same time, it also represents provenance data in RDF using the PROV ontology. Hence, it is suitable for use in both the Web and the Semantic Web.
Trung Dong Huynh, Danius T. Michaelides, Luc Moreau
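The dual JSON/Linked-Data nature described above rests on JSON-LD's "@context" mechanism: the same document is plain JSON to a Web application and RDF to a Semantic Web tool. The sketch below shows the general shape of such a document; it illustrates JSON-LD's mechanics rather than the specific PROV-JSONLD design in the paper, and the term mappings and identifiers are illustrative (the `prov:` namespace IRI is the real W3C one).

```python
import json

# A JSON document that is also Linked Data: "@context" maps plain JSON
# keys onto terms of the PROV ontology. Entity IDs are invented.
doc = {
    "@context": {
        "prov": "http://www.w3.org/ns/prov#",
        "Entity": "prov:Entity",
        "wasDerivedFrom": {"@id": "prov:wasDerivedFrom", "@type": "@id"},
    },
    "@id": "ex:report",
    "@type": "Entity",
    "wasDerivedFrom": "ex:dataset",
}

# A JSON-only consumer simply reads the keys; an RDF consumer expands
# them via the context into PROV ontology terms.
print(doc["wasDerivedFrom"])  # ex:dataset
```

Note how the consumer that knows nothing about RDF can still serialize, parse, and traverse the document with an ordinary JSON library.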

Provenance as Essential Infrastructure for Data Lakes

The Data Lake is emerging as a Big Data storage and management solution which can store any type of data at scale and execute data transformations for analysis. Higher flexibility in storage increases the risk of Data Lakes becoming data swamps. In this paper we show how provenance contributes to data management within a Data Lake infrastructure. We study provenance integration challenges and propose a reference architecture for provenance usage in a Data Lake. Finally we discuss the applicability of our tools in the proposed architecture.
Isuru Suriarachchi, Beth Plale

Provenance-Based Retrieval: Fostering Reuse and Reproducibility Across Scientific Disciplines

When computational researchers from several domains cooperate, one recurrent problem is finding tools, methods and approaches that can be used across disciplines, to enhance collaboration through reuse. The paper presents our ongoing work to meet the challenges posed by provenance-based retrieval, proposed as a solution for transdisciplinary scientific collaboration via reuse of scientific workflows. Our work is based upon a case study in molecular dynamics experiments, as part of a larger multi-scale experimental scenario.
Lucas Augusto Montalvão Costa Carvalho, Rodrigo L. Silveira, Caroline S. Pereira, Munir S. Skaf, Claudia Bauzer Medeiros

Addressing Scientific Rigor in Data Analytics Using Semantic Workflows

New NIH grants require establishing scientific rigor, i.e. applicants must provide evidence of strict application of the scientific method to ensure robust and unbiased experimental design, methodology, analysis, interpretation and reporting of results. Researchers must transparently report experimental details so others may reproduce and extend findings. Provenance can help accomplish these objectives; analytical workflows can be annotated with sufficient information for peers to understand methods and reproduce the intended results. We aim to produce enhancements to the ontology space including links between existing ontologies, terminology gap analysis and ontology terms to address gaps, and potentially a new ontology aimed at integrating the higher level data analysis planning concepts. We are developing a collection of techniques and tools to enable workflow recipes or plans to be more clearly and consistently shared, improve understanding of all analysis aspects and enable greater reuse and reproduction. We aim to show that semantic workflows can improve scientific rigor in data analysis and to demonstrate their impact in specific research domains.
John S. Erickson, John Sheehan, Kristin P. Bennett, Deborah L. McGuinness

Reconstructing Human-Generated Provenance Through Similarity-Based Clustering

In this paper, we revisit our method for reconstructing the primary sources of documents, which make up an important part of their provenance. Our method is based on the assumption that if two documents are semantically similar, there is a high chance that they also share a common source. We previously evaluated this assumption on an excerpt from a news archive, achieving 68.2% precision and 73% recall when reconstructing the primary sources of all articles. However, since we could not release this dataset to the public, it made our results hard to compare to others. In this work, we extend the flexibility of our method by adding a new parameter, and re-evaluate it on the human-generated dataset created for the 2014 Provenance Reconstruction Challenge. The extended method achieves up to 86% precision and 59% recall, and is now directly comparable to any approach that uses the same dataset.
Tom De Nies, Erik Mannens, Rik Van de Walle

Social Media Data in Research: Provenance Challenges

In this paper we argue that understanding the provenance of social media datasets and their analysis is critical to addressing challenges faced by the social science research community in terms of the reliability and reproducibility of research utilising such data. Based on analysis of existing projects that use social media data, we present a number of research questions for the provenance community, which if addressed would help increase the transparency of the research process, aid reproducibility, and facilitate data reuse in the social sciences.
David Corsar, Milan Markovic, Peter Edwards

Fine-Grained Provenance Collection over Scripts Through Program Slicing

Collecting provenance from scripts is often useful for scientists to explain and reproduce their scientific experiments. However, most existing automatic approaches capture provenance at coarse-grain, for example, the trace of user-defined functions. These approaches lack information of variable dependencies. Without this information, users may struggle to identify which functions really influenced the results, leading to the creation of false-positive provenance links. To address this problem, we propose an approach that uses dynamic program slicing for gathering provenance of Python scripts. By capturing dependencies among variables, it is possible to expose execution paths inside functions and, consequently, to create a provenance graph that accurately represents the function activations and the results they affect.
João Felipe Pimentel, Juliana Freire, Leonardo Murta, Vanessa Braganholo
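The variable-dependency idea above can be illustrated with a toy example. The paper uses dynamic program slicing; the sketch below approximates the same notion statically with Python's `ast` module, extracting which variables each assignment in a tiny (invented) script depends on, which is the raw material for a fine-grained provenance graph.

```python
import ast

# Invented example script whose variable dependencies we extract.
script = """
a = 1
b = a + 2
c = b * a
d = 5
"""

# For each top-level assignment, record the names its value reads.
deps = {}
for node in ast.parse(script).body:
    if isinstance(node, ast.Assign):
        target = node.targets[0].id
        deps[target] = {n.id for n in ast.walk(node.value)
                        if isinstance(n, ast.Name)}

# b depends on a; c depends on both a and b; a and d depend on nothing.
print(deps)
```

A dynamic slicer, as in the paper, would instead observe these dependencies at runtime, so it also resolves dependencies that flow through function calls, loops, and aliasing, which this static sketch cannot see.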

Prov2ONE: An Algorithm for Automatically Constructing ProvONE Provenance Graphs

Provenance traces history within workflows and enables researchers to validate and compare their results. Currently, modelling provenance in ProvONE is an arduous task and lacks an automated approach. This paper introduces a novel algorithm, called Prov2ONE, that automatically generates the ProvONE prospective provenance for scientific workflows defined in BPEL4WS. The same prospective ProvONE graph is updated with the relevant retrospective provenance, preventing provenance from being captured in various non-standard provenance models and thus enabling research communities to share, compare, and analyze workflows and their associated provenance. Finally, using the Prov2ONE algorithm, a ProvONE provenance graph for the nanoscopy workflow is generated.
Ajinkya Prabhune, Aaron Zweig, Rainer Stotzka, Michael Gertz, Juergen Hesser

Implementing Unified Why- and Why-Not Provenance Through Games

Using provenance to explain why a query returns a result or why a result is missing has been studied extensively. However, the two types of questions have been approached independently of each other. We present an efficient technique for answering both types of questions for Datalog queries based on a game-theoretic model of provenance called provenance games. Our approach compiles provenance requests into Datalog and translates the resulting query into SQL to execute it on a relational database backend. We apply several novel optimizations to limit the computation to provenance relevant to a given user question.
Seokki Lee, Sven Köhler, Bertram Ludäscher, Boris Glavic
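The why/why-not distinction above has a simple concrete reading: a "why" question asks which input tuples witness a query answer, and a "why-not" question asks why no witness exists. The paper compiles such questions into Datalog and then SQL; the sketch below instead computes witnesses for a two-hop reachability query directly in Python, on an invented edge relation, just to show what a witness is.

```python
# Invented base relation: hop(x, y) edges.
hop = {("a", "b"), ("b", "c"), ("b", "d")}

def why_2hop(x, z):
    """Witnesses for 2hop(x, z): pairs of hop tuples joining x to z.
    An empty result is the starting point of a why-not question."""
    return [((x, y), (y, z))
            for (x2, y) in sorted(hop) if x2 == x
            for (y2, z2) in sorted(hop) if (y2, z2) == (y, z)]

print(why_2hop("a", "c"))  # [(('a', 'b'), ('b', 'c'))]
print(why_2hop("a", "x"))  # [] -- no witness: a why-not case
```

The provenance-games approach unifies the two cases: the same game tree explains a successful answer through its winning strategies and a missing answer through the moves that fail.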

SisGExp: Rethinking Long-Tail Agronomic Experiments

Reproducibility is a major feature of Science. Even agronomic research of exemplary quality may have irreproducible empirical findings because of random or systematic error. This work presents SisGExp, a provenance-based approach that aids researchers in managing, sharing, and enacting computational scientific workflows that encapsulate legacy R scripts. SisGExp transparently captures the provenance of R scripts and makes experiments reproducible. SisGExp is non-intrusive: it does not require users to change the way they work, and it wraps agronomic experiments as workflows in a scientific workflow system.
Sergio Manuel Serra da Cruz, José Antonio Pires do Nascimento

Towards Provenance Capturing of Quantified Self Data

Quantified Self, or self-tracking, is a growing movement in which people track data about themselves. Tracking the provenance of Quantified Self data is hard because many different devices, apps, and services are usually involved. Nevertheless, insight into how the data has been acquired, how it has been processed, and who has stored and accessed it is crucial for people. We present concepts for tracking provenance in typical Quantified Self workflows. We use a provenance model based on PROV and show its feasibility with an example.
Andreas Schreiber, Doreen Seider

A Review of Guidelines and Models for Representation of Provenance Information from Neuroscience Experiments

To manage raw data from Neuroscience experiments, we have to cope with the heterogeneity of data formats and the complexity of additional metadata, such as provenance information, that need to be collected and stored. Although some progress has already been made toward a common description for Neuroscience experimental data, to the best of our knowledge there is still no widely adopted standard model to describe this kind of data. To help neuroscientists find and use a structured and comprehensive model with robust tracking of data provenance, we present a brief evaluation of guidelines and models for the representation of raw data from Neuroscience experiments, focusing on how they support provenance tracking.
Margarita Ruiz-Olazar, Evandro S. Rocha, Sueli S. Rabaça, Carlos Eduardo Ribas, Amanda S. Nascimento, Kelly R. Braghetto

Tracking and Establishing Provenance of Earth Science Datasets: A NASA-Based Example

Information quality is of paramount importance to science. Accurate, scientifically vetted, statistically meaningful, and, ideally, reproducible information engenders scientific trust and research opportunities. Therefore, so-called Highly Influential Scientific Assessments (HISA) such as the U.S. Third National Climate Assessment (NCA3) undergo a very rigorous process to ensure transparency and credibility. To support the transparency of such reports, the U.S. Global Change Research Program has developed the Global Change Information System (GCIS). Specifically related to the transparency of NCA3, a recent activity traced the provenance as completely as possible for all figures in the NCA3 report that predominantly used NASA data. This paper discusses lessons learned from this activity of tracing the provenance of NASA figures in a major HISA-class PDF report.
Hampapuram K. Ramapriyan, Justin C. Goldstein, Hook Hua, Robert E. Wolfe

DataONE: A Data Federation with Provenance Support

DataONE is a federated data network focusing on earth and environmental science data. We present the provenance and search features of DataONE by means of an example involving three earth scientists who interact through a DataONE Member Node. DataONE provenance systems enable reproducible research and facilitate proper attribution of scientific results transitively across generations of derived data products.
Yang Cao, Christopher Jones, Víctor Cuevas-Vicenttín, Matthew B. Jones, Bertram Ludäscher, Timothy McPhillips, Paolo Missier, Christopher Schwalm, Peter Slaughter, Dave Vieglais, Lauren Walker, Yaxing Wei

