ABSTRACT
Many scientists use workflows to systematically design and run computational experiments. Once a workflow has been executed, the scientist may want to publish the resulting dataset so that, for example, other scientists can reuse it as input to their own experiments. To do so, the scientist needs to curate the dataset by specifying metadata that describes it, e.g., its derivation history, origins, and ownership. To assist the scientist in this task, we explore in this paper the use of provenance traces collected by workflow management systems when enacting workflows. Specifically, we identify the shortcomings of such raw provenance traces in supporting the data publishing task, and propose an approach for deriving distilled, yet more informative, provenance traces that are fit for that task.
Index Terms
- Enhancing and abstracting scientific workflow provenance for data publishing