ABSTRACT
Provenance in the context of workflows, both for the data they derive and for their specification, is essential for result reproducibility, sharing, and knowledge re-use in the scientific community. Several workshops have been held on the topic, and it has been the focus of many research projects and prototype systems. This tutorial provides an overview of research issues in provenance for scientific workflows, with a focus on recent literature and technology in this area. It is aimed at a general database research audience and at people who work with scientific data and workflows. We will (1) provide a general overview of scientific workflows; (2) describe research on provenance for scientific workflows and show in detail how provenance is supported in existing systems; (3) discuss emerging applications that are enabled by provenance; and (4) outline open problems and new directions for database-related research.