DOI: 10.1145/1376616.1376772
demonstration

Provenance and scientific workflows: challenges and opportunities

Published: 09 June 2008

ABSTRACT

Provenance in the context of workflows, both for the data they derive and for their specification, is essential for result reproducibility, sharing, and knowledge re-use in the scientific community. Several workshops have been held on the topic, and it has been the focus of many research projects and prototype systems. This tutorial provides an overview of research issues in provenance for scientific workflows, with a focus on recent literature and technology in this area. It is aimed at a general database research audience and at people who work with scientific data and workflows. We will (1) provide a general overview of scientific workflows; (2) describe research on provenance for scientific workflows and show in detail how provenance is supported in existing systems; (3) discuss emerging applications that are enabled by provenance; and (4) outline open problems and new directions for database-related research.

Supplemental Material

p1345-freire-complete_56k.mov (mov, 107.4 MB)

p1345-freire-complete_768k.mov (mov, 900.8 MB)

      Published in

      SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data
      June 2008
      1396 pages
      ISBN: 9781605581026
      DOI: 10.1145/1376616

      Copyright © 2008 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 785 of 4,003 submissions, 20%
