ABSTRACT
Data provenance is essential in applications such as scientific computing, curated databases, and data warehouses. Several systems have been developed that provide provenance functionality for the relational data model. These systems support only a subset of SQL, a severe limitation in practice since most of the application domains that benefit from provenance information use complex queries. Such queries typically involve nested subqueries, aggregation, and/or user-defined functions. Without support for these constructs, a provenance management system is of limited use.
In this paper we address this limitation by exploring the problem of provenance derivation when complex queries are involved. More precisely, we demonstrate that the widely used definition of Why-provenance fails in the presence of nested subqueries, and show how the definition can be modified to produce meaningful results for nested subqueries. We further present query rewrite rules to transform an SQL query into a query propagating provenance. The solution introduced in this paper allows us to track provenance information for a far wider subset of SQL than any of the existing approaches. We have incorporated these ideas into the Perm provenance management system engine and used it to evaluate the feasibility and performance of our approach.
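As a rough illustration of what "a query propagating provenance" can look like, the sketch below hand-rewrites one simple query with an uncorrelated IN subquery so that each result row also carries the input tuples that contributed to it. It is not the paper's rewrite rules. The tables `emp` and `dept`, the `prov_` column prefix, and the rewritten query itself are assumptions made for illustration, and the join-based rewrite shown is only valid for this simple positive IN case.

```python
import sqlite3

# Hypothetical toy schema and data, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept(id INTEGER, city TEXT);
    CREATE TABLE emp(name TEXT, dept INTEGER);
    INSERT INTO dept VALUES (1, 'Zurich'), (2, 'Basel');
    INSERT INTO emp VALUES ('Ana', 1), ('Ben', 2), ('Cy', 1);
""")

# Original query: employees whose department is located in Zurich.
ORIGINAL = """
    SELECT name FROM emp
    WHERE dept IN (SELECT id FROM dept WHERE city = 'Zurich')
    ORDER BY name
"""

# Hand-written rewrite (an assumption, not generated by any system):
# the subquery becomes a join, and the contributing emp and dept tuples
# are appended to each result row as extra prov_* columns.
REWRITTEN = """
    SELECT e.name,
           e.name AS prov_emp_name, e.dept AS prov_emp_dept,
           d.id   AS prov_dept_id,  d.city AS prov_dept_city
    FROM emp AS e
    JOIN dept AS d ON e.dept = d.id AND d.city = 'Zurich'
    ORDER BY e.name, d.id
"""

original_rows = conn.execute(ORIGINAL).fetchall()
provenance_rows = conn.execute(REWRITTEN).fetchall()

for row in provenance_rows:
    print(row)
```

Each row of the rewritten query repeats an original result value and adds one combination of contributing input tuples. If several `dept` tuples matched the same employee, that employee would appear once per matching tuple, which is why the rewritten result can have more rows than the original.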