Abstract
Highly expressive declarative languages, such as datalog, are now commonly used to model the operational logic of data-intensive applications. The typical complexity of such datalog programs, and the large volume of data that they process, call for result explanation. Results may be explained through the tracking and presentation of data provenance, and here we focus on a detailed form of provenance (how-provenance), defining it as the set of derivation trees of a given fact. While informative, the size of such full provenance information is typically too large and complex (even when compactly represented) to allow displaying it to the user. To this end, we propose a novel top-k query language for querying datalog provenance, supporting selection criteria based on tree patterns and ranking based on the rules and database facts used in derivation. We propose an efficient novel algorithm based on (1) instrumenting the datalog program so that, upon evaluation, it generates only relevant provenance, and (2) efficient top-k (relevant) provenance generation, combined with bottom-up datalog evaluation. The algorithm computes in polynomial data complexity a compact representation of the top-k trees which may be explicitly constructed in linear time with respect to their size. We further experimentally study the algorithm performance, showing its scalability even for complex datalog programs where full provenance tracking is infeasible.
- S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995. Google Scholar
- A. Ailamaki, Y. E. Ioannidis, and M. Livny. Scientific workflow management by database management. In SSDBM, pages 190--199, 1998. Google Scholar
- T. Arora et al. Explaining program execution in deductive systems. In DOOD, pages 101--119, 1993.Google Scholar
- Z. Bao, S. B. Davidson, and T. Milo. Labeling recursive workflow executions on-the-fly. In SIGMOD, pages 493--504, 2011. Google Scholar
- Z. Bao et al. Efficient provenance storage for relational queries. CIKM, pages 1352--1361, 2012. Google Scholar
- O. Benjelloun et al. Databases with uncertainty and lineage. VLDB J., 17: 243--264, 2008. Google Scholar
- P. Buneman, J. Cheney, and S. Vansummeren. On the expressiveness of implicit provenance in query and update languages. ACM Trans. Database Syst., 33(4), 2008. Google Scholar
- L. Chang, J. X. Yu, and L. Qin. Query ranking in probabilistic XML data. In EDBT, pages 156--167, 2009. Google Scholar
- A. P. Chapman, H. V. Jagadish, and P. Ramanan. Efficient provenance storage. In SIGMOD, pages 993--1006, 2008. Google Scholar
- J. Cheney, A. Ahmed, and U. A. Acar. Database queries that explain their work. CoRR, abs/1408.1675, 2014.Google Scholar
- J. Cheney, L. Chiticariu, and W. C. Tan. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379--474, 2009. Google Scholar
- S. Cohen and B. Kimelfeld. Querying parse trees of stochastic context-free grammars. In ICDT, pages 62--75, 2010. Google Scholar
- D. Cohn and R. Hull. Business artifacts: A data-centric approach to modeling business operations and processes. IEEE Data Eng. Bull., 32(3):3--9, 2009.Google Scholar
- S. B. Davidson and J. Freire. Provenance and scientific workflows: challenges and opportunities. In SIGMOD, pages 1345--1350, 2008. Google Scholar
- D. Deutch, A. Gilad, and Y. Moskovitch. selp: Selective tracking and presentation of data provenance. In ICDE, pages 1484--1487, 2015.Google Scholar
- D. Deutch, C. Koch, and T. Milo. On probabilistic fixpoint and markov chain query languages. In PODS, pages 215--226, 2010. Google Scholar
- D. Deutch, T. Milo, S. Roy, and V. Tannen. Circuits for datalog provenance. In ICDT, pages 201--212, 2014.Google Scholar
- D. Eppstein. Finding the k shortest paths. SIAM J. Comput., 28(2):652--673, 1998. Google Scholar
- R. Fink, L. Han, and D. Olteanu. Aggregation in probabilistic databases via knowledge compilation. PVLDB, 5(5):490--501, 2012. Google Scholar
- I. Foster, J. Vockler, M. Wilde, and A. Zhao. Chimera: A virtual data system for representing, querying, and automating data derivation. SSDBM, pages 37--46, 2002. Google Scholar
- N. Fuhr. Probabilistic datalog: a logic for powerful retrieval methods. In SIGIR, pages 282--290, 1995. Google Scholar
- http://www.cs.tau.ac.il/~danielde/selpFull.pdf.Google Scholar
- L. A. Galárraga, C. Teflioudi, K. Hose, and F. M. Suchanek. Amie: association rule mining under incomplete evidence in ontological knowledge bases. In WWW, pages 413--422, 2013. Google Scholar
- F. Geerts and A. Poggi. On database query languages for k-relations. J. Applied Logic, 8(2):173--185, 2010.Google Scholar
- B. Glavic and G. Alonso. Perm: Processing provenance and data on the same data model through query rewriting. In ICDE, pages 174--185, 2009. Google Scholar
- B. Glavic, G. Alonso, R. J. Miller, and L. M. Haas. TRAMP: understanding the behavior of schema mappings through provenance. PVLDB, 3(1):1314--1325, 2010. Google Scholar
- B. Glavic, R. J. Miller, and G. Alonso. Using sql for efficient generation and querying of provenance information. In In Search of Elegance in the Theory and Practice of Computation, pages 291--320. Springer, 2013.Google Scholar
- B. Glavic, J. Siddique, P. Andritsos, and R. J. Miller. Provenance for data mining. In Tapp, 2013. Google Scholar
- T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, 2007. Google Scholar
- D. Hull et al. Taverna: a tool for building and running workflows of services. Nucleic Acids Res., 34: 729--732, 2006.Google Scholar
- http://www.iris-reasoner.org.Google Scholar
- Z. G. Ives, A. Haeberlen, T. Feng, and W. Gatterbauer. Querying provenance for ranking and recommending. In TaPP, 2012. Google Scholar
- A. K. Jha and D. Suciu. Probabilistic databases with markoviews. PVLDB, 5(11):1160--1171, 2012. Google Scholar
- G. Karvounarakis, Z. G. Ives, and V. Tannen. Querying data provenance. In SIGMOD, pages 951--962, 2010. Google Scholar
- B. Kenig, A. Gal, and O. Strichman. A new class of lineage expressions over probabilistic databases computable in p-time. In SUM, pages 219--232, 2013.Google Scholar
- B. Kimelfeld, Y. Kosharovsky, and Y. Sagiv. Query evaluation over probabilistic XML. VLDB J., 18(5):1117--1140, 2009. Google Scholar
- B. Kimelfeld and Y. Sagiv. Matching twigs in probabilistic XML. In VLDB, pages 27--38, 2007. Google Scholar
- D. E. Knuth. A generalization of dijkstra's algorithm. Inf. Process. Lett., 6(1):1--5, 1977.Google Scholar
- J. Li, C. Liu, R. Zhou, and W. Wang. Top-k keyword search over probabilistic XML data. In ICDE, pages 673--684, 2011. Google Scholar
- B. T. Loo et al. Declarative networking: language, execution and optimization. In SIGMOD, pages 97--108, 2006. Google Scholar
- A. Meliou, W. Gatterbauer, and D. Suciu. Reverse data management. PVLDB, 4(12):1490--1493, 2011.Google Scholar
- A. Meliou and D. Suciu. Tiresias: the database oracle for how-to queries. In SIGMOD, pages 337--348, 2012. Google Scholar
- P. Missier, N. W. Paton, and K. Belhajjame. Fine-grained and efficient lineage querying of collection-based workflow provenance. In EDBT, pages 299--310, 2010. Google Scholar
- B. Ning, C. Liu, and J. X. Yu. Efficient processing of top-k twig queries over probabilistic XML data. World Wide Web, 16(3):299--323, 2013. Google Scholar
- F. Niu, C. Zhang, C. Re, and J. W. Shavlik. Deepdive: Web-scale knowledge-base construction using statistical learning and inference. In VLDS, pages 25--28, 2012.Google Scholar
- D. Olteanu and J. Zavodny. Factorised representations of query results: size bounds and readability. In ICDT, pages 285--298, 2012. Google Scholar
- R. Perera, U. A. Acar, J. Cheney, and P. B. Levy. Functional programs that explain their work. In SIGPLAN, pages 365--376, 2012. Google Scholar
- M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1-2):107--136, 2006. Google Scholar
- R. Ronen and O. Shmueli. Automated interaction in social networks with datalog. In CIKM, pages 1273--1276, 2010. Google Scholar
- S. Roy and D. Suciu. A formal approach to finding explanations for database queries. In SIGMOD, pages 1579--1590, 2014. Google Scholar
- O. Shmueli and S. Tsur. Logical diagnosis of LDL programs. New Generation Comput., 9(3/4):277--304, 1991.Google Scholar
- Y. L. Simmhan, B. Plale, and D. Gannon. Karma2: Provenance management for data-driven workflows. Int. J. Web Service Res., 5(2):1--22, 2008.Google Scholar
- F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In WWW, pages 697--706, 2007. Google Scholar
- D. Suciu, D. Olteanu, C. Ré, and C. Koch. Probabilistic Databases. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2011. Google Scholar
- Prov-overview, w3c working group note. http://www.w3.org/TR/prov-overview/, 2013.Google Scholar
Index Terms
- Selective provenance for datalog programs using top-k queries
Recommendations
Efficient provenance tracking for datalog using top-k queries
Highly expressive declarative languages, such as datalog, are now commonly used to model the operational logic of data-intensive applications. The typical complexity of such datalog programs, and the large volume of data that they process, call for ...
The Complexity of Why-Provenance for Datalog Queries
PACMMOD (PODS)Datalog is a powerful rule-based language that allows us to express complex recursive queries and has found numerous applications over the years. Explaining why a result to a Datalog query is obtained is an essential task towards explainable and ...
Answering top-k queries using views
VLDB '06: Proceedings of the 32nd international conference on Very large data basesThe problem of obtaining efficient answers to top-k queries has attracted a lot of research attention. Several algorithms and numerous variants of the top-k retrieval problem have been introduced in recent years. The general form of this problem ...
Comments