research-article

Selective provenance for datalog programs using top-k queries

Authors:
Daniel Deutch

Tel Aviv University

Tel Aviv University
View Profile

,
Amir Gilad

Tel Aviv University

Tel Aviv University
View Profile

,
Yuval Moskovitch

Tel Aviv University

Tel Aviv University
View Profile

Proceedings of the VLDB Endowment Volume 8 Issue 12pp 1394–1405https://doi.org/10.14778/2824032.2824039

Published:01 August 2015Publication History

Proceedings of the VLDB Endowment

Abstract

Highly expressive declarative languages, such as datalog, are now commonly used to model the operational logic of data-intensive applications. The typical complexity of such datalog programs, and the large volume of data that they process, call for result explanation. Results may be explained through the tracking and presentation of data provenance, and here we focus on a detailed form of provenance (how-provenance), defining it as the set of derivation trees of a given fact. While informative, the size of such full provenance information is typically too large and complex (even when compactly represented) to allow displaying it to the user. To this end, we propose a novel top-k query language for querying datalog provenance, supporting selection criteria based on tree patterns and ranking based on the rules and database facts used in derivation. We propose an efficient novel algorithm based on (1) instrumenting the datalog program so that, upon evaluation, it generates only relevant provenance, and (2) efficient top-k (relevant) provenance generation, combined with bottom-up datalog evaluation. The algorithm computes in polynomial data complexity a compact representation of the top-k trees which may be explicitly constructed in linear time with respect to their size. We further experimentally study the algorithm performance, showing its scalability even for complex datalog programs where full provenance tracking is infeasible.

References

S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995. Google Scholar
A. Ailamaki, Y. E. Ioannidis, and M. Livny. Scientific workflow management by database management. In SSDBM, pages 190--199, 1998. Google Scholar
T. Arora et al. Explaining program execution in deductive systems. In DOOD, pages 101--119, 1993.Google Scholar
Z. Bao, S. B. Davidson, and T. Milo. Labeling recursive workflow executions on-the-fly. In SIGMOD, pages 493--504, 2011. Google Scholar
Z. Bao et al. Efficient provenance storage for relational queries. CIKM, pages 1352--1361, 2012. Google Scholar
O. Benjelloun et al. Databases with uncertainty and lineage. VLDB J., 17: 243--264, 2008. Google Scholar
P. Buneman, J. Cheney, and S. Vansummeren. On the expressiveness of implicit provenance in query and update languages. ACM Trans. Database Syst., 33(4), 2008. Google Scholar
L. Chang, J. X. Yu, and L. Qin. Query ranking in probabilistic XML data. In EDBT, pages 156--167, 2009. Google Scholar
A. P. Chapman, H. V. Jagadish, and P. Ramanan. Efficient provenance storage. In SIGMOD, pages 993--1006, 2008. Google Scholar
J. Cheney, A. Ahmed, and U. A. Acar. Database queries that explain their work. CoRR, abs/1408.1675, 2014.Google Scholar
J. Cheney, L. Chiticariu, and W. C. Tan. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379--474, 2009. Google Scholar
S. Cohen and B. Kimelfeld. Querying parse trees of stochastic context-free grammars. In ICDT, pages 62--75, 2010. Google Scholar
D. Cohn and R. Hull. Business artifacts: A data-centric approach to modeling business operations and processes. IEEE Data Eng. Bull., 32(3):3--9, 2009.Google Scholar
S. B. Davidson and J. Freire. Provenance and scientific workflows: challenges and opportunities. In SIGMOD, pages 1345--1350, 2008. Google Scholar
D. Deutch, A. Gilad, and Y. Moskovitch. selp: Selective tracking and presentation of data provenance. In ICDE, pages 1484--1487, 2015.Google Scholar
D. Deutch, C. Koch, and T. Milo. On probabilistic fixpoint and markov chain query languages. In PODS, pages 215--226, 2010. Google Scholar
D. Deutch, T. Milo, S. Roy, and V. Tannen. Circuits for datalog provenance. In ICDT, pages 201--212, 2014.Google Scholar
D. Eppstein. Finding the k shortest paths. SIAM J. Comput., 28(2):652--673, 1998. Google Scholar
R. Fink, L. Han, and D. Olteanu. Aggregation in probabilistic databases via knowledge compilation. PVLDB, 5(5):490--501, 2012. Google Scholar
I. Foster, J. Vockler, M. Wilde, and A. Zhao. Chimera: A virtual data system for representing, querying, and automating data derivation. SSDBM, pages 37--46, 2002. Google Scholar
N. Fuhr. Probabilistic datalog: a logic for powerful retrieval methods. In SIGIR, pages 282--290, 1995. Google Scholar
http://www.cs.tau.ac.il/~danielde/selpFull.pdf.Google Scholar
L. A. Galárraga, C. Teflioudi, K. Hose, and F. M. Suchanek. Amie: association rule mining under incomplete evidence in ontological knowledge bases. In WWW, pages 413--422, 2013. Google Scholar
F. Geerts and A. Poggi. On database query languages for k-relations. J. Applied Logic, 8(2):173--185, 2010.Google Scholar
B. Glavic and G. Alonso. Perm: Processing provenance and data on the same data model through query rewriting. In ICDE, pages 174--185, 2009. Google Scholar
B. Glavic, G. Alonso, R. J. Miller, and L. M. Haas. TRAMP: understanding the behavior of schema mappings through provenance. PVLDB, 3(1):1314--1325, 2010. Google Scholar
B. Glavic, R. J. Miller, and G. Alonso. Using sql for efficient generation and querying of provenance information. In In Search of Elegance in the Theory and Practice of Computation, pages 291--320. Springer, 2013.Google Scholar
B. Glavic, J. Siddique, P. Andritsos, and R. J. Miller. Provenance for data mining. In Tapp, 2013. Google Scholar
T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, 2007. Google Scholar
D. Hull et al. Taverna: a tool for building and running workflows of services. Nucleic Acids Res., 34: 729--732, 2006.Google Scholar
http://www.iris-reasoner.org.Google Scholar
Z. G. Ives, A. Haeberlen, T. Feng, and W. Gatterbauer. Querying provenance for ranking and recommending. In TaPP, 2012. Google Scholar
A. K. Jha and D. Suciu. Probabilistic databases with markoviews. PVLDB, 5(11):1160--1171, 2012. Google Scholar
G. Karvounarakis, Z. G. Ives, and V. Tannen. Querying data provenance. In SIGMOD, pages 951--962, 2010. Google Scholar
B. Kenig, A. Gal, and O. Strichman. A new class of lineage expressions over probabilistic databases computable in p-time. In SUM, pages 219--232, 2013.Google Scholar
B. Kimelfeld, Y. Kosharovsky, and Y. Sagiv. Query evaluation over probabilistic XML. VLDB J., 18(5):1117--1140, 2009. Google Scholar
B. Kimelfeld and Y. Sagiv. Matching twigs in probabilistic XML. In VLDB, pages 27--38, 2007. Google Scholar
D. E. Knuth. A generalization of dijkstra's algorithm. Inf. Process. Lett., 6(1):1--5, 1977.Google Scholar
J. Li, C. Liu, R. Zhou, and W. Wang. Top-k keyword search over probabilistic XML data. In ICDE, pages 673--684, 2011. Google Scholar
B. T. Loo et al. Declarative networking: language, execution and optimization. In SIGMOD, pages 97--108, 2006. Google Scholar
A. Meliou, W. Gatterbauer, and D. Suciu. Reverse data management. PVLDB, 4(12):1490--1493, 2011.Google Scholar
A. Meliou and D. Suciu. Tiresias: the database oracle for how-to queries. In SIGMOD, pages 337--348, 2012. Google Scholar
P. Missier, N. W. Paton, and K. Belhajjame. Fine-grained and efficient lineage querying of collection-based workflow provenance. In EDBT, pages 299--310, 2010. Google Scholar
B. Ning, C. Liu, and J. X. Yu. Efficient processing of top-k twig queries over probabilistic XML data. World Wide Web, 16(3):299--323, 2013. Google Scholar
F. Niu, C. Zhang, C. Re, and J. W. Shavlik. Deepdive: Web-scale knowledge-base construction using statistical learning and inference. In VLDS, pages 25--28, 2012.Google Scholar
D. Olteanu and J. Zavodny. Factorised representations of query results: size bounds and readability. In ICDT, pages 285--298, 2012. Google Scholar
R. Perera, U. A. Acar, J. Cheney, and P. B. Levy. Functional programs that explain their work. In SIGPLAN, pages 365--376, 2012. Google Scholar
M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1-2):107--136, 2006. Google Scholar
R. Ronen and O. Shmueli. Automated interaction in social networks with datalog. In CIKM, pages 1273--1276, 2010. Google Scholar
S. Roy and D. Suciu. A formal approach to finding explanations for database queries. In SIGMOD, pages 1579--1590, 2014. Google Scholar
O. Shmueli and S. Tsur. Logical diagnosis of LDL programs. New Generation Comput., 9(3/4):277--304, 1991.Google Scholar
Y. L. Simmhan, B. Plale, and D. Gannon. Karma2: Provenance management for data-driven workflows. Int. J. Web Service Res., 5(2):1--22, 2008.Google Scholar
F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In WWW, pages 697--706, 2007. Google Scholar
D. Suciu, D. Olteanu, C. Ré, and C. Koch. Probabilistic Databases. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2011. Google Scholar
Prov-overview, w3c working group note. http://www.w3.org/TR/prov-overview/, 2013.Google Scholar

Index Terms

Selective provenance for datalog programs using top-k queries
1. Information systems
  1. Data management systems
    1. Database management system engines

Index terms have been assigned to the content through auto-classification.

Recommendations

Efficient provenance tracking for datalog using top-k queries

Highly expressive declarative languages, such as datalog, are now commonly used to model the operational logic of data-intensive applications. The typical complexity of such datalog programs, and the large volume of data that they process, call for ...
Read More
The Complexity of Why-Provenance for Datalog Queries
PACMMOD (PODS)

Datalog is a powerful rule-based language that allows us to express complex recursive queries and has found numerous applications over the years. Explaining why a result to a Datalog query is obtained is an essential task towards explainable and ...
Read More
Answering top-k queries using views
VLDB '06: Proceedings of the 32nd international conference on Very large data bases

The problem of obtaining efficient answers to top-k queries has attracted a lot of research attention. Several algorithms and numerous variants of the top-k retrieval problem have been introduced in recent years. The general form of this problem ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

Proceedings of the VLDB Endowment Volume 8, Issue 12
Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
August 2015
728 pages
ISSN:2150-8097
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 August 2015
Published in pvldb Volume 8, Issue 12
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 20
  Total Citations
  View Citations
- 114
  Total Downloads
- Downloads (Last 12 months)14
- Downloads (Last 6 weeks)4
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Selective provenance for datalog programs using top-k queries

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Efficient provenance tracking for datalog using top-k queries

The Complexity of Why-Provenance for Datalog Queries

Answering top-k queries using views

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Selective provenance for datalog programs using top-k queries

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Efficient provenance tracking for datalog using top-k queries

The Complexity of Why-Provenance for Datalog Queries

Answering top-k queries using views

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media