skip to main content
research-article

Selective provenance for datalog programs using top-k queries

Published:01 August 2015Publication History
Skip Abstract Section

Abstract

Highly expressive declarative languages, such as datalog, are now commonly used to model the operational logic of data-intensive applications. The typical complexity of such datalog programs, and the large volume of data that they process, call for result explanation. Results may be explained through the tracking and presentation of data provenance, and here we focus on a detailed form of provenance (how-provenance), defining it as the set of derivation trees of a given fact. While informative, the size of such full provenance information is typically too large and complex (even when compactly represented) to allow displaying it to the user. To this end, we propose a novel top-k query language for querying datalog provenance, supporting selection criteria based on tree patterns and ranking based on the rules and database facts used in derivation. We propose an efficient novel algorithm based on (1) instrumenting the datalog program so that, upon evaluation, it generates only relevant provenance, and (2) efficient top-k (relevant) provenance generation, combined with bottom-up datalog evaluation. The algorithm computes in polynomial data complexity a compact representation of the top-k trees which may be explicitly constructed in linear time with respect to their size. We further experimentally study the algorithm performance, showing its scalability even for complex datalog programs where full provenance tracking is infeasible.

References

  1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995. Google ScholarGoogle Scholar
  2. A. Ailamaki, Y. E. Ioannidis, and M. Livny. Scientific workflow management by database management. In SSDBM, pages 190--199, 1998. Google ScholarGoogle Scholar
  3. T. Arora et al. Explaining program execution in deductive systems. In DOOD, pages 101--119, 1993.Google ScholarGoogle Scholar
  4. Z. Bao, S. B. Davidson, and T. Milo. Labeling recursive workflow executions on-the-fly. In SIGMOD, pages 493--504, 2011. Google ScholarGoogle Scholar
  5. Z. Bao et al. Efficient provenance storage for relational queries. CIKM, pages 1352--1361, 2012. Google ScholarGoogle Scholar
  6. O. Benjelloun et al. Databases with uncertainty and lineage. VLDB J., 17: 243--264, 2008. Google ScholarGoogle Scholar
  7. P. Buneman, J. Cheney, and S. Vansummeren. On the expressiveness of implicit provenance in query and update languages. ACM Trans. Database Syst., 33(4), 2008. Google ScholarGoogle Scholar
  8. L. Chang, J. X. Yu, and L. Qin. Query ranking in probabilistic XML data. In EDBT, pages 156--167, 2009. Google ScholarGoogle Scholar
  9. A. P. Chapman, H. V. Jagadish, and P. Ramanan. Efficient provenance storage. In SIGMOD, pages 993--1006, 2008. Google ScholarGoogle Scholar
  10. J. Cheney, A. Ahmed, and U. A. Acar. Database queries that explain their work. CoRR, abs/1408.1675, 2014.Google ScholarGoogle Scholar
  11. J. Cheney, L. Chiticariu, and W. C. Tan. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379--474, 2009. Google ScholarGoogle Scholar
  12. S. Cohen and B. Kimelfeld. Querying parse trees of stochastic context-free grammars. In ICDT, pages 62--75, 2010. Google ScholarGoogle Scholar
  13. D. Cohn and R. Hull. Business artifacts: A data-centric approach to modeling business operations and processes. IEEE Data Eng. Bull., 32(3):3--9, 2009.Google ScholarGoogle Scholar
  14. S. B. Davidson and J. Freire. Provenance and scientific workflows: challenges and opportunities. In SIGMOD, pages 1345--1350, 2008. Google ScholarGoogle Scholar
  15. D. Deutch, A. Gilad, and Y. Moskovitch. selp: Selective tracking and presentation of data provenance. In ICDE, pages 1484--1487, 2015.Google ScholarGoogle Scholar
  16. D. Deutch, C. Koch, and T. Milo. On probabilistic fixpoint and markov chain query languages. In PODS, pages 215--226, 2010. Google ScholarGoogle Scholar
  17. D. Deutch, T. Milo, S. Roy, and V. Tannen. Circuits for datalog provenance. In ICDT, pages 201--212, 2014.Google ScholarGoogle Scholar
  18. D. Eppstein. Finding the k shortest paths. SIAM J. Comput., 28(2):652--673, 1998. Google ScholarGoogle Scholar
  19. R. Fink, L. Han, and D. Olteanu. Aggregation in probabilistic databases via knowledge compilation. PVLDB, 5(5):490--501, 2012. Google ScholarGoogle Scholar
  20. I. Foster, J. Vockler, M. Wilde, and A. Zhao. Chimera: A virtual data system for representing, querying, and automating data derivation. SSDBM, pages 37--46, 2002. Google ScholarGoogle Scholar
  21. N. Fuhr. Probabilistic datalog: a logic for powerful retrieval methods. In SIGIR, pages 282--290, 1995. Google ScholarGoogle Scholar
  22. http://www.cs.tau.ac.il/~danielde/selpFull.pdf.Google ScholarGoogle Scholar
  23. L. A. Galárraga, C. Teflioudi, K. Hose, and F. M. Suchanek. Amie: association rule mining under incomplete evidence in ontological knowledge bases. In WWW, pages 413--422, 2013. Google ScholarGoogle Scholar
  24. F. Geerts and A. Poggi. On database query languages for k-relations. J. Applied Logic, 8(2):173--185, 2010.Google ScholarGoogle Scholar
  25. B. Glavic and G. Alonso. Perm: Processing provenance and data on the same data model through query rewriting. In ICDE, pages 174--185, 2009. Google ScholarGoogle Scholar
  26. B. Glavic, G. Alonso, R. J. Miller, and L. M. Haas. TRAMP: understanding the behavior of schema mappings through provenance. PVLDB, 3(1):1314--1325, 2010. Google ScholarGoogle Scholar
  27. B. Glavic, R. J. Miller, and G. Alonso. Using sql for efficient generation and querying of provenance information. In In Search of Elegance in the Theory and Practice of Computation, pages 291--320. Springer, 2013.Google ScholarGoogle Scholar
  28. B. Glavic, J. Siddique, P. Andritsos, and R. J. Miller. Provenance for data mining. In Tapp, 2013. Google ScholarGoogle Scholar
  29. T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, 2007. Google ScholarGoogle Scholar
  30. D. Hull et al. Taverna: a tool for building and running workflows of services. Nucleic Acids Res., 34: 729--732, 2006.Google ScholarGoogle Scholar
  31. http://www.iris-reasoner.org.Google ScholarGoogle Scholar
  32. Z. G. Ives, A. Haeberlen, T. Feng, and W. Gatterbauer. Querying provenance for ranking and recommending. In TaPP, 2012. Google ScholarGoogle Scholar
  33. A. K. Jha and D. Suciu. Probabilistic databases with markoviews. PVLDB, 5(11):1160--1171, 2012. Google ScholarGoogle Scholar
  34. G. Karvounarakis, Z. G. Ives, and V. Tannen. Querying data provenance. In SIGMOD, pages 951--962, 2010. Google ScholarGoogle Scholar
  35. B. Kenig, A. Gal, and O. Strichman. A new class of lineage expressions over probabilistic databases computable in p-time. In SUM, pages 219--232, 2013.Google ScholarGoogle Scholar
  36. B. Kimelfeld, Y. Kosharovsky, and Y. Sagiv. Query evaluation over probabilistic XML. VLDB J., 18(5):1117--1140, 2009. Google ScholarGoogle Scholar
  37. B. Kimelfeld and Y. Sagiv. Matching twigs in probabilistic XML. In VLDB, pages 27--38, 2007. Google ScholarGoogle Scholar
  38. D. E. Knuth. A generalization of dijkstra's algorithm. Inf. Process. Lett., 6(1):1--5, 1977.Google ScholarGoogle Scholar
  39. J. Li, C. Liu, R. Zhou, and W. Wang. Top-k keyword search over probabilistic XML data. In ICDE, pages 673--684, 2011. Google ScholarGoogle Scholar
  40. B. T. Loo et al. Declarative networking: language, execution and optimization. In SIGMOD, pages 97--108, 2006. Google ScholarGoogle Scholar
  41. A. Meliou, W. Gatterbauer, and D. Suciu. Reverse data management. PVLDB, 4(12):1490--1493, 2011.Google ScholarGoogle Scholar
  42. A. Meliou and D. Suciu. Tiresias: the database oracle for how-to queries. In SIGMOD, pages 337--348, 2012. Google ScholarGoogle Scholar
  43. P. Missier, N. W. Paton, and K. Belhajjame. Fine-grained and efficient lineage querying of collection-based workflow provenance. In EDBT, pages 299--310, 2010. Google ScholarGoogle Scholar
  44. B. Ning, C. Liu, and J. X. Yu. Efficient processing of top-k twig queries over probabilistic XML data. World Wide Web, 16(3):299--323, 2013. Google ScholarGoogle Scholar
  45. F. Niu, C. Zhang, C. Re, and J. W. Shavlik. Deepdive: Web-scale knowledge-base construction using statistical learning and inference. In VLDS, pages 25--28, 2012.Google ScholarGoogle Scholar
  46. D. Olteanu and J. Zavodny. Factorised representations of query results: size bounds and readability. In ICDT, pages 285--298, 2012. Google ScholarGoogle Scholar
  47. R. Perera, U. A. Acar, J. Cheney, and P. B. Levy. Functional programs that explain their work. In SIGPLAN, pages 365--376, 2012. Google ScholarGoogle Scholar
  48. M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1-2):107--136, 2006. Google ScholarGoogle Scholar
  49. R. Ronen and O. Shmueli. Automated interaction in social networks with datalog. In CIKM, pages 1273--1276, 2010. Google ScholarGoogle Scholar
  50. S. Roy and D. Suciu. A formal approach to finding explanations for database queries. In SIGMOD, pages 1579--1590, 2014. Google ScholarGoogle Scholar
  51. O. Shmueli and S. Tsur. Logical diagnosis of LDL programs. New Generation Comput., 9(3/4):277--304, 1991.Google ScholarGoogle Scholar
  52. Y. L. Simmhan, B. Plale, and D. Gannon. Karma2: Provenance management for data-driven workflows. Int. J. Web Service Res., 5(2):1--22, 2008.Google ScholarGoogle Scholar
  53. F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In WWW, pages 697--706, 2007. Google ScholarGoogle Scholar
  54. D. Suciu, D. Olteanu, C. Ré, and C. Koch. Probabilistic Databases. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2011. Google ScholarGoogle Scholar
  55. Prov-overview, w3c working group note. http://www.w3.org/TR/prov-overview/, 2013.Google ScholarGoogle Scholar

Index Terms

  1. Selective provenance for datalog programs using top-k queries
    Index terms have been assigned to the content through auto-classification.

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image Proceedings of the VLDB Endowment
      Proceedings of the VLDB Endowment  Volume 8, Issue 12
      Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
      August 2015
      728 pages

      Publisher

      VLDB Endowment

      Publication History

      • Published: 1 August 2015
      Published in pvldb Volume 8, Issue 12

      Qualifiers

      • research-article

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader