skip to main content
research-article

Exploiting shared correlations in probabilistic databases

Published:01 August 2008Publication History
Skip Abstract Section

Abstract

There has been a recent surge in work in probabilistic databases, propelled in large part by the huge increase in noisy data sources --- from sensor data, experimental data, data from uncurated sources, and many others. There is a growing need for database management systems that can efficiently represent and query such data. In this work, we show how data characteristics can be leveraged to make the query evaluation process more efficient. In particular, we exploit what we refer to as shared correlations where the same uncertainties and correlations occur repeatedly in the data. Shared correlations occur mainly due to two reasons: (1) Uncertainty and correlations usually come from general statistics and rarely vary on a tuple-to-tuple basis; (2) The query evaluation procedure itself tends to re-introduce the same correlations. Prior work has shown that the query evaluation problem on probabilistic databases is equivalent to a probabilistic inference problem on an appropriately constructed probabilistic graphical model (PGM). We leverage this by introducing a new data structure, called the random variable elimination graph (rv-elim graph) that can be built from the PGM obtained from query evaluation. We develop techniques based on bisimulation that can be used to compress the rv-elim graph exploiting the presence of shared correlations in the PGM, the compressed rv-elim graph can then be used to run inference. We validate our methods by evaluating them empirically and show that even with a few shared correlations significant speed-ups are possible.

References

  1. P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases. In ICDE, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. S. Arnborg. Efficient algorithms for combinatorial problems on graphs with bounded decomposability - a survey. In BIT, 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. O. Benjelloun, A. Das Sarma, A. Halevy, and J. Widom. ULDBs: Databases with uncertainty and lineage. In VLDB, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. H. C. Bravo and R. Ramakrishnan. Optimizing mpf queries: decision support and prob. inference. In SIGMOD, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. R. Cheng, D. Kalashnikov, and S. Prabhakar. Evaluating prob. queries over imprecise data. In SIGMOD, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. A. Das Sarma, M. Theobald, and J. Widom. Exploiting lineage for confidence computation in uncertain and probabilistic databases. In ICDE, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. R. de Salvo Braz, E. Amir, and D. Roth. Lifted first-order probabilistic inference. In International Joint Conference on Artificial Inteligence, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. A. Deshpande, C. Guestrin, S. Madden, J. M. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In VLDB, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. A. Dovier, C. Piazza, and A. Policriti. A fast bisimulation algorithm. In International Conference on Computer Aided Verification, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In International Joint Conference on Artificial Inteligence, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. N. Fuhr and T. Rolleke. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Transactions on Information Systems, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models with link uncertainty. Journal of Machine Learning Research, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. L. Getoor and B. Taskar, editors. Introduction to Statistical Relational Learning. MIT Press, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. R. Gupta and S. Sarawagi. Creating probabilistic databases from information extraction models. In VLDB, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. C. Huang and A. Darwiche. Inference in belief networks: A procedural guide. International Journal of Approximate Reasoning, 1994.Google ScholarGoogle Scholar
  17. P. C. Kanellakis and S. A. Smolka. CCS expressions, finite state processes, and three problems of equivalence. In ACM Symposium on Principles of Distributed Computing, 1983. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. U. Kjaerulff. Triangulation of graphs - algorithms giving small total state space. Technical report, University of Aalborg, Denmark, 1990.Google ScholarGoogle Scholar
  19. R. Paige and R. E. Tarjan. Three partition refinement algorithms. SIAM Journal of Computing, 1987. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. D. Poole. First-order probabilistic inference. In International Joint Conference on Artificial Inteligence, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. C. Re, N. Dalvi, and D. Suciu. Efficient top-k query evaluation on probabilistic data. In ICDE, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  22. C. Re and D. Suciu. Materialized views in probabilistic databases for information exchange and query optimization. In VLDB, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. I. Rish. Efficient Reasoning in Graphical Models. PhD thesis, University of California, Irvine, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. P. Sen and A. Deshpande. Representing and querying correlated tuples in probabilistic databases. In ICDE, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  25. S. Singh, C. Mayfield, S. Prabhakar, S. Hambrusch, and R. Shah. Indexing uncertain categorical data. In ICDE, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  26. S. Singh, C. Mayfield, R. Shah, S. Prabhakar, S. Hambrusch, J. Neville, and R. Cheng. Database support for probabilistic attributes and tuples. In ICDE, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. N. L. Zhang and D. Poole. A simple approach to bayesian network computations. In Canadian Conference on AI, 1994.Google ScholarGoogle Scholar

Index Terms

  1. Exploiting shared correlations in probabilistic databases

                Recommendations

                Comments

                Login options

                Check if you have access through your login credentials or your institution to get full access on this article.

                Sign in

                Full Access

                PDF Format

                View or Download as a PDF file.

                PDF

                eReader

                View online with eReader.

                eReader