Abstract
There has been a recent surge in work in probabilistic databases, propelled in large part by the huge increase in noisy data sources --- from sensor data, experimental data, data from uncurated sources, and many others. There is a growing need for database management systems that can efficiently represent and query such data. In this work, we show how data characteristics can be leveraged to make the query evaluation process more efficient. In particular, we exploit what we refer to as shared correlations where the same uncertainties and correlations occur repeatedly in the data. Shared correlations occur mainly due to two reasons: (1) Uncertainty and correlations usually come from general statistics and rarely vary on a tuple-to-tuple basis; (2) The query evaluation procedure itself tends to re-introduce the same correlations. Prior work has shown that the query evaluation problem on probabilistic databases is equivalent to a probabilistic inference problem on an appropriately constructed probabilistic graphical model (PGM). We leverage this by introducing a new data structure, called the random variable elimination graph (rv-elim graph) that can be built from the PGM obtained from query evaluation. We develop techniques based on bisimulation that can be used to compress the rv-elim graph exploiting the presence of shared correlations in the PGM, the compressed rv-elim graph can then be used to run inference. We validate our methods by evaluating them empirically and show that even with a few shared correlations significant speed-ups are possible.
- P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases. In ICDE, 2006. Google ScholarDigital Library
- S. Arnborg. Efficient algorithms for combinatorial problems on graphs with bounded decomposability - a survey. In BIT, 1985. Google ScholarDigital Library
- O. Benjelloun, A. Das Sarma, A. Halevy, and J. Widom. ULDBs: Databases with uncertainty and lineage. In VLDB, 2006. Google ScholarDigital Library
- H. C. Bravo and R. Ramakrishnan. Optimizing mpf queries: decision support and prob. inference. In SIGMOD, 2007. Google ScholarDigital Library
- R. Cheng, D. Kalashnikov, and S. Prabhakar. Evaluating prob. queries over imprecise data. In SIGMOD, 2003. Google ScholarDigital Library
- N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, 2004. Google ScholarDigital Library
- A. Das Sarma, M. Theobald, and J. Widom. Exploiting lineage for confidence computation in uncertain and probabilistic databases. In ICDE, 2008. Google ScholarDigital Library
- R. de Salvo Braz, E. Amir, and D. Roth. Lifted first-order probabilistic inference. In International Joint Conference on Artificial Inteligence, 2005. Google ScholarDigital Library
- A. Deshpande, C. Guestrin, S. Madden, J. M. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In VLDB, 2004. Google ScholarDigital Library
- A. Dovier, C. Piazza, and A. Policriti. A fast bisimulation algorithm. In International Conference on Computer Aided Verification, 2001. Google ScholarDigital Library
- N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In International Joint Conference on Artificial Inteligence, 1999. Google ScholarDigital Library
- N. Fuhr and T. Rolleke. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Transactions on Information Systems, 1997. Google ScholarDigital Library
- L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models with link uncertainty. Journal of Machine Learning Research, 2002. Google ScholarDigital Library
- L. Getoor and B. Taskar, editors. Introduction to Statistical Relational Learning. MIT Press, 2007. Google ScholarDigital Library
- R. Gupta and S. Sarawagi. Creating probabilistic databases from information extraction models. In VLDB, 2006. Google ScholarDigital Library
- C. Huang and A. Darwiche. Inference in belief networks: A procedural guide. International Journal of Approximate Reasoning, 1994.Google Scholar
- P. C. Kanellakis and S. A. Smolka. CCS expressions, finite state processes, and three problems of equivalence. In ACM Symposium on Principles of Distributed Computing, 1983. Google ScholarDigital Library
- U. Kjaerulff. Triangulation of graphs - algorithms giving small total state space. Technical report, University of Aalborg, Denmark, 1990.Google Scholar
- R. Paige and R. E. Tarjan. Three partition refinement algorithms. SIAM Journal of Computing, 1987. Google ScholarDigital Library
- D. Poole. First-order probabilistic inference. In International Joint Conference on Artificial Inteligence, 2003. Google ScholarDigital Library
- C. Re, N. Dalvi, and D. Suciu. Efficient top-k query evaluation on probabilistic data. In ICDE, 2007.Google ScholarCross Ref
- C. Re and D. Suciu. Materialized views in probabilistic databases for information exchange and query optimization. In VLDB, 2007. Google ScholarDigital Library
- I. Rish. Efficient Reasoning in Graphical Models. PhD thesis, University of California, Irvine, 1999. Google ScholarDigital Library
- P. Sen and A. Deshpande. Representing and querying correlated tuples in probabilistic databases. In ICDE, 2007.Google ScholarCross Ref
- S. Singh, C. Mayfield, S. Prabhakar, S. Hambrusch, and R. Shah. Indexing uncertain categorical data. In ICDE, 2007.Google ScholarCross Ref
- S. Singh, C. Mayfield, R. Shah, S. Prabhakar, S. Hambrusch, J. Neville, and R. Cheng. Database support for probabilistic attributes and tuples. In ICDE, 2008. Google ScholarDigital Library
- N. L. Zhang and D. Poole. A simple approach to bayesian network computations. In Canadian Conference on AI, 1994.Google Scholar
Index Terms
- Exploiting shared correlations in probabilistic databases
Recommendations
Probabilistic ranking for relational databases based on correlations
PIKM '10: Proceedings of the 3rd workshop on Ph.D. students in information and knowledge managementThis paper proposes a ranking method to exploit statistical correlations among pairs of attribute values in relational databases. For a given query, the correlations of the query are aggregated with each of the attribute values in a tuple to estimate ...
Probabilistic inverse ranking queries in uncertain databases
Query processing in the uncertain database has become increasingly important due to the wide existence of uncertain data in many real applications. Different from handling precise data, the uncertain query processing needs to consider the data ...
Comments