ABSTRACT
We investigate the application of Bayesian networks, Markov random fields, and mixture models to the problem of query answering for transaction data sets. We formulate two versions of the querying problem: the query selectivity estimation (i.e., finding exact counts for tuples in a data set) and the query generalization problem (i.e., computing the probability that a tuple will occur in new data). We show that frequent itemsets are useful for reducing the original data to a compressed representation and introduce a method to store them using an ADTree data structure. In an extension of our earlier work on this topic we propose several new schemes for query answering based on the compressed representation that avoid direct scans of the data at query time. Experimental results on real-world transaction data sets provide insights into various tradeoffs involving the offline time for model-building, the online time for query-answering, the memory footprint of the compressed data, and the accuracy of the estimate provided to the query.
- 1.R. Agrawal, T. Imielinski, and A. Swami. Mining association hales between sets of items in large databases. In Proceedings of A CM SIGMOD Conference, pages 207-216, 1993. Google ScholarDigital Library
- 2.R. Agrawal and R. Srikant. Fast algorithms for mining association rifles in large databases. In Proceedings of VLDB-94, pages 487-499, 1994. Google ScholarDigital Library
- 3.B. S. Anderson and A. W. Moore. ADtrees for fast counting and for fast learning of association rules. In Proceedings of KDD-98, pages 134-138, 1998.Google Scholar
- 4.A. Berger, S. D. Pietra, and V. D. Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):pages 39-72, 1996. Google ScholarDigital Library
- 5.K. Chakrabarti, M. N. Garofalakis, R. Rastogi, and K. Shim. Approximate query processing using wavelets. In The VLDB Journal, pages 111-122, 2000. Google ScholarDigital Library
- 6.D. Chickering, D. Heckerman, and C. Meek. A Bayesian approach to learning Bayesian networks with local structure. In Proceedings of UAI-97, pages 80-89, 1997. Google ScholarDigital Library
- 7.J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics, 43:pages 1470-1480, 1972.Google ScholarCross Ref
- 8.R. Dechter. Bucket elimination: A unifying framework for probabilistic inference. In Proceedings of UAI-96, pages 211-219, 1996. Google ScholarDigital Library
- 9.P. B. Gibbons and Y. Matias. New sampllng-based summary statistics for improving approximate query answers. In Proceedings A CM SIGMOD Conference, pages 331-342, 1998. Google ScholarDigital Library
- 10.R. J. Lipton, J. F. Naughton, and D. A. Schneider. Practical selectivity estimation through adaptive sampling. SIGMOD Record, 19(2):pages 1-11, 1990. Google ScholarDigital Library
- 11.H. Mannila, D. Pavlov, and P. Smyth. Predictions with local patterns using cross-entropy. In Proceedings KDD-99, pages 357-361, 1999. Google ScholarDigital Library
- 12.Y. Matias, J. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. In Proceedings A CM SIGMOD Conference, pages 448-459, 1998. Google ScholarDigital Library
- 13.A. W. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. Journal of Artificial Intelligence Research, 8:pages 67-91, 1998. Google ScholarDigital Library
- 14.M. Muralikrishna and D. DeWitt. Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In Proceedings of A CM SIGMOD Conference, pages 28-36, 1988. Google ScholarDigital Library
- 15.D. Parlor, H. Mannila, and P. Smyth. Probabilistic models for query approximation with large sparse binary data sets. In Proceedings of UAI-2000, pages 465-472, 2000. Google ScholarDigital Library
- 16.D. Pavlov, H. Mannila, and P. Smyth. Beyond independence: Probabilistic models for query approximation on binary transaction data. In Technical report ICS TR-01-09, UC Irvine, 2001.Google Scholar
- 17.J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kanfmann Publishers Inc., 1988. Google ScholarDigital Library
- 18.V. Poosala and Y. Ioannidis. Selectivity estimation without the attribute value independence assumption. In The VLDB Journal, pages 486-495, 1997. Google ScholarDigital Library
- 19.D. Pynadath and M. Wellman. Generalized queries on probabilistic context-free grammars. In AAAI, volume 2, pages 1285-1290, 1996. Google ScholarDigital Library
- 20.J. Rissanen. A universal prior for integers and estimation by minimum description length. The Annals of Statistics, 11(2):pages 416-431, 1983.Google Scholar
- 21.J. Shanmugasundaram, U. Fayyad, and P. Bradley. Compressed data cubes for OLAP aggregate query approximation on continuous dimensions. In Proceedings of KDD-99, pages 223-232, 1999. Google ScholarDigital Library
Index Terms
- Probabilistic query models for transaction data
Recommendations
Query ranking in probabilistic XML data
EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database TechnologyTwig queries have been extensively studied as a major fragment of XPATH queries to query XML data. In this paper, we study PXML-RANK query, (Q, k), which is to rank top-k probabilities of the answers of a twig query Q in probabilistic XML (PXML) data. A ...
Probabilistic query answering over inconsistent databases
This paper presents a framework for querying inconsistent databases in the presence of functional dependencies. Most of the works dealing with the problem of extracting reliable information from inconsistent databases are based on the notion of repair, ...
Probabilistic Reverse Top-k Query on Probabilistic Data
Databases Theory and ApplicationsAbstractReverse top-k queries have received much attention from research communities. The result of reverse top-k queries is a set of objects, which had the k-most interest based on their objects’ references. Moreover, answering the queries on ...
Comments