Article

Probabilistic query models for transaction data

Authors:
Dmitry Pavlov

Information and Computer Science, University of California, Irvine, Irvine, CA

Information and Computer Science, University of California, Irvine, Irvine, CA
View Profile

,
Padhraic Smyth

Information and Computer Science, University of California, Irvine, Irvine, CA

Information and Computer Science, University of California, Irvine, Irvine, CA
View Profile

KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data miningAugust 2001Pages 164–173https://doi.org/10.1145/502512.502536

Published:26 August 2001Publication History

KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 164–173

ABSTRACT

We investigate the application of Bayesian networks, Markov random fields, and mixture models to the problem of query answering for transaction data sets. We formulate two versions of the querying problem: the query selectivity estimation (i.e., finding exact counts for tuples in a data set) and the query generalization problem (i.e., computing the probability that a tuple will occur in new data). We show that frequent itemsets are useful for reducing the original data to a compressed representation and introduce a method to store them using an ADTree data structure. In an extension of our earlier work on this topic we propose several new schemes for query answering based on the compressed representation that avoid direct scans of the data at query time. Experimental results on real-world transaction data sets provide insights into various tradeoffs involving the offline time for model-building, the online time for query-answering, the memory footprint of the compressed data, and the accuracy of the estimate provided to the query.

References

1.R. Agrawal, T. Imielinski, and A. Swami. Mining association hales between sets of items in large databases. In Proceedings of A CM SIGMOD Conference, pages 207-216, 1993. Google ScholarDigital Library
2.R. Agrawal and R. Srikant. Fast algorithms for mining association rifles in large databases. In Proceedings of VLDB-94, pages 487-499, 1994. Google ScholarDigital Library
3.B. S. Anderson and A. W. Moore. ADtrees for fast counting and for fast learning of association rules. In Proceedings of KDD-98, pages 134-138, 1998.Google Scholar
4.A. Berger, S. D. Pietra, and V. D. Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):pages 39-72, 1996. Google ScholarDigital Library
5.K. Chakrabarti, M. N. Garofalakis, R. Rastogi, and K. Shim. Approximate query processing using wavelets. In The VLDB Journal, pages 111-122, 2000. Google ScholarDigital Library
6.D. Chickering, D. Heckerman, and C. Meek. A Bayesian approach to learning Bayesian networks with local structure. In Proceedings of UAI-97, pages 80-89, 1997. Google ScholarDigital Library
7.J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics, 43:pages 1470-1480, 1972.Google ScholarCross Ref
8.R. Dechter. Bucket elimination: A unifying framework for probabilistic inference. In Proceedings of UAI-96, pages 211-219, 1996. Google ScholarDigital Library
9.P. B. Gibbons and Y. Matias. New sampllng-based summary statistics for improving approximate query answers. In Proceedings A CM SIGMOD Conference, pages 331-342, 1998. Google ScholarDigital Library
10.R. J. Lipton, J. F. Naughton, and D. A. Schneider. Practical selectivity estimation through adaptive sampling. SIGMOD Record, 19(2):pages 1-11, 1990. Google ScholarDigital Library
11.H. Mannila, D. Pavlov, and P. Smyth. Predictions with local patterns using cross-entropy. In Proceedings KDD-99, pages 357-361, 1999. Google ScholarDigital Library
12.Y. Matias, J. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. In Proceedings A CM SIGMOD Conference, pages 448-459, 1998. Google ScholarDigital Library
13.A. W. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. Journal of Artificial Intelligence Research, 8:pages 67-91, 1998. Google ScholarDigital Library
14.M. Muralikrishna and D. DeWitt. Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In Proceedings of A CM SIGMOD Conference, pages 28-36, 1988. Google ScholarDigital Library
15.D. Parlor, H. Mannila, and P. Smyth. Probabilistic models for query approximation with large sparse binary data sets. In Proceedings of UAI-2000, pages 465-472, 2000. Google ScholarDigital Library
16.D. Pavlov, H. Mannila, and P. Smyth. Beyond independence: Probabilistic models for query approximation on binary transaction data. In Technical report ICS TR-01-09, UC Irvine, 2001.Google Scholar
17.J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kanfmann Publishers Inc., 1988. Google ScholarDigital Library
18.V. Poosala and Y. Ioannidis. Selectivity estimation without the attribute value independence assumption. In The VLDB Journal, pages 486-495, 1997. Google ScholarDigital Library
19.D. Pynadath and M. Wellman. Generalized queries on probabilistic context-free grammars. In AAAI, volume 2, pages 1285-1290, 1996. Google ScholarDigital Library
20.J. Rissanen. A universal prior for integers and estimation by minimum description length. The Annals of Statistics, 11(2):pages 416-431, 1983.Google Scholar
21.J. Shanmugasundaram, U. Fayyad, and P. Bradley. Compressed data cubes for OLAP aggregate query approximation on continuous dimensions. In Proceedings of KDD-99, pages 223-232, 1999. Google ScholarDigital Library

Index Terms

Probabilistic query models for transaction data

Recommendations

Query ranking in probabilistic XML data
EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology

Twig queries have been extensively studied as a major fragment of XPATH queries to query XML data. In this paper, we study PXML-RANK query, (Q, k), which is to rank top-k probabilities of the answers of a twig query Q in probabilistic XML (PXML) data. A ...
Read More
Probabilistic query answering over inconsistent databases

This paper presents a framework for querying inconsistent databases in the presence of functional dependencies. Most of the works dealing with the problem of extracting reliable information from inconsistent databases are based on the notion of repair, ...
Read More
Probabilistic Reverse Top-k Query on Probabilistic Data
Databases Theory and Applications
Abstract
Reverse top-k queries have received much attention from research communities. The result of reverse top-k queries is a set of objects, which had the k-most interest based on their objects’ references. Moreover, answering the queries on ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
August 2001
493 pages
ISBN:158113391X
DOI:10.1145/502512
Conference Chair:
Doheon Lee
Chonnam National University, Korea
,
General Chair:
Mario Schkolnick
SGI
,
Program Chairs:
Foster Provost
New York University
,
Ramakrishnan Srikant
IBM Almaden Research Center
Copyright © 2001 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 26 August 2001
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- Article
Conference

Acceptance Rates
KDD '01 Paper Acceptance Rate31of237submissions,13%Overall Acceptance Rate1,133of8,635submissions,13%
More
Upcoming Conference
KDD '24

Sponsor:

sigkdd

sigkdd

The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona , Spain
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 7
  Total Citations
  View Citations
- 538
  Total Downloads
- Downloads (Last 12 months)0
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Probabilistic query models for transaction data

KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining

ABSTRACT

References

Cited By

Index Terms

Recommendations

Query ranking in probabilistic XML data

Probabilistic query answering over inconsistent databases

Probabilistic Reverse Top-k Query on Probabilistic Data