skip to main content
10.1145/502512.502536acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
Article

Probabilistic query models for transaction data

Authors Info & Claims
Published:26 August 2001Publication History

ABSTRACT

We investigate the application of Bayesian networks, Markov random fields, and mixture models to the problem of query answering for transaction data sets. We formulate two versions of the querying problem: the query selectivity estimation (i.e., finding exact counts for tuples in a data set) and the query generalization problem (i.e., computing the probability that a tuple will occur in new data). We show that frequent itemsets are useful for reducing the original data to a compressed representation and introduce a method to store them using an ADTree data structure. In an extension of our earlier work on this topic we propose several new schemes for query answering based on the compressed representation that avoid direct scans of the data at query time. Experimental results on real-world transaction data sets provide insights into various tradeoffs involving the offline time for model-building, the online time for query-answering, the memory footprint of the compressed data, and the accuracy of the estimate provided to the query.

References

  1. 1.R. Agrawal, T. Imielinski, and A. Swami. Mining association hales between sets of items in large databases. In Proceedings of A CM SIGMOD Conference, pages 207-216, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. 2.R. Agrawal and R. Srikant. Fast algorithms for mining association rifles in large databases. In Proceedings of VLDB-94, pages 487-499, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. 3.B. S. Anderson and A. W. Moore. ADtrees for fast counting and for fast learning of association rules. In Proceedings of KDD-98, pages 134-138, 1998.Google ScholarGoogle Scholar
  4. 4.A. Berger, S. D. Pietra, and V. D. Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):pages 39-72, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. 5.K. Chakrabarti, M. N. Garofalakis, R. Rastogi, and K. Shim. Approximate query processing using wavelets. In The VLDB Journal, pages 111-122, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. 6.D. Chickering, D. Heckerman, and C. Meek. A Bayesian approach to learning Bayesian networks with local structure. In Proceedings of UAI-97, pages 80-89, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. 7.J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics, 43:pages 1470-1480, 1972.Google ScholarGoogle ScholarCross RefCross Ref
  8. 8.R. Dechter. Bucket elimination: A unifying framework for probabilistic inference. In Proceedings of UAI-96, pages 211-219, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. 9.P. B. Gibbons and Y. Matias. New sampllng-based summary statistics for improving approximate query answers. In Proceedings A CM SIGMOD Conference, pages 331-342, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. 10.R. J. Lipton, J. F. Naughton, and D. A. Schneider. Practical selectivity estimation through adaptive sampling. SIGMOD Record, 19(2):pages 1-11, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. 11.H. Mannila, D. Pavlov, and P. Smyth. Predictions with local patterns using cross-entropy. In Proceedings KDD-99, pages 357-361, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. 12.Y. Matias, J. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. In Proceedings A CM SIGMOD Conference, pages 448-459, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. 13.A. W. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. Journal of Artificial Intelligence Research, 8:pages 67-91, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. 14.M. Muralikrishna and D. DeWitt. Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In Proceedings of A CM SIGMOD Conference, pages 28-36, 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. 15.D. Parlor, H. Mannila, and P. Smyth. Probabilistic models for query approximation with large sparse binary data sets. In Proceedings of UAI-2000, pages 465-472, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. 16.D. Pavlov, H. Mannila, and P. Smyth. Beyond independence: Probabilistic models for query approximation on binary transaction data. In Technical report ICS TR-01-09, UC Irvine, 2001.Google ScholarGoogle Scholar
  17. 17.J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kanfmann Publishers Inc., 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. 18.V. Poosala and Y. Ioannidis. Selectivity estimation without the attribute value independence assumption. In The VLDB Journal, pages 486-495, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. 19.D. Pynadath and M. Wellman. Generalized queries on probabilistic context-free grammars. In AAAI, volume 2, pages 1285-1290, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. 20.J. Rissanen. A universal prior for integers and estimation by minimum description length. The Annals of Statistics, 11(2):pages 416-431, 1983.Google ScholarGoogle Scholar
  21. 21.J. Shanmugasundaram, U. Fayyad, and P. Bradley. Compressed data cubes for OLAP aggregate query approximation on continuous dimensions. In Proceedings of KDD-99, pages 223-232, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Probabilistic query models for transaction data

              Recommendations

              Comments

              Login options

              Check if you have access through your login credentials or your institution to get full access on this article.

              Sign in
              • Published in

                cover image ACM Conferences
                KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
                August 2001
                493 pages
                ISBN:158113391X
                DOI:10.1145/502512

                Copyright © 2001 ACM

                Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                Publisher

                Association for Computing Machinery

                New York, NY, United States

                Publication History

                • Published: 26 August 2001

                Permissions

                Request permissions about this article.

                Request Permissions

                Check for updates

                Qualifiers

                • Article

                Acceptance Rates

                KDD '01 Paper Acceptance Rate31of237submissions,13%Overall Acceptance Rate1,133of8,635submissions,13%

                Upcoming Conference

                KDD '24

              PDF Format

              View or Download as a PDF file.

              PDF

              eReader

              View online with eReader.

              eReader