research-article

Accelerating probabilistic frequent itemset mining: a model-based approach

Authors:
Liang Wang, The University of Hong Kong, Hong Kong
Reynold Cheng, The University of Hong Kong, Hong Kong
Sau Dan Lee, The University of Hong Kong, Hong Kong
David Cheung, The University of Hong Kong, Hong Kong

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management, October 2010, Pages 429–438, https://doi.org/10.1145/1871437.1871494

Published: 26 October 2010

ABSTRACT

Data uncertainty is inherent in emerging applications such as location-based services, sensor monitoring systems, and data integration. To handle large amounts of imprecise information, uncertain databases have recently been developed. In this paper, we study how to efficiently discover frequent itemsets from large uncertain databases, interpreted under the Possible World Semantics. This is technically challenging, since an uncertain database induces an exponential number of possible worlds. To tackle this problem, we propose a novel method that models the itemset mining process with a Poisson binomial distribution. This model-based approach extracts frequent itemsets with a high degree of accuracy and scales to large databases. We apply our techniques to improve the performance of algorithms for: (1) finding itemsets whose frequentness probabilities are larger than some threshold; and (2) mining itemsets with the k highest frequentness probabilities. Our approaches support both the tuple and attribute uncertainty models, which are commonly used to represent uncertain databases. Extensive evaluation on real and synthetic datasets shows that our methods are highly accurate. Moreover, they are orders of magnitude faster than previous approaches.
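
To make the Poisson binomial idea concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a simplified setting in which transaction i contains a given itemset independently with probability p_i, so the itemset's support count follows a Poisson binomial distribution and its frequentness probability is P(support >= minsup). The function names `exact_frequentness` and `poisson_frequentness`, and the parameters `probs` and `minsup`, are hypothetical. The approximation shown is the classical Poisson approximation with mean equal to the sum of the p_i, whose error is bounded by the sum of the squared p_i (Le Cam's inequality); the paper's own approximation may differ.

```python
import math


def exact_frequentness(probs, minsup):
    """Exact P(support >= minsup) for a Poisson binomial support count.

    probs[i] is the (assumed independent) probability that transaction i
    contains the itemset. Runs in O(len(probs) * minsup) time.
    """
    if minsup <= 0:
        return 1.0
    # dist[j] = P(support == j) for j < minsup;
    # dist[minsup] absorbs all mass with support >= minsup.
    dist = [0.0] * (minsup + 1)
    dist[0] = 1.0
    for p in probs:
        # Go downwards so dist[j - 1] still holds the previous step's value.
        for j in range(minsup, 0, -1):
            if j == minsup:
                dist[j] += dist[j - 1] * p
            else:
                dist[j] = dist[j] * (1.0 - p) + dist[j - 1] * p
        dist[0] *= 1.0 - p
    return dist[minsup]


def poisson_frequentness(probs, minsup):
    """Approximate P(support >= minsup) with a Poisson(sum(probs)) tail.

    Needs only the expected support, so the cost of the distribution step
    no longer depends on the number of possible worlds.
    """
    if minsup <= 0:
        return 1.0
    lam = sum(probs)
    term = math.exp(-lam)  # Poisson pmf at k = 0
    cdf = 0.0
    for k in range(minsup):
        cdf += term
        term *= lam / (k + 1)  # pmf(k + 1) from pmf(k)
    return max(0.0, 1.0 - cdf)


if __name__ == "__main__":
    # Hypothetical data: 200 transactions, each containing the itemset
    # with probability 0.01.
    probs = [0.01] * 200
    print(exact_frequentness(probs, 3))
    print(poisson_frequentness(probs, 3))
```

The approximation is most accurate when the individual probabilities are small; an itemset would then be reported as probabilistically frequent when the returned value exceeds the user's probability threshold.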

Index Terms

Accelerating probabilistic frequent itemset mining: a model-based approach
1. Information systems
  1. Information systems applications
    1. Data mining
2. Mathematics of computing
  1. Probability and statistics
    1. Distribution functions

Published in
CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
October 2010
2036 pages
ISBN:9781450300995
DOI:10.1145/1871437
General Chair:
Jimmy Huang, York University, Canada
Program Chairs:
Nick Koudas, University of Toronto, Canada
Gareth Jones, Dublin City University, Ireland
Xindong Wu, University of Vermont, USA
Kevyn Collins-Thompson, Microsoft Research, USA
Aijun An, York University, Canada
Copyright © 2010 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
approximation algorithm
frequent itemset
uncertain database
Qualifiers
- research-article
Acceptance Rates
Overall acceptance rate: 1,861 of 8,427 submissions (22%)