ABSTRACT
Data uncertainty is inherent in emerging applications such as location-based services, sensor monitoring systems, and data integration. To handle a large amount of imprecise information, uncertain databases have been recently developed. In this paper, we study how to efficiently discover frequent itemsets from large uncertain databases, interpreted under the Possible World Semantics. This is technically challenging, since an uncertain database induces an exponential number of possible worlds. To tackle this problem, we propose a novel method to capture the itemset mining process as a Poisson binomial distribution. This model-based approach extracts frequent itemsets with a high degree of accuracy, and supports large databases. We apply our techniques to improve the performance of the algorithms for: (1) finding itemsets whose frequentness probabilities are larger than some threshold; and (2) mining itemsets with the k highest frequentness probabilities. Our approaches support both tuple and attribute uncertainty models, which are commonly used to represent uncertain databases. Extensive evaluation on real and synthetic datasets shows that our methods are highly accurate. Moreover, they are orders of magnitudes faster than previous approaches.
- A. Deshpande et al. Model-driven data acquisition in sensor networks. In VLDB, 2004. Google ScholarDigital Library
- C. Aggarwal, Y. Li, J. Wang, and J. Wang. Frequent pattern mining with uncertain data. In KDD, 2009. Google ScholarDigital Library
- C. Aggarwal and P. Yu. A survey of uncertain data algorithms and applications. TKDE, 21(5), 2009. Google ScholarDigital Library
- R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD, 1993. Google ScholarDigital Library
- C. J. van Rijsbergen. Information Retrieval. Butterworth, 1979. Google ScholarDigital Library
- L. L. Cam. An approximation theorem for the Poisson binomial distribution. In Pacific Journal of Mathematics, volume 10, 1960.Google Scholar
- H. Cheng, P. Yu, and J. Han. Approximate frequent itemset mining in the presence of random noise. SCKDDM, 2008.Google ScholarCross Ref
- R. Cheng, D. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise data. In SIGMOD, 2003. Google ScholarDigital Library
- C. K. Chui, B. Kao, and E. Hung. Mining frequent itemsets from uncertain data. In PAKDD, 2007. Google ScholarDigital Library
- G. Cormode and M. Garofalakis. Sketching probabilistic data streams. In SIGMOD, 2007. Google ScholarDigital Library
- N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, 2004. Google ScholarDigital Library
- J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD, 2000. Google ScholarDigital Library
- J. Huang et al. MayBMS: A Probabilistic Database Management System. In SIGMOD, 2009. Google ScholarDigital Library
- J. Ren and S. Lee and X. Chen and B. Kao and R. Cheng and D. Cheung. Naive Bayes Classification of Uncertain Data. In ICDM, 2009. Google ScholarDigital Library
- R. Jampani, L. Perez, M. Wu, F. Xu, C. Jermaine, and P. Haas. MCDB: A Monte Carlo Approach to Managing Uncertain Data. In SIGMOD, 2008. Google ScholarDigital Library
- N. Khoussainova, M. Balazinska, and D. Suciu. Towards correcting input data errors probabilistically using integrity constraints. In MobiDE, 2006. Google ScholarDigital Library
- H. Kriegel and M. Pfeifle. Density-based clustering of uncertain data. In KDD, 2005. Google ScholarDigital Library
- C. Kuok, A. Fu, and M. Wong. Mining fuzzy association rules in databases. SIGMOD Record, 1998. Google ScholarDigital Library
- A. Lu, Y. Ke, J. Cheng, and W. Ng. Mining vague association rules. In DASFAA, 2007. Google ScholarDigital Library
- M. Mutsuzaki et al. Trio-one: Layering uncertainty and lineage on a conventional dbms. In CIDR, 2007.Google Scholar
- M. Yiu et al. Efficient evaluation of probabilistic advanced spatial queries on existentially uncertain data. TKDE, 21(9), 2009. Google ScholarDigital Library
- P. Sistla et al. Querying the uncertain position of moving objects. In Temporal Databases: Research and Practice. Springer Verlag, 1998.Google Scholar
- C. Stein. Approximate Computation of Expectations. Institute of Mathematical Statistics Lecture Notes - Monograph Series, 7, 1986.Google Scholar
- L. Sun, R. Cheng, D. W. Cheung, and J. Cheng. Mining Uncertain Data with Probabilistic Guarantees. In SIGKDD, 2010. Google ScholarDigital Library
- T. Bernecker et al. Probabilistic frequent itemset mining in uncertain databases. In KDD, 2009. Google ScholarDigital Library
- T. Jayram et al. Avatar information extraction system. IEEE Data Eng. Bulletin, 29(1), 2006.Google Scholar
- S. Tsang, B. Kao, K. Y. Yip, W. Ho, and S. Lee. Decision Trees for Uncertain Data. In ICDE, 2009. Google ScholarDigital Library
- Q. Zhang, F. Li, and K. Yi. Finding frequent items in probabilistic data. In SIGMOD, 2008. Google ScholarDigital Library
Index Terms
- Accelerating probabilistic frequent itemset mining: a model-based approach
Recommendations
Probabilistic frequent itemset mining in uncertain databases
KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data miningProbabilistic frequent itemset mining in uncertain transaction databases semantically and computationally differs from traditional techniques applied to standard "certain" transaction databases. The consideration of existential uncertainty of item(sets),...
An Analytical Study on Frequent Itemset Mining Algorithms
MIKE 2013: Proceedings of the First International Conference on Mining Intelligence and Knowledge Exploration - Volume 8284Data mining is the process of collecting, extracting and analyzing large data set from different perspectives. Fundamental and important task of data mining is the mining of frequent itemsets. Frequent itemsets play an important role in association rule ...
Model-based probabilistic frequent itemset mining
Data uncertainty is inherent in emerging applications such as location-based services, sensor monitoring systems, and data integration. To handle a large amount of imprecise information, uncertain databases have been recently developed. In this paper, ...
Comments