Abstract
We assume a dataset of transactions generated by a set of users over structured items where each item could be described through a set of features. In this paper, we are interested in identifying the frequent featuresets (set of features) by mining item transactions. For example, in a news website, items correspond to news articles, the features are the named-entities/topics in the articles and an item transaction would be the set of news articles read by a user within the same session. We show that mining frequent featuresets over structured item transactions is a novel problem and show that straightforward extensions of existing frequent itemset mining techniques provide unsatisfactory results. This is due to the fact that while users are drawn to each item in the transaction due to a subset of its features, the transaction by itself does not provide any information about such underlying preferred features of users. In order to overcome this hurdle, we propose a featureset uncertainty model where each item transaction could have been generated by various featuresets with different probabilities. We describe a novel approach to transform item transactions into uncertain transaction over featuresets and estimate their probabilities using constrained least squares based approach. We propose diverse algorithms to mine frequent featuresets. Our experimental evaluation provides a comparative analysis of the different approaches proposed.
- C. C. Aggarwal, Y. Li, J. Wang, and J. Wang. Frequent pattern mining with uncertain data. In SIGKDD, pages 29--38. ACM, 2009. Google ScholarDigital Library
- C. C. Aggarwal and P. S. Yu. A survey of uncertain data algorithms and applications. IEEE TKDE, 2009. Google ScholarDigital Library
- R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD, pages 207--216, New York, NY, USA, 1993. Google ScholarDigital Library
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB, 1994. Google ScholarDigital Library
- L. Antova, T. Jansen, C. Koch, and D. Olteanu. Fast and simple relational processing of uncertain data. In ICDE, pages 983--992. IEEE, 2008. Google ScholarDigital Library
- O. Benjelloun, A. D. Sarma, A. Halevy, and J. Widom. Uldbs: Databases with uncertainty and lineage. In VLDB, 2006. Google ScholarDigital Library
- T. Bernecker, H.-P. Kriegel, M. Renz, F. Verhein, and A. Zuefle. Probabilistic frequent itemset mining in uncertain databases. In SIGKDD, pages 119--128. ACM, 2009. Google ScholarDigital Library
- Å. Björck. Numerical Methods for Least Squares Problems. SIAM, Philadelphia, 1996.Google Scholar
- C. K. Chui, B. Kao, and E. Hung. Mining frequent itemsets from uncertain data. In PAKDD, pages 47--58, 2007. Google ScholarDigital Library
- N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. VLDBJ, 16(4): 523--544, 2007. Google ScholarDigital Library
- P. Damaschke. The union of minimal hitting sets: parameterized combinatorial bounds and counting. In STACS, pages 332--343, 2007. Google ScholarDigital Library
- M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., 1979. Google ScholarDigital Library
- R. Gupta, G. Fang, B. Field, M. Steinbach, and V. Kumar. Quantitative evaluation of approximate frequent pattern mining algorithms. In KDD, pages 301--309, 2008. Google ScholarDigital Library
- T. Hofmann. Learning what people (don't) want. In EMCL, EMCL '01, pages 214--225, 2001. Google ScholarDigital Library
- C. K.-S. Leung, M. A. F. Mateo, and D. A. Brajczuk. A tree-based approach for frequent pattern mining from uncertain data. In AKDDM. Springer, 2008. Google ScholarDigital Library
- R. Reiter. A theory of diagnosis from first principles. Artif. Intell., 32(1): 57--95, Apr. 1987. Google ScholarDigital Library
- P. Sen, A. Deshpande, and L. Getoor. Representing tuple and attribute uncertainty in probabilistic databases. In Data Mining Workshops, ICDM, pages 507--512. IEEE, 2007. Google ScholarDigital Library
- L. Sun, R. Cheng, D. W. Cheung, and J. Cheng. Mining uncertain data with probabilistic guarantees. In SIGKDD, pages 273--282. ACM, 2010. Google ScholarDigital Library
- H. Toivonen. Sampling large databases for association rules. In VLDB, pages 134--145, 1996. Google ScholarDigital Library
- Y. Tong, L. Chen, Y. Cheng, and P. S. Yu. Mining frequent itemsets over uncertain databases. VLDB, 2012. Google ScholarDigital Library
- P. Venetis, G. Koutrika, and H. Garcia-Molina. On the selection of tags for tag clouds. In WSDM, 2011. Google ScholarDigital Library
- Q. Zhang, F. Li, and K. Yi. Finding frequent items in probabilistic data. SIGMOD '08, 2008. Google ScholarDigital Library
Recommendations
Reference itemsets: useful itemsets to approximate the representation of frequent itemsets
Deriving frequent itemsets from databases is an important research issue in data mining. The number of frequent itemsets may be unusually large when a low minimum support threshold is given. As such, the design of a compact representation to compress ...
Efficient algorithms for mining high-utility itemsets in uncertain databases
High-utility itemset mining (HUIM) is a useful set of techniques for discovering patterns in transaction databases, which considers both quantity and profit of items. However, most algorithms for mining high-utility itemsets (HUIs) assume that the ...
Hiding co-occurring frequent itemsets
EDBT/ICDT '09: Proceedings of the 2009 EDBT/ICDT WorkshopsKnowledge hiding, hiding rules/patterns that are inferable from published data and attributed sensitive, is extensively studied in the literature in the context of frequent itemsets and association rules mining from transactional data. The research in ...
Comments