Abstract
Knowledge discovery from data is an inherently iterative process. That is, what we know about the data greatly determines our expectations, and therefore, what results we would find interesting and/or surprising. Given new knowledge about the data, our expectations will change. Hence, in order to avoid redundant results, knowledge discovery algorithms ideally should follow such an iterative updating procedure.
With this in mind, we introduce a well-founded approach for succinctly summarizing data with the most informative itemsets; using a probabilistic maximum entropy model, we iteratively find the itemset that provides us the most novel information—that is, for which the frequency in the data surprises us the most—and in turn we update our model accordingly. As we use the maximum entropy principle to obtain unbiased probabilistic models, and only include those itemsets that are most informative with regard to the current model, the summaries we construct are guaranteed to be both descriptive and nonredundant.
The algorithm that we present, called mtv, can either discover the top-k most informative itemsets, or we can employ either the Bayesian Information Criterion (bic) or the Minimum Description Length (mdl) principle to automatically identify the set of itemsets that together summarize the data well. In other words, our method will “tell you what you need to know” about the data. Importantly, it is a one-phase algorithm: rather than picking itemsets from a user-provided candidate set, itemsets and their supports are mined on-the-fly. To further its applicability, we provide an efficient method to compute the maximum entropy distribution using Quick Inclusion-Exclusion.
Experiments on our method, using synthetic, benchmark, and real data, show that the discovered summaries are succinct, and correctly identify the key patterns in the data. The models they form attain high likelihoods, and inspection shows that they summarize the data well with increasingly specific, yet nonredundant itemsets.
- Aggarwal, C. C. and Yu, P. S. 1998. A new framework for itemset generation. In Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM, 18--24. Google ScholarDigital Library
- Baumgartner, C., Böhm, C., and Baumgartner, D. 2005. Modelling of classification rules on metabolic patterns including machine learning and expert knowledge. Biomed. Inf. 38, 2, 89--98. Google ScholarDigital Library
- Boulicaut, J., Bykowski, A., and Rigotti, C. 2003. Free-Sets: A condensed representation of boolean data for the approximation of frequency queries. Data Min. Knowl. Discov. 7, 1, 5--22. Google ScholarDigital Library
- Brijs, T., Swinnen, G., Vanhoof, K., and Wets, G. 1999. Using association rules for product assortment decisions: A case study. In Proceedings of the 5th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 254--260. Google ScholarDigital Library
- Brin, S., Motwani, R., and Silverstein, C. 1997. Beyond market baskets: Generalizing association rules to correlations. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). ACM, 265--276. Google ScholarDigital Library
- Calders, T. and Goethals, B. 2006. Quick inclusion-exclusion. In Proceedings of the ECML PKDD Workshop on Knowledge Discovery in Inductive Databases (KDID). Springer, 86--103. Google ScholarDigital Library
- Calders, T. and Goethals, B. 2007. Non-Derivable itemset mining. Data Min. Knowl. Discov. 14, 1, 171--206. Google ScholarDigital Library
- Cover, T. M. and Thomas, J. A. 2006. Elements of Information Theory. Wiley-Interscience, New York. Google ScholarDigital Library
- Cowell, R. G., Dawid, A. P., Lauritzen, S. L., and Spiegelhalter, D. J. 1999. Probabilistic networks and expert systems. In Statistics for Engineering and Information Science, M. Jordan, S. L. L. nad Jeral F. Lawless, and V. Nair, Eds., Springer.Google Scholar
- Csiszár, I. 1975. I-Divergence geometry of probability distributions and minimization problems. Ann. Probab. 3, 1, 146--158.Google ScholarCross Ref
- Darroch, J. and Ratcliff, D. 1972. Generalized iterative scaling for log-linear models. Ann. Math. Statis. 43, 5, 1470--1480.Google ScholarCross Ref
- De Bie, T. 2011a. An information theoretic framework for data mining. In Proceedings of the 17th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 564--572. Google ScholarDigital Library
- De Bie, T. 2011b. Maximum entropy models and subjective interestingness: An application to tiles in binary databases. Data Min. Knowl. Discov. 23, 3, 407--446. Google ScholarDigital Library
- Frank, A. and Asuncion, A. 2010. UCI machine learning repository. http://archive.ics.uci.edu/ml.Google Scholar
- Gallo, A., Cristianini, N., and De Bie, T. 2007. MINI: Mining informative non-redundant itemsets. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD). Springer, 438--445. Google ScholarDigital Library
- Garriga, G. C., Junttila, E., and Mannila, H. 2011. Banded structure in binary matrices. Knowl. Inf. Syst. 28, 1, 197--226. Google ScholarDigital Library
- Geerts, F., Goethals, B., and Mielikäinen, T. 2004. Tiling databases. In Proceedings of Discovery Science Conference. 278--289.Google Scholar
- Gelman, A., Carlin, J., Stern, H., and Rubin, D. 2004. Bayesian Data Analysis. CRC Press.Google Scholar
- Geng, L. and Hamilton, H. J. 2006. Interestingness measures for data mining: A survey. ACM Comput. Surv. 38, 3, 9. Google ScholarDigital Library
- Geurts, K., Wets, G., Brijs, T., and Vanhoof, K. 2003. Profiling high frequency accident locations using association rules. In Proceedings of the 82nd Annual Transportation Research Board. 1--18.Google Scholar
- Gionis, A., Mannila, H., Mielikäinen, T., and Tsaparas, P. 2007. Assessing data mining results via swap randomization. ACM Trans. Knowl. Discov. Data 1, 3, 167--176. Google ScholarDigital Library
- Goethals, B. and Zaki, M. 2004. Frequent itemset mining dataset repository (FIMI). http://fimi.ua.ac.be/.Google Scholar
- Grünwald, P. 2005. Minimum description length tutorial. In Advances in Minimum Description Length, P. Grünwald and I. Myung, Eds., MIT Press.Google Scholar
- Grünwald, P. 2007. The Minimum Description Length Principle. MIT Press.Google Scholar
- Hanhijärvi, S., Ojala, M., Vuokko, N., Puolamäki, K., Tatti, N., and Mannila, H. 2009. Tell me something I don't know: Randomization strategies for iterative data mining. In Proceedings of the 15th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 379--388. Google ScholarDigital Library
- Heikinheimo, H., Fortelius, M., Erronen, J., and Mannila, H. 2007. Biogeography of European land mammals shows environmentally distinct and spatially coherent clusters. J. Biogeograph. 34, 6, 1053--1064.Google ScholarCross Ref
- Jaroszewicz, S. and Simovici, D. A. 2004. Interestingness of frequent itemsets using bayesian networks as background knowledge. In Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 178--186. Google ScholarDigital Library
- Jaynes, E. 1982. On the rationale of maximum-entropy methods. Proc. IEEE 70, 9, 939--952.Google ScholarCross Ref
- Kontonasios, K.-N. and De Bie, T. 2010. An information-theoretic approach to finding noisy tiles in binary databases. In Proceedings of the 10th SIAM International Conference on Data Mining (SDM). SIAM, 153--164.Google Scholar
- Li, M. and Vitányi, P. 1993. An Introduction to Kolmogorov Complexity and its Applications. Springer. Google ScholarDigital Library
- Mampaey, M., Tatti, N., and Vreeken, J. 2011. Tell me what I need to know: Succinctly summarizing data with itemsets. In Proceedings of the 17th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 573--581. Google ScholarDigital Library
- Mampaey, M. and Vreeken, J. 2010. Summarising data by clustering items. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD). Springer, 321--336. Google ScholarDigital Library
- Mitchell-Jones, A., Amori, G., Bogdanowicz, W., Krystufek, B., Reijnders, P. H., Spitzenberger, F., Stubbe, M., Thissen, J., Vohralik, V., and Zima, J. 1999. The Atlas of European Mammals. Academic Press.Google Scholar
- Myllykangas, S., Himberg, J., Böhling, T., Nagy, B., Hollmén, J., and Knuutila, S. 2006. DNA copy number amplification profiling of human neoplasms. Oncogene 25, 55, 7324--7332.Google ScholarCross Ref
- Nijssen, S., Guns, T., and De Raedt, L. 2009. Correlated itemset mining in ROC space: A constraint programming approach. In Proceedings of the 15th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Springer, 647--656. Google ScholarDigital Library
- Pasquier, N., Bastide, Y., Taouil, R., and Lakhal, L. 1999. Discovering frequent closed itemsets for association rules. In Proceedings of the 7th International Conference on Database Theory (ICDT). ACM, 398--416. Google ScholarDigital Library
- Rissanen, J. 1978. Modeling by shortest data description. Automatica 14, 1, 465--471. Google ScholarDigital Library
- Schwarz, G. 1978. Estimating the dimension of a model. Ann. Statist. 6, 2, 461--464.Google ScholarCross Ref
- Shaffer, J. 1995. Multiple hypothesis testing. Ann. Rev. Psychol. 46, 1, 561--584.Google ScholarCross Ref
- Siebes, A. and Kersten, R. 2011. A structure function for transaction data. In Proceedings of the 11th SIAM International Conference on Data Mining (SDM). SIAM, 558--569.Google Scholar
- Siebes, A., Vreeken, J., and van Leeuwen, M. 2006. Item sets that compress. In Proceedings of the 6th SIAM International Conference on Data Mining (SDM). SIAM, 393--404.Google ScholarCross Ref
- Smets, K. and Vreeken, J. 2012. Slim: Directly mining descriptive patterns. In Proceedings of the 12th SIAM International Conference on Data Mining (SDM). SIAM, 236--247.Google Scholar
- Tan, P., Kumar, V., and Srivastava, J. 2002. Selecting the right interestingness measure for association patterns. In Proceedings of the 8th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 32--41. Google ScholarDigital Library
- Tatti, N. 2006. Computational complexity of queries based on itemsets. Inf. Process. Lett. 98, 5, 183--187. Google ScholarDigital Library
- Tatti, N. 2008. Maximum entropy based significance of itemsets. Knowl. Inf. Syst. 17, 1, 57--77. Google ScholarDigital Library
- Tatti, N. 2010. Probably the best itemsets. In Proceedings of the 16th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 293--302. Google ScholarDigital Library
- Tatti, N. and Heikinheimo, H. 2008. Decomposable families of itemsets. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD). Springer, 472--487. Google ScholarDigital Library
- Tatti, N. and Mampaey, M. 2010. Using background knowledge to rank itemsets. Data Min. Knowl. Discov. 21, 2, 293--309. Google ScholarDigital Library
- Van den Bulcke, T., Vanden Broucke, P., Van Hoof, V., Wouters, K., Vanden Broucke, S., Smits, G., Smits, E., Proesmans, S., Van Genechten, T., and Eyskens, F. 2011. Data mining methods for classification of Medium-Chain Acyl-CoA dehydrogenase deficiency (MCADD) using non-derivatized tandem MS neonatal screening data. J. Biomed. Inf. 44, 2, 319--325. Google ScholarDigital Library
- Vreeken, J., van Leeuwen, M., and Siebes, A. 2007. Preserving privacy through data generation. In Proceedings of the 7th IEEE International Conference on Data Mining (ICDM). IEEE, 685--690. Google ScholarDigital Library
- Vreeken, J., van Leeuwen, M., and Siebes, A. 2011. Krimp: Mining itemsets that compress. Data Min. Knowl. Discov. 23, 1, 169--214. Google ScholarDigital Library
- Wallace, C. 2005. Statistical and Inductive Inference by Minimum Message Length. Springer. Google ScholarDigital Library
- Wallace, C. S. and Patrick, J. D. 1993. Coding decision trees. Mach. Learn. 11, 1, 7--22. Google ScholarDigital Library
- Wang, C. and Parthasarathy, S. 2006. Summarizing itemset patterns using probabilistic models. In Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). 730--735. Google ScholarDigital Library
- Webb, G. I. 2007. Discovering significant patterns. Mach. Learn. 68, 1, 1--33. Google ScholarDigital Library
- Webb, G. I. 2010. Self-Sufficient itemsets: An approach to screening potentially interesting associations between items. ACM Trans. Knowl. Discov. Data 4, 1, 1--20. Google ScholarDigital Library
- Zaki, M. J. and Ramakrishnan, N. 2005. Reasoning about sets using redescription mining. In Proceedings of the 11th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 364--373. Google ScholarDigital Library
Index Terms
- Summarizing data succinctly with the most informative itemsets
Recommendations
An adaptive approach to mining frequent itemsets efficiently
The mining of frequent itemsets is a fundamental and important task of data mining. To improve the efficiency in mining frequent itemsets, many researchers developed smart data structures to represent the database, and designed divide-and-conquers ...
Discovery of maximum length frequent itemsets
The use of frequent itemsets has been limited by the high computational cost as well as the large number of resulting itemsets. In many real-world scenarios, however, it is often sufficient to mine a small representative subset of frequent itemsets with ...
An Efficient Algorithm for Maintaining Frequent Closed Itemsets over Data Stream
IEA/AIE '09: Proceedings of the 22nd International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems: Next-Generation Applied IntelligenceData mining refers to the process of revealing unknown and potentially useful information from a large database. Frequent itemsets mining is one of the foundational problems in data mining, which is to discover the set of products that purchased ...
Comments