research-article

Summarizing data succinctly with the most informative itemsets

Authors:
Michael Mampaey

University of Antwerp, Antwerp, Belgium

University of Antwerp, Antwerp, Belgium
View Profile

,
Jilles Vreeken

University of Antwerp, Antwerp, Belgium

University of Antwerp, Antwerp, Belgium
View Profile

,
Nikolaj Tatti

University of Antwerp, Antwerp, Belgium

University of Antwerp, Antwerp, Belgium
View Profile

ACM Transactions on Knowledge Discovery from Data Volume 6 Issue 4Article No.: 16pp 1–42https://doi.org/10.1145/2382577.2382580

Published:18 December 2012Publication History

ACM Transactions on Knowledge Discovery from Data

Abstract

Knowledge discovery from data is an inherently iterative process. That is, what we know about the data greatly determines our expectations, and therefore, what results we would find interesting and/or surprising. Given new knowledge about the data, our expectations will change. Hence, in order to avoid redundant results, knowledge discovery algorithms ideally should follow such an iterative updating procedure.

With this in mind, we introduce a well-founded approach for succinctly summarizing data with the most informative itemsets; using a probabilistic maximum entropy model, we iteratively find the itemset that provides us the most novel information—that is, for which the frequency in the data surprises us the most—and in turn we update our model accordingly. As we use the maximum entropy principle to obtain unbiased probabilistic models, and only include those itemsets that are most informative with regard to the current model, the summaries we construct are guaranteed to be both descriptive and nonredundant.

The algorithm that we present, called mtv, can either discover the top-k most informative itemsets, or we can employ either the Bayesian Information Criterion (bic) or the Minimum Description Length (mdl) principle to automatically identify the set of itemsets that together summarize the data well. In other words, our method will “tell you what you need to know” about the data. Importantly, it is a one-phase algorithm: rather than picking itemsets from a user-provided candidate set, itemsets and their supports are mined on-the-fly. To further its applicability, we provide an efficient method to compute the maximum entropy distribution using Quick Inclusion-Exclusion.

Experiments on our method, using synthetic, benchmark, and real data, show that the discovered summaries are succinct, and correctly identify the key patterns in the data. The models they form attain high likelihoods, and inspection shows that they summarize the data well with increasingly specific, yet nonredundant itemsets.

References

Aggarwal, C. C. and Yu, P. S. 1998. A new framework for itemset generation. In Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM, 18--24. Google ScholarDigital Library
Baumgartner, C., Böhm, C., and Baumgartner, D. 2005. Modelling of classification rules on metabolic patterns including machine learning and expert knowledge. Biomed. Inf. 38, 2, 89--98. Google ScholarDigital Library
Boulicaut, J., Bykowski, A., and Rigotti, C. 2003. Free-Sets: A condensed representation of boolean data for the approximation of frequency queries. Data Min. Knowl. Discov. 7, 1, 5--22. Google ScholarDigital Library
Brijs, T., Swinnen, G., Vanhoof, K., and Wets, G. 1999. Using association rules for product assortment decisions: A case study. In Proceedings of the 5th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 254--260. Google ScholarDigital Library
Brin, S., Motwani, R., and Silverstein, C. 1997. Beyond market baskets: Generalizing association rules to correlations. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). ACM, 265--276. Google ScholarDigital Library
Calders, T. and Goethals, B. 2006. Quick inclusion-exclusion. In Proceedings of the ECML PKDD Workshop on Knowledge Discovery in Inductive Databases (KDID). Springer, 86--103. Google ScholarDigital Library
Calders, T. and Goethals, B. 2007. Non-Derivable itemset mining. Data Min. Knowl. Discov. 14, 1, 171--206. Google ScholarDigital Library
Cover, T. M. and Thomas, J. A. 2006. Elements of Information Theory. Wiley-Interscience, New York. Google ScholarDigital Library
Cowell, R. G., Dawid, A. P., Lauritzen, S. L., and Spiegelhalter, D. J. 1999. Probabilistic networks and expert systems. In Statistics for Engineering and Information Science, M. Jordan, S. L. L. nad Jeral F. Lawless, and V. Nair, Eds., Springer.Google Scholar
Csiszár, I. 1975. I-Divergence geometry of probability distributions and minimization problems. Ann. Probab. 3, 1, 146--158.Google ScholarCross Ref
Darroch, J. and Ratcliff, D. 1972. Generalized iterative scaling for log-linear models. Ann. Math. Statis. 43, 5, 1470--1480.Google ScholarCross Ref
De Bie, T. 2011a. An information theoretic framework for data mining. In Proceedings of the 17th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 564--572. Google ScholarDigital Library
De Bie, T. 2011b. Maximum entropy models and subjective interestingness: An application to tiles in binary databases. Data Min. Knowl. Discov. 23, 3, 407--446. Google ScholarDigital Library
Frank, A. and Asuncion, A. 2010. UCI machine learning repository. http://archive.ics.uci.edu/ml.Google Scholar
Gallo, A., Cristianini, N., and De Bie, T. 2007. MINI: Mining informative non-redundant itemsets. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD). Springer, 438--445. Google ScholarDigital Library
Garriga, G. C., Junttila, E., and Mannila, H. 2011. Banded structure in binary matrices. Knowl. Inf. Syst. 28, 1, 197--226. Google ScholarDigital Library
Geerts, F., Goethals, B., and Mielikäinen, T. 2004. Tiling databases. In Proceedings of Discovery Science Conference. 278--289.Google Scholar
Gelman, A., Carlin, J., Stern, H., and Rubin, D. 2004. Bayesian Data Analysis. CRC Press.Google Scholar
Geng, L. and Hamilton, H. J. 2006. Interestingness measures for data mining: A survey. ACM Comput. Surv. 38, 3, 9. Google ScholarDigital Library
Geurts, K., Wets, G., Brijs, T., and Vanhoof, K. 2003. Profiling high frequency accident locations using association rules. In Proceedings of the 82nd Annual Transportation Research Board. 1--18.Google Scholar
Gionis, A., Mannila, H., Mielikäinen, T., and Tsaparas, P. 2007. Assessing data mining results via swap randomization. ACM Trans. Knowl. Discov. Data 1, 3, 167--176. Google ScholarDigital Library
Goethals, B. and Zaki, M. 2004. Frequent itemset mining dataset repository (FIMI). http://fimi.ua.ac.be/.Google Scholar
Grünwald, P. 2005. Minimum description length tutorial. In Advances in Minimum Description Length, P. Grünwald and I. Myung, Eds., MIT Press.Google Scholar
Grünwald, P. 2007. The Minimum Description Length Principle. MIT Press.Google Scholar
Hanhijärvi, S., Ojala, M., Vuokko, N., Puolamäki, K., Tatti, N., and Mannila, H. 2009. Tell me something I don't know: Randomization strategies for iterative data mining. In Proceedings of the 15th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 379--388. Google ScholarDigital Library
Heikinheimo, H., Fortelius, M., Erronen, J., and Mannila, H. 2007. Biogeography of European land mammals shows environmentally distinct and spatially coherent clusters. J. Biogeograph. 34, 6, 1053--1064.Google ScholarCross Ref
Jaroszewicz, S. and Simovici, D. A. 2004. Interestingness of frequent itemsets using bayesian networks as background knowledge. In Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 178--186. Google ScholarDigital Library
Jaynes, E. 1982. On the rationale of maximum-entropy methods. Proc. IEEE 70, 9, 939--952.Google ScholarCross Ref
Kontonasios, K.-N. and De Bie, T. 2010. An information-theoretic approach to finding noisy tiles in binary databases. In Proceedings of the 10th SIAM International Conference on Data Mining (SDM). SIAM, 153--164.Google Scholar
Li, M. and Vitányi, P. 1993. An Introduction to Kolmogorov Complexity and its Applications. Springer. Google ScholarDigital Library
Mampaey, M., Tatti, N., and Vreeken, J. 2011. Tell me what I need to know: Succinctly summarizing data with itemsets. In Proceedings of the 17th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 573--581. Google ScholarDigital Library
Mampaey, M. and Vreeken, J. 2010. Summarising data by clustering items. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD). Springer, 321--336. Google ScholarDigital Library
Mitchell-Jones, A., Amori, G., Bogdanowicz, W., Krystufek, B., Reijnders, P. H., Spitzenberger, F., Stubbe, M., Thissen, J., Vohralik, V., and Zima, J. 1999. The Atlas of European Mammals. Academic Press.Google Scholar
Myllykangas, S., Himberg, J., Böhling, T., Nagy, B., Hollmén, J., and Knuutila, S. 2006. DNA copy number amplification profiling of human neoplasms. Oncogene 25, 55, 7324--7332.Google ScholarCross Ref
Nijssen, S., Guns, T., and De Raedt, L. 2009. Correlated itemset mining in ROC space: A constraint programming approach. In Proceedings of the 15th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Springer, 647--656. Google ScholarDigital Library
Pasquier, N., Bastide, Y., Taouil, R., and Lakhal, L. 1999. Discovering frequent closed itemsets for association rules. In Proceedings of the 7th International Conference on Database Theory (ICDT). ACM, 398--416. Google ScholarDigital Library
Rissanen, J. 1978. Modeling by shortest data description. Automatica 14, 1, 465--471. Google ScholarDigital Library
Schwarz, G. 1978. Estimating the dimension of a model. Ann. Statist. 6, 2, 461--464.Google ScholarCross Ref
Shaffer, J. 1995. Multiple hypothesis testing. Ann. Rev. Psychol. 46, 1, 561--584.Google ScholarCross Ref
Siebes, A. and Kersten, R. 2011. A structure function for transaction data. In Proceedings of the 11th SIAM International Conference on Data Mining (SDM). SIAM, 558--569.Google Scholar
Siebes, A., Vreeken, J., and van Leeuwen, M. 2006. Item sets that compress. In Proceedings of the 6th SIAM International Conference on Data Mining (SDM). SIAM, 393--404.Google ScholarCross Ref
Smets, K. and Vreeken, J. 2012. Slim: Directly mining descriptive patterns. In Proceedings of the 12th SIAM International Conference on Data Mining (SDM). SIAM, 236--247.Google Scholar
Tan, P., Kumar, V., and Srivastava, J. 2002. Selecting the right interestingness measure for association patterns. In Proceedings of the 8th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 32--41. Google ScholarDigital Library
Tatti, N. 2006. Computational complexity of queries based on itemsets. Inf. Process. Lett. 98, 5, 183--187. Google ScholarDigital Library
Tatti, N. 2008. Maximum entropy based significance of itemsets. Knowl. Inf. Syst. 17, 1, 57--77. Google ScholarDigital Library
Tatti, N. 2010. Probably the best itemsets. In Proceedings of the 16th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 293--302. Google ScholarDigital Library
Tatti, N. and Heikinheimo, H. 2008. Decomposable families of itemsets. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD). Springer, 472--487. Google ScholarDigital Library
Tatti, N. and Mampaey, M. 2010. Using background knowledge to rank itemsets. Data Min. Knowl. Discov. 21, 2, 293--309. Google ScholarDigital Library
Van den Bulcke, T., Vanden Broucke, P., Van Hoof, V., Wouters, K., Vanden Broucke, S., Smits, G., Smits, E., Proesmans, S., Van Genechten, T., and Eyskens, F. 2011. Data mining methods for classification of Medium-Chain Acyl-CoA dehydrogenase deficiency (MCADD) using non-derivatized tandem MS neonatal screening data. J. Biomed. Inf. 44, 2, 319--325. Google ScholarDigital Library
Vreeken, J., van Leeuwen, M., and Siebes, A. 2007. Preserving privacy through data generation. In Proceedings of the 7th IEEE International Conference on Data Mining (ICDM). IEEE, 685--690. Google ScholarDigital Library
Vreeken, J., van Leeuwen, M., and Siebes, A. 2011. Krimp: Mining itemsets that compress. Data Min. Knowl. Discov. 23, 1, 169--214. Google ScholarDigital Library
Wallace, C. 2005. Statistical and Inductive Inference by Minimum Message Length. Springer. Google ScholarDigital Library
Wallace, C. S. and Patrick, J. D. 1993. Coding decision trees. Mach. Learn. 11, 1, 7--22. Google ScholarDigital Library
Wang, C. and Parthasarathy, S. 2006. Summarizing itemset patterns using probabilistic models. In Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). 730--735. Google ScholarDigital Library
Webb, G. I. 2007. Discovering significant patterns. Mach. Learn. 68, 1, 1--33. Google ScholarDigital Library
Webb, G. I. 2010. Self-Sufficient itemsets: An approach to screening potentially interesting associations between items. ACM Trans. Knowl. Discov. Data 4, 1, 1--20. Google ScholarDigital Library
Zaki, M. J. and Ramakrishnan, N. 2005. Reasoning about sets using redescription mining. In Proceedings of the 11th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 364--373. Google ScholarDigital Library

Index Terms

Summarizing data succinctly with the most informative itemsets
1. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

An adaptive approach to mining frequent itemsets efficiently

The mining of frequent itemsets is a fundamental and important task of data mining. To improve the efficiency in mining frequent itemsets, many researchers developed smart data structures to represent the database, and designed divide-and-conquers ...
Read More
Discovery of maximum length frequent itemsets

The use of frequent itemsets has been limited by the high computational cost as well as the large number of resulting itemsets. In many real-world scenarios, however, it is often sufficient to mine a small representative subset of frequent itemsets with ...
Read More
An Efficient Algorithm for Maintaining Frequent Closed Itemsets over Data Stream
IEA/AIE '09: Proceedings of the 22nd International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems: Next-Generation Applied Intelligence

Data mining refers to the process of revealing unknown and potentially useful information from a large database. Frequent itemsets mining is one of the foundational problems in data mining, which is to discover the set of products that purchased ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Knowledge Discovery from Data Volume 6, Issue 4
Special Issue on the Best of SIGKDD 2011
December 2012
141 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/2382577
Issue’s Table of Contents

Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 18 December 2012
- Accepted: 1 May 2012
- Revised: 1 April 2012
- Received: 1 July 2011
Published in tkdd Volume 6, Issue 4

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
BIC
Frequent itemsets
MDL
inclusion-exclusion
maximum entropy
minimum description length principle
pattern sets
summarization
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 39
  Total Citations
  View Citations
- 583
  Total Downloads
- Downloads (Last 12 months)19
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Summarizing data succinctly with the most informative itemsets

ACM Transactions on Knowledge Discovery from Data

Abstract

References

Cited By

Index Terms

Recommendations

An adaptive approach to mining frequent itemsets efficiently

Discovery of maximum length frequent itemsets

An Efficient Algorithm for Maintaining Frequent Closed Itemsets over Data Stream

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Summarizing data succinctly with the most informative itemsets

ACM Transactions on Knowledge Discovery from Data

Abstract

References

Cited By

Index Terms

Recommendations

An adaptive approach to mining frequent itemsets efficiently

Discovery of maximum length frequent itemsets

An Efficient Algorithm for Maintaining Frequent Closed Itemsets over Data Stream

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media