skip to main content
research-article

Summarizing data succinctly with the most informative itemsets

Published:18 December 2012Publication History
Skip Abstract Section

Abstract

Knowledge discovery from data is an inherently iterative process. That is, what we know about the data greatly determines our expectations, and therefore, what results we would find interesting and/or surprising. Given new knowledge about the data, our expectations will change. Hence, in order to avoid redundant results, knowledge discovery algorithms ideally should follow such an iterative updating procedure.

With this in mind, we introduce a well-founded approach for succinctly summarizing data with the most informative itemsets; using a probabilistic maximum entropy model, we iteratively find the itemset that provides us the most novel information—that is, for which the frequency in the data surprises us the most—and in turn we update our model accordingly. As we use the maximum entropy principle to obtain unbiased probabilistic models, and only include those itemsets that are most informative with regard to the current model, the summaries we construct are guaranteed to be both descriptive and nonredundant.

The algorithm that we present, called mtv, can either discover the top-k most informative itemsets, or we can employ either the Bayesian Information Criterion (bic) or the Minimum Description Length (mdl) principle to automatically identify the set of itemsets that together summarize the data well. In other words, our method will “tell you what you need to know” about the data. Importantly, it is a one-phase algorithm: rather than picking itemsets from a user-provided candidate set, itemsets and their supports are mined on-the-fly. To further its applicability, we provide an efficient method to compute the maximum entropy distribution using Quick Inclusion-Exclusion.

Experiments on our method, using synthetic, benchmark, and real data, show that the discovered summaries are succinct, and correctly identify the key patterns in the data. The models they form attain high likelihoods, and inspection shows that they summarize the data well with increasingly specific, yet nonredundant itemsets.

References

  1. Aggarwal, C. C. and Yu, P. S. 1998. A new framework for itemset generation. In Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM, 18--24. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Baumgartner, C., Böhm, C., and Baumgartner, D. 2005. Modelling of classification rules on metabolic patterns including machine learning and expert knowledge. Biomed. Inf. 38, 2, 89--98. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Boulicaut, J., Bykowski, A., and Rigotti, C. 2003. Free-Sets: A condensed representation of boolean data for the approximation of frequency queries. Data Min. Knowl. Discov. 7, 1, 5--22. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Brijs, T., Swinnen, G., Vanhoof, K., and Wets, G. 1999. Using association rules for product assortment decisions: A case study. In Proceedings of the 5th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 254--260. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Brin, S., Motwani, R., and Silverstein, C. 1997. Beyond market baskets: Generalizing association rules to correlations. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). ACM, 265--276. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Calders, T. and Goethals, B. 2006. Quick inclusion-exclusion. In Proceedings of the ECML PKDD Workshop on Knowledge Discovery in Inductive Databases (KDID). Springer, 86--103. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Calders, T. and Goethals, B. 2007. Non-Derivable itemset mining. Data Min. Knowl. Discov. 14, 1, 171--206. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Cover, T. M. and Thomas, J. A. 2006. Elements of Information Theory. Wiley-Interscience, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Cowell, R. G., Dawid, A. P., Lauritzen, S. L., and Spiegelhalter, D. J. 1999. Probabilistic networks and expert systems. In Statistics for Engineering and Information Science, M. Jordan, S. L. L. nad Jeral F. Lawless, and V. Nair, Eds., Springer.Google ScholarGoogle Scholar
  10. Csiszár, I. 1975. I-Divergence geometry of probability distributions and minimization problems. Ann. Probab. 3, 1, 146--158.Google ScholarGoogle ScholarCross RefCross Ref
  11. Darroch, J. and Ratcliff, D. 1972. Generalized iterative scaling for log-linear models. Ann. Math. Statis. 43, 5, 1470--1480.Google ScholarGoogle ScholarCross RefCross Ref
  12. De Bie, T. 2011a. An information theoretic framework for data mining. In Proceedings of the 17th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 564--572. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. De Bie, T. 2011b. Maximum entropy models and subjective interestingness: An application to tiles in binary databases. Data Min. Knowl. Discov. 23, 3, 407--446. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Frank, A. and Asuncion, A. 2010. UCI machine learning repository. http://archive.ics.uci.edu/ml.Google ScholarGoogle Scholar
  15. Gallo, A., Cristianini, N., and De Bie, T. 2007. MINI: Mining informative non-redundant itemsets. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD). Springer, 438--445. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Garriga, G. C., Junttila, E., and Mannila, H. 2011. Banded structure in binary matrices. Knowl. Inf. Syst. 28, 1, 197--226. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Geerts, F., Goethals, B., and Mielikäinen, T. 2004. Tiling databases. In Proceedings of Discovery Science Conference. 278--289.Google ScholarGoogle Scholar
  18. Gelman, A., Carlin, J., Stern, H., and Rubin, D. 2004. Bayesian Data Analysis. CRC Press.Google ScholarGoogle Scholar
  19. Geng, L. and Hamilton, H. J. 2006. Interestingness measures for data mining: A survey. ACM Comput. Surv. 38, 3, 9. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Geurts, K., Wets, G., Brijs, T., and Vanhoof, K. 2003. Profiling high frequency accident locations using association rules. In Proceedings of the 82nd Annual Transportation Research Board. 1--18.Google ScholarGoogle Scholar
  21. Gionis, A., Mannila, H., Mielikäinen, T., and Tsaparas, P. 2007. Assessing data mining results via swap randomization. ACM Trans. Knowl. Discov. Data 1, 3, 167--176. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Goethals, B. and Zaki, M. 2004. Frequent itemset mining dataset repository (FIMI). http://fimi.ua.ac.be/.Google ScholarGoogle Scholar
  23. Grünwald, P. 2005. Minimum description length tutorial. In Advances in Minimum Description Length, P. Grünwald and I. Myung, Eds., MIT Press.Google ScholarGoogle Scholar
  24. Grünwald, P. 2007. The Minimum Description Length Principle. MIT Press.Google ScholarGoogle Scholar
  25. Hanhijärvi, S., Ojala, M., Vuokko, N., Puolamäki, K., Tatti, N., and Mannila, H. 2009. Tell me something I don't know: Randomization strategies for iterative data mining. In Proceedings of the 15th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 379--388. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Heikinheimo, H., Fortelius, M., Erronen, J., and Mannila, H. 2007. Biogeography of European land mammals shows environmentally distinct and spatially coherent clusters. J. Biogeograph. 34, 6, 1053--1064.Google ScholarGoogle ScholarCross RefCross Ref
  27. Jaroszewicz, S. and Simovici, D. A. 2004. Interestingness of frequent itemsets using bayesian networks as background knowledge. In Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 178--186. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Jaynes, E. 1982. On the rationale of maximum-entropy methods. Proc. IEEE 70, 9, 939--952.Google ScholarGoogle ScholarCross RefCross Ref
  29. Kontonasios, K.-N. and De Bie, T. 2010. An information-theoretic approach to finding noisy tiles in binary databases. In Proceedings of the 10th SIAM International Conference on Data Mining (SDM). SIAM, 153--164.Google ScholarGoogle Scholar
  30. Li, M. and Vitányi, P. 1993. An Introduction to Kolmogorov Complexity and its Applications. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Mampaey, M., Tatti, N., and Vreeken, J. 2011. Tell me what I need to know: Succinctly summarizing data with itemsets. In Proceedings of the 17th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 573--581. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Mampaey, M. and Vreeken, J. 2010. Summarising data by clustering items. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD). Springer, 321--336. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Mitchell-Jones, A., Amori, G., Bogdanowicz, W., Krystufek, B., Reijnders, P. H., Spitzenberger, F., Stubbe, M., Thissen, J., Vohralik, V., and Zima, J. 1999. The Atlas of European Mammals. Academic Press.Google ScholarGoogle Scholar
  34. Myllykangas, S., Himberg, J., Böhling, T., Nagy, B., Hollmén, J., and Knuutila, S. 2006. DNA copy number amplification profiling of human neoplasms. Oncogene 25, 55, 7324--7332.Google ScholarGoogle ScholarCross RefCross Ref
  35. Nijssen, S., Guns, T., and De Raedt, L. 2009. Correlated itemset mining in ROC space: A constraint programming approach. In Proceedings of the 15th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Springer, 647--656. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Pasquier, N., Bastide, Y., Taouil, R., and Lakhal, L. 1999. Discovering frequent closed itemsets for association rules. In Proceedings of the 7th International Conference on Database Theory (ICDT). ACM, 398--416. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Rissanen, J. 1978. Modeling by shortest data description. Automatica 14, 1, 465--471. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Schwarz, G. 1978. Estimating the dimension of a model. Ann. Statist. 6, 2, 461--464.Google ScholarGoogle ScholarCross RefCross Ref
  39. Shaffer, J. 1995. Multiple hypothesis testing. Ann. Rev. Psychol. 46, 1, 561--584.Google ScholarGoogle ScholarCross RefCross Ref
  40. Siebes, A. and Kersten, R. 2011. A structure function for transaction data. In Proceedings of the 11th SIAM International Conference on Data Mining (SDM). SIAM, 558--569.Google ScholarGoogle Scholar
  41. Siebes, A., Vreeken, J., and van Leeuwen, M. 2006. Item sets that compress. In Proceedings of the 6th SIAM International Conference on Data Mining (SDM). SIAM, 393--404.Google ScholarGoogle ScholarCross RefCross Ref
  42. Smets, K. and Vreeken, J. 2012. Slim: Directly mining descriptive patterns. In Proceedings of the 12th SIAM International Conference on Data Mining (SDM). SIAM, 236--247.Google ScholarGoogle Scholar
  43. Tan, P., Kumar, V., and Srivastava, J. 2002. Selecting the right interestingness measure for association patterns. In Proceedings of the 8th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 32--41. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Tatti, N. 2006. Computational complexity of queries based on itemsets. Inf. Process. Lett. 98, 5, 183--187. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Tatti, N. 2008. Maximum entropy based significance of itemsets. Knowl. Inf. Syst. 17, 1, 57--77. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Tatti, N. 2010. Probably the best itemsets. In Proceedings of the 16th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 293--302. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Tatti, N. and Heikinheimo, H. 2008. Decomposable families of itemsets. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD). Springer, 472--487. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Tatti, N. and Mampaey, M. 2010. Using background knowledge to rank itemsets. Data Min. Knowl. Discov. 21, 2, 293--309. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Van den Bulcke, T., Vanden Broucke, P., Van Hoof, V., Wouters, K., Vanden Broucke, S., Smits, G., Smits, E., Proesmans, S., Van Genechten, T., and Eyskens, F. 2011. Data mining methods for classification of Medium-Chain Acyl-CoA dehydrogenase deficiency (MCADD) using non-derivatized tandem MS neonatal screening data. J. Biomed. Inf. 44, 2, 319--325. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Vreeken, J., van Leeuwen, M., and Siebes, A. 2007. Preserving privacy through data generation. In Proceedings of the 7th IEEE International Conference on Data Mining (ICDM). IEEE, 685--690. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Vreeken, J., van Leeuwen, M., and Siebes, A. 2011. Krimp: Mining itemsets that compress. Data Min. Knowl. Discov. 23, 1, 169--214. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Wallace, C. 2005. Statistical and Inductive Inference by Minimum Message Length. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Wallace, C. S. and Patrick, J. D. 1993. Coding decision trees. Mach. Learn. 11, 1, 7--22. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Wang, C. and Parthasarathy, S. 2006. Summarizing itemset patterns using probabilistic models. In Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). 730--735. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Webb, G. I. 2007. Discovering significant patterns. Mach. Learn. 68, 1, 1--33. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. Webb, G. I. 2010. Self-Sufficient itemsets: An approach to screening potentially interesting associations between items. ACM Trans. Knowl. Discov. Data 4, 1, 1--20. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. Zaki, M. J. and Ramakrishnan, N. 2005. Reasoning about sets using redescription mining. In Proceedings of the 11th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 364--373. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Summarizing data succinctly with the most informative itemsets

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image ACM Transactions on Knowledge Discovery from Data
      ACM Transactions on Knowledge Discovery from Data  Volume 6, Issue 4
      Special Issue on the Best of SIGKDD 2011
      December 2012
      141 pages
      ISSN:1556-4681
      EISSN:1556-472X
      DOI:10.1145/2382577
      Issue’s Table of Contents

      Copyright © 2012 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 18 December 2012
      • Accepted: 1 May 2012
      • Revised: 1 April 2012
      • Received: 1 July 2011
      Published in tkdd Volume 6, Issue 4

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article
      • Research
      • Refereed

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader