skip to main content
article
Free Access

Beyond market baskets: generalizing association rules to correlations

Authors Info & Claims
Published:01 June 1997Publication History
Skip Abstract Section

Abstract

One of the most well-studied problems in data mining is mining for association rules in market basket data. Association rules, whose significance is measured via support and confidence, are intended to identify rules of the type, “A customer purchasing item A often also purchases item B.” Motivated by the goal of generalizing beyond market baskets and the association rules used with them, we develop the notion of mining rules that identify correlations (generalizing associations), and we consider both the absence and presence of items as a basis for generating rules. We propose measuring significance of associations via the chi-squared test for correlation from classical statistics. This leads to a measure that is upward closed in the itemset lattice, enabling us to reduce the mining problem to the search for a border between correlated and uncorrelated itemsets in the lattice. We develop pruning strategies and devise an efficient algorithm for the resulting problem. We demonstrate its effectiveness by testing it on census data and finding term dependence in a corpus of text documents, as well as on synthetic data.

References

  1. 1 R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Sharer, and R. Srikant. The Quest Data Mining System. In Proceedings of the Second International Conference on Knowledge Discovery in Databases and Data, August 1996.Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. 2 R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the A CM SIGMOD International Conference on the Management of Data, pages 207-216, May 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. 3 R. Agrawal, T. Imielinski, and A. Swami. Database mining: a performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5:914-925, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. 4 R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A.I. Verkamo. Fast Discovery of Association Rules. In Fayyad et al {9}, pages 307-328, 1996. Google ScholarGoogle Scholar
  5. 5 R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the ~Oth International Conference on Very Large Data Bases, pages 487-499, September 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. 6 A. Agresti. A survey of exact inference for contingency tables. Statistical Science, 7:131-177, 1992.Google ScholarGoogle ScholarCross RefCross Ref
  7. 7 M. Dietzfelbinger, A. Karlin, K. Mehlhorn, F. Meyer auf der Heide, H. Rohnert, and H. Tarjan. Dynamic perfect hashing: Upper and lower bounds. In Proceedings of the 18th IEEE Symposium on Foundations of Computer Science, pages 524-531, 1988.Google ScholarGoogle Scholar
  8. 8 Ft. Ewald. Keynote address. The 3rd International Conference on Information and Knowledge Management, 1994.Google ScholarGoogle Scholar
  9. 9 U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthrusamy. Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. 10 M. Fredman, J. Koml6s, and E. Szemer6di. Storing a sparse table with O(1) worst case access time. Journal of the A CM, 31(3):538-544, 1984. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. 11 T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Mining Optimized Association Rules for Numeric Attributes. In Proceedings of the Fifteenth A CM Symposium on Principles of Database Systems, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. 12 T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Mining optimized association rules for numeric data. In Prvceedings of the A CM SIGMOD International Conference on Management of Data, pages 13-24, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. 13 J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: a relational aggregation operator generalizing groupby, cross-tab, and sub-totals. Microsoft Technical Report MSR-TR-95-22, 1995.Google ScholarGoogle Scholar
  14. 14 D. Gunopulos, H. Mannila, and S. Saluja. Discovering all most specific sentences by randomized algorithms. In Proceedings of the 6th International Conference on Database Theory, to appear, January 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. 15 J. Hun and Y. Fu. Discovery of multiple-level association rules from large databases. In Proceedings of the 21st International Conference on Very Large Data Bases, pages 420-431, September 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. 16 M. Houtsma and A. Swami. Set-oriented mining of association rules. In Proceedings of the International Conference on Data Engineering, pages 25-34, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. 17 M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting rules from large sets of discovered association rules. In Proceedings of the 3rd International Conference on Information and Knowledge Management, pages 401-407, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. 18 H.O. Lancaster. The Chi-squared Distribution. John Wiley & Sons, New York, 1969.Google ScholarGoogle Scholar
  19. 19 P.S. de Laplace. Oeuvres compl~tes de Laplace publides sous les auspices de i'Acad~mie des Sciences par M.M. les secrdtaires perp~tuels. Gauthier-Villar, Paris, 1878/1912.Google ScholarGoogle Scholar
  20. 20 H. Mannila, H. Toivonen, and A. Inkeri Verkamo. Efficient algorithms for discovering association rules. In Proceedings of the AAAI Workshop on Knowledge Discovery in Databases, pages 144-155, July 1994.Google ScholarGoogle Scholar
  21. 21 A. de Moivre. Approximatio ad summam terminorum binomii (a + b)n in seriem expansi. Supplement to Miscellanea Analytica, London, 1733.Google ScholarGoogle Scholar
  22. 22 D.S. Moore. Tests of chi-squared type. In: R.B. D'Agostino and M.A. Stephens (eds), Goodness-of-Fit Techniques, Marcel Dekker, New York, 1986, pp. 63-95.Google ScholarGoogle Scholar
  23. 23 F. Mosteller and D. Wallace. Inference and Disputed Authorship: The Federalists. Addison-Wesley, 1964.Google ScholarGoogle Scholar
  24. 24 J. S. Park, M. S. Chen, and P. S. Yu. An effective hash based algorithm for mining association rules. In Proceedings of the A CM SIGMOD International Con/erence on Management of Data, pages 175-186, May 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. 25 K. Pearson. On a criterion that a given system of deviations from the probable in the case of correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philos. Mag., 5:157-175, 1900.Google ScholarGoogle ScholarCross RefCross Ref
  26. 26 G. Piatetsky and W. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. 27 A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the International Conference on Very Large Data Bases, pages 432-444, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. 28 R. Srikant and R. Agrawai. Mining generalized association rules. In Proceeding8 of the 21st International Conference on Very Larye Data Bases, pages 407-419, September 1995. Google ScholarGoogle Scholar
  29. 29 H. Toivonen. Samplinglarge databases for finding association rules. In Proceedings of the 22nd International Conference on Very Larye Data Bases, pages 134-145, September 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. 30 J. Widom. Research problems in data warehousing. In Proceedings of the ~th Conference on Information and Knowledge Management, November 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Beyond market baskets: generalizing association rules to correlations

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          • Published in

            cover image ACM SIGMOD Record
            ACM SIGMOD Record  Volume 26, Issue 2
            June 1997
            583 pages
            ISSN:0163-5808
            DOI:10.1145/253262
            Issue’s Table of Contents
            • cover image ACM Conferences
              SIGMOD '97: Proceedings of the 1997 ACM SIGMOD international conference on Management of data
              June 1997
              594 pages
              ISBN:0897919114
              DOI:10.1145/253260

            Copyright © 1997 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 1 June 1997

            Check for updates

            Qualifiers

            • article

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader