Abstract
One of the most well-studied problems in data mining is mining for association rules in market basket data. Association rules, whose significance is measured via support and confidence, are intended to identify rules of the type, “A customer purchasing item A often also purchases item B.” Motivated by the goal of generalizing beyond market baskets and the association rules used with them, we develop the notion of mining rules that identify correlations (generalizing associations), and we consider both the absence and presence of items as a basis for generating rules. We propose measuring significance of associations via the chi-squared test for correlation from classical statistics. This leads to a measure that is upward closed in the itemset lattice, enabling us to reduce the mining problem to the search for a border between correlated and uncorrelated itemsets in the lattice. We develop pruning strategies and devise an efficient algorithm for the resulting problem. We demonstrate its effectiveness by testing it on census data and finding term dependence in a corpus of text documents, as well as on synthetic data.
- 1 R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Sharer, and R. Srikant. The Quest Data Mining System. In Proceedings of the Second International Conference on Knowledge Discovery in Databases and Data, August 1996.Google ScholarDigital Library
- 2 R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the A CM SIGMOD International Conference on the Management of Data, pages 207-216, May 1993. Google ScholarDigital Library
- 3 R. Agrawal, T. Imielinski, and A. Swami. Database mining: a performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5:914-925, 1993. Google ScholarDigital Library
- 4 R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A.I. Verkamo. Fast Discovery of Association Rules. In Fayyad et al {9}, pages 307-328, 1996. Google Scholar
- 5 R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the ~Oth International Conference on Very Large Data Bases, pages 487-499, September 1994. Google ScholarDigital Library
- 6 A. Agresti. A survey of exact inference for contingency tables. Statistical Science, 7:131-177, 1992.Google ScholarCross Ref
- 7 M. Dietzfelbinger, A. Karlin, K. Mehlhorn, F. Meyer auf der Heide, H. Rohnert, and H. Tarjan. Dynamic perfect hashing: Upper and lower bounds. In Proceedings of the 18th IEEE Symposium on Foundations of Computer Science, pages 524-531, 1988.Google Scholar
- 8 Ft. Ewald. Keynote address. The 3rd International Conference on Information and Knowledge Management, 1994.Google Scholar
- 9 U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthrusamy. Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA, 1996. Google ScholarDigital Library
- 10 M. Fredman, J. Koml6s, and E. Szemer6di. Storing a sparse table with O(1) worst case access time. Journal of the A CM, 31(3):538-544, 1984. Google ScholarDigital Library
- 11 T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Mining Optimized Association Rules for Numeric Attributes. In Proceedings of the Fifteenth A CM Symposium on Principles of Database Systems, 1996. Google ScholarDigital Library
- 12 T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Mining optimized association rules for numeric data. In Prvceedings of the A CM SIGMOD International Conference on Management of Data, pages 13-24, 1996. Google ScholarDigital Library
- 13 J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: a relational aggregation operator generalizing groupby, cross-tab, and sub-totals. Microsoft Technical Report MSR-TR-95-22, 1995.Google Scholar
- 14 D. Gunopulos, H. Mannila, and S. Saluja. Discovering all most specific sentences by randomized algorithms. In Proceedings of the 6th International Conference on Database Theory, to appear, January 1997. Google ScholarDigital Library
- 15 J. Hun and Y. Fu. Discovery of multiple-level association rules from large databases. In Proceedings of the 21st International Conference on Very Large Data Bases, pages 420-431, September 1995. Google ScholarDigital Library
- 16 M. Houtsma and A. Swami. Set-oriented mining of association rules. In Proceedings of the International Conference on Data Engineering, pages 25-34, 1995. Google ScholarDigital Library
- 17 M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting rules from large sets of discovered association rules. In Proceedings of the 3rd International Conference on Information and Knowledge Management, pages 401-407, 1994. Google ScholarDigital Library
- 18 H.O. Lancaster. The Chi-squared Distribution. John Wiley & Sons, New York, 1969.Google Scholar
- 19 P.S. de Laplace. Oeuvres compl~tes de Laplace publides sous les auspices de i'Acad~mie des Sciences par M.M. les secrdtaires perp~tuels. Gauthier-Villar, Paris, 1878/1912.Google Scholar
- 20 H. Mannila, H. Toivonen, and A. Inkeri Verkamo. Efficient algorithms for discovering association rules. In Proceedings of the AAAI Workshop on Knowledge Discovery in Databases, pages 144-155, July 1994.Google Scholar
- 21 A. de Moivre. Approximatio ad summam terminorum binomii (a + b)n in seriem expansi. Supplement to Miscellanea Analytica, London, 1733.Google Scholar
- 22 D.S. Moore. Tests of chi-squared type. In: R.B. D'Agostino and M.A. Stephens (eds), Goodness-of-Fit Techniques, Marcel Dekker, New York, 1986, pp. 63-95.Google Scholar
- 23 F. Mosteller and D. Wallace. Inference and Disputed Authorship: The Federalists. Addison-Wesley, 1964.Google Scholar
- 24 J. S. Park, M. S. Chen, and P. S. Yu. An effective hash based algorithm for mining association rules. In Proceedings of the A CM SIGMOD International Con/erence on Management of Data, pages 175-186, May 1995. Google ScholarDigital Library
- 25 K. Pearson. On a criterion that a given system of deviations from the probable in the case of correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philos. Mag., 5:157-175, 1900.Google ScholarCross Ref
- 26 G. Piatetsky and W. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991. Google ScholarDigital Library
- 27 A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the International Conference on Very Large Data Bases, pages 432-444, 1995. Google ScholarDigital Library
- 28 R. Srikant and R. Agrawai. Mining generalized association rules. In Proceeding8 of the 21st International Conference on Very Larye Data Bases, pages 407-419, September 1995. Google Scholar
- 29 H. Toivonen. Samplinglarge databases for finding association rules. In Proceedings of the 22nd International Conference on Very Larye Data Bases, pages 134-145, September 1996. Google ScholarDigital Library
- 30 J. Widom. Research problems in data warehousing. In Proceedings of the ~th Conference on Information and Knowledge Management, November 1995. Google ScholarDigital Library
Index Terms
- Beyond market baskets: generalizing association rules to correlations
Recommendations
Beyond market baskets: generalizing association rules to correlations
SIGMOD '97: Proceedings of the 1997 ACM SIGMOD international conference on Management of dataOne of the most well-studied problems in data mining is mining for association rules in market basket data. Association rules, whose significance is measured via support and confidence, are intended to identify rules of the type, “A customer purchasing ...
Beyond Market Baskets: Generalizing Association Rules to Dependence Rules
One of the more well-studied problems in data mining is the search for association rules in market basket data. Association rules are intended to identify patterns of the type: “A customer purchasing item A often also purchases item B.” Motivated partly by ...
Beyond intratransaction association analysis: mining multidimensional intertransaction association rules
In this paper, we extend the scope of mining association rules from traditional single-dimensional intratransaction associations, to multidimensional intertransaction associations. Intratransaction associations are the associations among items with the ...
Comments