Efficient mining of generalized association rules with non-uniform minimum support

https://doi.org/10.1016/j.datak.2006.07.002Get rights and content

Abstract

Mining generalized association rules between items in the presence of taxonomies has been recognized as an important model in data mining. Earlier work on generalized association rules confined the minimum supports to be uniformly specified for all items or for items within the same taxonomy level. This constraint on minimum support would restrain an expert from discovering some deviations or exceptions that are more interesting but much less supported than general trends. In this paper, we extended the scope of mining generalized association rules in the presence of taxonomies to allow any form of user-specified multiple minimum supports. We discuss the problems of using classic Apriori itemset generation and presented two algorithms, MMS_Cumulate and MMS_Stratify, for discovering the generalized frequent itemsets. Empirical evaluation showed that these two algorithms are very effective and have good linear scale-up characteristics.

Introduction

Mining association rules from a large business database, such as transaction records, have been an important topic in the area of data mining. This problem is motivated by applications known as market basket analysis to find relationships between items purchased by customers, that is, what kinds of products tend to be purchased together [1]. Such information is useful in many aspects of market management, such as store layout planning, target marketing, and understanding customer behavior.

An association rule is an expression of the form A  B, where A and B are sets of items. Such a rule reveals that transactions in the database containing items in A tend to contain items in B, and the probability, measured as the fraction of transactions containing A also containing B, is called the confidence of the rule. The support of the rule is the fraction of the transactions that contain all items in both A and B.

For example, an association rule,DesktopPCInk-jetprinter(sup=30%,conf=60%),says that 30% (support) of customers purchase both Desktop PC and Ink-jet printer together, and 60% (confidence) of customers who purchase Desktop PC also purchase Ink-jet printer.

For an association rule to hold, the support and the confidence of the rule should satisfy a user-specified minimum support called minsup and minimum confidence called minconf, respectively. The problem of mining association rules is to discover all association rules that satisfy minsup and minconf. This task is usually decomposed into two steps:

  • 1.

    Frequent itemset generation: generate all itemsets that exceed the minsup.

  • 2.

    Rule construction: construct all association rules that satisfy minconf from the frequent itemsets in Step 1.

Intuitively, to discover frequent itemsets, each transaction has to be inspected to generate the supports of all combinations of items, which, however, will suffer for lots of I/O operations as well as computations. Therefore, most early work was focused on deriving efficient algorithms for finding frequent itemsets [2], [4], [9], [10]. The most well-known is Apriori [2], which relies on the observation that an itemset can be frequent if and only if all of its subsets are frequent and thus a level-wise inspection proceeding from frequent 1-itemsets to the maximal frequent itemset can avoid large numbers of I/O accesses.

Despite the great achievement in improving the efficiency of mining algorithms, the existing association rule models used in all of these studies incur some problems. First, in many applications, there are taxonomies (hierarchies), explicitly or implicitly, over the items. It may be more useful to find association at different levels of the taxonomy than only at the primitive concept level [5], [11].

Second, the frequencies of items are not uniform. Some items occur very frequently in the transactions while others rarely appear. A uniform minimum support assumption would hinder the discovery of some deviations or exceptions that are more interesting but much less supported than general trends. Furthermore, a single minimum support also ignores the fact that support requirement varies at different levels when mining association rules in the presence of taxonomy.

These observations lead us to the investigation of mining generalized association rules across different levels of taxonomy with non-uniform minimum supports. The methods we propose not only can discover associations that span different hierarchy levels but also have high potential in producing rare but informative item rules.

The remainder of this paper is organized as follows. The problem of mining generalized association rules with multiple minimum supports is formalized in Section 2. In Section 3, we explain the proposed algorithms for finding frequent itemsets. The evaluation of the proposed algorithms on IBM synthetic data and Microsoft foodmart2000 is described in Section 4. A review of related work is given in Section 5. Finally, our conclusions are stated in Section 6.

Section snippets

Problem statement

In this section, the basic concept behind generalized association rules in [11] is introduced. We then extend this model to incorporate multiple minimum supports.

Algorithm basics

Intuitively, the process of mining generalized association rules with multiple minimum supports is the same as that used in traditional association rules mining: First all frequent itemsets are discovered, and then from these itemsets rules that have large confidence are generated. Since the second phase is straightforward after all frequent itemsets have been found, we concentrate on algorithms for finding all frequent itemsets. We propose two methods, called MMS_Cumulate and MMS_Stratify,

Experiments

In this section, we evaluate the performance of algorithms, MMS_Cumulate and MMS_Stratify, using two synthetic datasets, named Synth1 and Synth2, generated by the IBM data generator [2], [11], and Microsoft foodmart2000 database (Foodmart for short), a sample supermarket data warehouse provided in MS SQL 2000. The parameter settings are shown in Table 8. The data for Foodmart is drawn from sales_fact_1997, sales_fact_1998 and sales_fact_dec_1998 in foodmart2000. The corresponding item taxonomy

Related work

The problem of mining association rules in the presence of taxonomy information was addressed first in [5], [11]. In [11], the problem is named “mining generalized association rules”, which aims to find associations between items at any level of the taxonomy under the minsup and minconf constraints. Their work, however, did not recognize the varied support requirements inherent in items at different hierarchy levels.

In [5], the problem was stated somewhat different from that in [11]. They

Conclusions

We have investigated in this paper the problem of mining generalized association rules in the presence of taxonomy and multiple minimum support specification. The classic Apriori itemset generation works in the presence of taxonomy but fails in the case of non-uniform minimum supports. We presented two algorithms, MMS_Cumulate and MMS_Stratify, for discovering these generalized frequent itemsets. Empirical evaluation showed that these two algorithms are very effective and have good linear

Ming-Cheng Tseng is a Ph.D. student in the Department of Information Engineering at the I-Shou University, Taiwan. He received the B.S. and M.S. degrees in Information Management from the I-Shou University, Taiwan, in 1999 and 2002, respectively. His research interests include data mining, data warehousing, database systems, fuzzy theory and information systems.

References (11)

  • R. Srikant et al.

    Mining generalized association rules

    Future Generation Computer Systems

    (1997)
  • R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: Proc. 1993...
  • R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proc. 20th Int. Conf. on Very Large Data...
  • S. Brin, R. Motwani, C. Silverstein, Beyond market baskets: generalizing association rules to correlations, in: Proc....
  • S. Brin, R. Motwani, J.D. Ullman, S. Tsur, Dynamic itemset counting and implication rules for market-basket data, in:...
There are more references available in the full text version of this article.

Cited by (36)

  • Mining frequent weighted utility itemsets in hierarchical quantitative databases

    2022, Knowledge-Based Systems
    Citation Excerpt :

    However, this approach is time-consuming, since it is also an Apriori-based algorithm, and thus the algorithm may scan the database numerous times. Tseng and Lin [30] adapted the FP-tree structure of the FP-Growth algorithm for mining FIs with multiple minimum support thresholds. This approach scans the database only twice.

  • Change detection model for sequential cause-and-effect relationships

    2018, Decision Support Systems
    Citation Excerpt :

    In the database, there are many tabs such as account, category, customer, days_check, and employee. Numerous past studies have used this dataset to verify their research models [27,33,34,40]. Finally, there are 1885 and 2879 sequences in St and St +l, in which t = 1997 and t + l = 1998, respectively (Table 1).

  • Mining of frequent patterns with multiple minimum supports

    2017, Engineering Applications of Artificial Intelligence
    Citation Excerpt :

    To confront the “rare item problem” which has been presented above, the problem of FPM involving rare items using multiple minimum support thresholds has been studied. Up to now, several algorithms such as MSApriori (Liu et al., 1999), MMS_Cumulate and MMS_Stratify (Tseng and Lin, 2007), CFP-growth (Hu and Chen, 2006), CFP-growth++ (Kiran and Reddy, 2011), REMMAR (Liu et al., 2011) and FQSP-MMS (Huang, 2013), etc. have been proposed. Characteristics of the related algorithms are shown in Table 1.

  • An efficient tree-based algorithm for mining sequential patterns with multiple minimum supports

    2013, Journal of Systems and Software
    Citation Excerpt :

    Keeping the ordering in subsequent operations and letting k-patterns be generated from (k − 1)-pattern that share the same prefix (i.e. the smallest minsup among all items in k-patterns will the same as that in the (k − 1)-pattern), the complete set of frequent patterns with MMSs can be easily discovered. The concept of MMSs has been widely applied to ARM, including generalized association rules with MMSs (Lui and Chung, 2000; Tseng and Lin, 2007), ARM with MMSs under the maximum constraint (Lee et al., 2005), fuzzy ARM with MMSs (Chen et al., 2009; Lee et al., 2004), and partial periodic pattern mining with MMSs (Chen et al., 2011). To improve the performance of ARM with MMSs, Hu and Chen (2006) extended the FP-growth algorithm to discover frequent patterns with MMSs.

View all citing articles on Scopus

Ming-Cheng Tseng is a Ph.D. student in the Department of Information Engineering at the I-Shou University, Taiwan. He received the B.S. and M.S. degrees in Information Management from the I-Shou University, Taiwan, in 1999 and 2002, respectively. His research interests include data mining, data warehousing, database systems, fuzzy theory and information systems.

Wen-Yang Lin is a Professor and the Chair of Department of Computer Science and Information Engineering in National University of Kaohsiung. He received his B.S. and M.S. both in Computer Science and Information Engineering from National Chiao-Tung University in 1988 and 1990, respectively. He then received his Ph.D. in Computer Science and Information Engineering from National Taiwan University in 1994. Dr. Lin has published more than 30 journal publications and 70 conference publications in the area of data warehousing, data mining, evolutionary computation, sparse matrix technology and large-scale supercomputing. Dr. Lin is a member of IEEE, the Taiwanese AI Association and the Institute of Information and Computing Machinery.

View full text