Simultaneous mining of frequent closed itemsets and their generators: Foundation and algorithm
Introduction
Association rule mining (Agrawal et al., 1993) from transaction databases is a fundamental technique in data mining. The task is to determine the association rules that satisfy the pre-defined minimum support and confidence from a given database. It was originally designed for market basket applications (Agrawal et al., 1993), but has been extended to various domains, such as risk management, telecommunication networks, and bio-sequences. Association rule mining has two phases (Agrawal and Srikant, 1994): (a) extraction of all frequent itemsets whose occurrences exceed the minimum support, and (b) generation of association rules that satisfy the given minimum confidence from the itemsets. If all frequent itemsets and their supports are known, association rule generation is straightforward. Hence, most researchers have concentrated on finding efficient methods for mining frequent itemsets.
The basic algorithms for mining frequent itemsets are Apriori, FP-growth, and Eclat. Apriori and its variations (Agrawal et al., 1993, Agrawal and Srikant, 1994) are based on the Apriori property, which states that every subset of a frequent itemset is also frequent, i.e., the support of an itemset never exceeds the supports of its subsets. Although this anti-monotone property helps significantly reduce the search space, Apriori-based algorithms are not efficient as they generate many redundant candidates, which increase the CPU and memory burden. Further, they have to scan the database multiple times. To overcome these issues, frequent pattern tree-based algorithms were proposed by Han and Pei (2000) and Han et al. (2004). The original database is compressed into a FP-tree or a similar tree structure. Using divide-and-conquer and depth-first search approaches, all large itemsets are mined from frequent 1-itemsets1 without having to rescan the database. However, in interactive or incremental mining systems, where the users often change the minimum support and insert new transactions into the original database, FP-tree-inspired structures are unsuitable because the trees need to be rebuilt. Both the Apriori and FP-tree based methods work with a horizontal data format. Zaki proposed Eclat (Zaki, 2000) and DEclat (Zaki and Gouda, 2003) for mining with a vertical data format. These algorithms all show good performance for sparse databases with short itemsets, such as market databases. For dense databases, which produce long frequent itemsets, such as bio-sequences and telecommunication networks, the frequent itemset class can grow to be unwieldy even if the minimum support is large (Bayardo, 1998). A frequent itemset of length n produces 2n−1 frequent non-empty, proper sub-itemsets. Hence, the generation of frequent itemsets not only has the large time complexity O(2N) (where N is the number of items) but also produces many duplicates in the huge search space. Mining only maximal frequent itemsets is one of the solutions for overcoming the drawbacks mentioned above. Many algorithms have been proposed for mining such itemsets (Bayardo, 1998, Burdick et al., 2001). An itemset is maximal frequent if none of its proper supersets are frequent. The number of maximal itemsets is much smaller than that of all frequent itemsets (Zaki and Hsiao, 2005). Although all sub-itemsets of a maximal itemset are frequent, their actual supports are unknown. Further, since frequent itemsets can come from different maximal ones, it takes a lot of time to mine and delete the duplicates. Therefore, maximal frequent itemsets are unsuitable for frequent itemset and association rule generation.
A more suitable approach to overcome this difficulty is using the closures of itemsets, i.e., closed itemsets. The maximal itemset class is contained in the closed itemset class, which, in turn, is a subset of the itemset class. An extensive experimental evaluation conducted by Zaki and Hsiao (2005) showed that, for real databases, the number of frequent closed itemsets is about 10 times greater than that of maximal frequent itemsets, but about 100 times smaller than the cardinality of frequent itemsets. Hence, mining frequent closed itemsets has received the attention of many researchers (Pasquier et al., 1999, Pei et al., 2000, Wang et al., 2003, Singh et al., 2005). An itemset is closed iff2 it is identical to its closure. This concept is similar to the concept lattice (Birkhoff, 1967, Wille, 1982, Wille, 1992, Davey and Priestley, 1994, Ganter and Wille, 1999) and has been recently applied (Boulicaut et al., 2003, Zaki, 2004, Pasquier et al., 2005). A generator (Pasquier et al., 1999, Szathmary et al., 2009) of an itemset is its minimal subset that has the same closure as its own. Generators are also called minimal generators (Zaki, 2004, Dong et al., 2005), key patterns (Bastide et al., 2000), and free-sets (Boulicaut et al., 2003). Although there are many definitions of closed itemsets and generators, they are equivalent (see Section 2.1).
Closed itemsets together with their lattice structure and generators, called , play an important role in both itemset and association rule mining. First, their cardinality is typically orders of magnitude much lower than that of all itemsets. Whenever the user changes the minimum support, all frequently closed itemsets and their generators can be quickly derived from . Second, two itemsets are equivalent if they have the same closure. In the study by Anh et al. (2012b), based on this equivalence relation, all itemsets were partitioned into disjoint equivalence classes. This decreases most duplication in the generation of all itemsets. Further, each class can be explored independently. In each class, a closed itemset is a maximum set, its generators are minimal subsets, and each remaining itemset has a unique representation through its closure and generators. Thus, the duplication in the generation of all frequent itemsets is completely removed. Hence, frequent closed itemsets together with their generators produce a lossless representation of all frequent itemsets. Third, many studies have used the generators of closed itemset for mining association rules (Balcazar, 2010, Bay et al., 2012, Anh et al., 2012a, Pasquier et al., 1999, Pasquier et al., 2005, Zaki, 2004, Tin and Anh, 2010a, Tin et al., 2010b). All rules can be obtained based on frequent closed itemsets and their generators. The lattice of frequent closed itemsets and generators with constraints is essential for the discovery of frequent itemsets and association rules with item constraints, especially when the minimum support and confidence thresholds and item constraints often change (Anh et al., 2011, Anh et al., 2012b, Anh et al., 2014, Hai et al., 2013, Hai et al., 2014).
The problem of mining frequent closed itemsets and generators is stated as follows: given a transaction database and a minimum support threshold, the task is to find all frequent closed itemsets together with their generators. The algorithms for mining closed itemsets can be divided into three approaches, namely generate-and-test, divide-and-conquer and hybrid. Many algorithms have been proposed for mining closed itemsets, including Close (Pasquier et al., 1999) (generate-and-test), Closet (Pei et al., 2000) and Closet+ (Wang et al., 2003) (divide-and-conquer), and Charm (Zaki and Hsiao, 2005) and CloseMiner (Singh et al., 2005) (hybrid). Algorithms for mining generators include Talky-G (Szathmary et al., 2009) and MinimalGenerator (Zaki, 2004). However, most of these algorithms discover frequent closed itemsets and generators separately. The exception is Close, which mines them simultaneously. However, its execution is computationally expensive. The present study proposes GENCLOSE, which has the following key features:
- 1)
Generators and frequent closed itemsets are simultaneously found using breadth-first search over an IOG-tree (Itemset-Object-set3-Generator tree).
- 2)
At each level, the generators are first mined using a necessary and sufficient condition to determine the class of (i+1)-generators from the class of i-generators (i≥1) based on the object-sets (or diffsets in practice).
- 3)
Three extension operators are proposed to extend itemsets toward their closures when mining generators.
The rest of this paper is organized as follows. Section 2 gives the background of closed sets, generators, and their definitions. Related work is also discussed in this section. Section 3 proposes some necessary and sufficient conditions for producing generators and three operators for extending generators to their closures. Based on these results, the GENCLOSE algorithm is constructed. In Section 4, GENCLOSE is compared to CharmLMG and DTouch, two well-known algorithms for finding closed itemsets and generators. The conclusion is given in Section 5. For better readability, some proofs and implemented techniques are given in the appendices.
Section snippets
Foundation of mining closed itemsets and their generators
Consider two non-empty sets: containing objects (or transactions) and containing all items (attributes) related to transactions o∈. Let be a binary relation in . A triple is called a transaction database or a binary database, or briefly a database. A set of items A⊆ and a set O⊆ are called an itemset and an object-set, respectively. Let and be the power sets of and . Two set functions of λ: and ρ: are determined as follows: ∀A⊆ , O⊆, A≠∅, O≠∅, λ(O)={a
GENCLOSE: background and algorithm
Based on the foundation of mining closed itemsets and their generators given in Section 2, we propose GENCLOSE, which executes a breadth-first search over an IOG-tree to find generators. Simultaneously, the closed itemsets (their closures) are gradually explored. In each step, the algorithm tests an efficient necessary and sufficient condition in order to produce new generators from the generators mined in the previous step. Unlike Close, which needs to scan the database to compute the closed
Experimental study
Experiments were carried out on a personal computer with an Intel i5-2400 3-GHz CPU and 3.16 GB of RAM running under Linux and Cygwin. To test the correctness and performance of GENCLOSE, we compared it to CharmLMG and DTouch, two well-known algorithms for finding closed itemsets and generators. The source code of CharmLMG (includes Charm-L and MinimalGenerator) in C++ is publicly available from the website of the author himself (http://www.cs.rpi.edu/~zaki). DTouch is a fast implementation of
Conclusions and future works
This study described the properties of closed sets and generators as well as their relations. The necessary and sufficient conditions to produce generators and operators to determine closed itemsets were derived. The GENCLOSE algorithm was proposed for simultaneously mining frequent closed itemsets and their generators. Using the IOG-tree framework, it uses the itemset object-set and generator space, which allow not only quick identification of new generators without accessing the database or
Acknowledgment
This work was funded by the National Foundation for Science and Technology Development (NAFOSTED) under Grant no. 102.05-2013.20.
References (37)
- et al.
An efficient approach for mining cross-level closed itemsets and minimal association rules using closed itemset lattices
Expert Syst. Appl.
(2014) - et al.
Efficient mining of association rules using closed item set lattices
Inf. Syst.
(1999) Concept lattices and conceptual knowledge systems
Comput. Math. Appl.
(1992)- Agrawal, R., Imielinski, T., Swami, N., 1993. Mining association rules between sets of items in large databases. In:...
- Agrawal, R., Srikant, R., 1994. Fast algorithms for mining association rules. In: Proceedings of the 20th International...
- Anh, T., Hai, D., Tin, T., Bac, L., 2011. Efficient algorithms for mining frequent itemsets with constraint. In:...
- Anh, T., Tin, T., Bac, L., 2012a. Structures of association rule set. In: Proceedings of ACIIDS 2012, LNAI 7197, Part...
- Anh, T., Hai, D., Tin, T., Bac, L., 2012b. Mining frequent itemsets with dualistic constraints. In: Proceedings of...
- et al.
An approach for mining concurrently closed itemsets and generators
Adv. Comput. Methods Knowl. Eng.
(2013) - et al.
An approach for mining association rules intersected with constraint itemsets
Adv. Intell. Syst. Comput.
(2014)
Redundancy, deduction schemes, and minimum-size base for association rules
Log. Methods Comput. Sci.
Mining frequent patterns with counting inference
SIGKDD Explor.
Mining most generalization association rules based on frequent closed itemset
Int. J. Innov. Comput., Inform. Control
Lattice Theory
Free-Sets: a condensed representation of boolean data for the approximation of frequency queries
Data Min. Knowl. Discov.
Introduction to Lattices and Order
Cited by (12)
FMaxCloHUSM: An efficient algorithm for mining frequent closed and maximal high utility sequences
2019, Engineering Applications of Artificial IntelligenceAn efficient dynamic superset bit-vector approach for mining frequent closed itemsets and their lattice structure
2017, Expert Systems with ApplicationsCitation Excerpt :A variety of patterns have been discovered and explored in the pattern mining field. Frequent patterns (Ahmed, Tanbeer, Jeong, Lee, & Choi, 2012; Chen, Shu, Xia, & Deng, 2012; Lee & Wang, 2007; Liu et al., 2009; Tanbeer, Ahmed, Jeong, & Lee, 2009), maximal frequent patterns (Yun, Lee, & Ryu, 2014), weighted maximal frequent patterns (Lee, Yun, & Ryu, 2014; Yun & Lee, 2016), approximate maximal frequent patterns (Lee, Yun, Ryang, & Kim, 2016), closed frequent patterns (Le & Vo, 2015; Nori, Deypir, & Sadreddini, 2013; Tran, Truong, & Le, 2014; Yen, Lee, & Wang, 2014; Yun & Yoon, 2014), weighted sequential patterns (Yun, Pyun, & Yoon, 2015), correlated weight frequent patterns (Yun & Ryu, 2013) are some of the recently studied patterns. All of these patterns can be used to derive interesting association rules.
A comprehensive review on updating concept lattices and its application in updating association rules
2021, Wiley Interdisciplinary Reviews: Data Mining and Knowledge DiscoveryA Systematic Survey on High Utility Itemset Mining
2019, International Journal of Information Technology and Decision MakingA gossip based information fusion protocol for distributed frequent itemset mining
2018, Enterprise Information SystemsICFDminer: An incremental algorithm of mining constant CFDs from dynamic databases
2018, ACM International Conference Proceeding Series