Simultaneous mining of frequent closed itemsets and their generators: Foundation and algorithm

doi:10.1016/j.engappai.2014.07.004

Engineering Applications of Artificial Intelligence

Volume 36, November 2014, Pages 64-80

https://doi.org/10.1016/j.engappai.2014.07.004 Get rights and content

Highlights

•
We proposed GENCLOSE to simultaneously mine generators and frequent closed itemsets.
•
A necessary and sufficient condition is applied to find generators.
•
Three extension operators are used to determine frequent closed itemsets.
•
The correctness of GENCLOSE is proven reliably.
•
Extensive experiments show that GENCLOSE outperforms existing well-known algorithms.

Abstract

Closed itemsets and their generators play an important role in frequent itemset and association rule mining. They allow a lossless representation of all frequent itemsets and association rules and facilitate mining. Some recent approaches discover frequent closed itemsets and generators separately. The Close algorithm mines them simultaneously but it needs to scan the database many times. Based on the properties and relationships of closed itemsets and generators, this study proposes GENCLOSE, an efficient algorithm for mining frequent closed itemsets and generators simultaneously. The level-wise search over an ItemsetObject–setGenerator–Tree enumerates the generators by using a necessary and sufficient condition to produce (i+1)-item generators from i-item generators. This condition, based on transaction (object) sets that can be efficiently implemented using diffsets, is very convenient and reliably proved. In the search, pre-closed itemsets are gradually extended using three proposed extension operators. It is shown that these itemsets produce the expected closed itemsets. Extensive experiments on many benchmark databases confirm the efficiency of the proposed approach.

Introduction

Association rule mining (Agrawal et al., 1993) from transaction databases is a fundamental technique in data mining. The task is to determine the association rules that satisfy the pre-defined minimum support and confidence from a given database. It was originally designed for market basket applications (Agrawal et al., 1993), but has been extended to various domains, such as risk management, telecommunication networks, and bio-sequences. Association rule mining has two phases (Agrawal and Srikant, 1994): (a) extraction of all frequent itemsets whose occurrences exceed the minimum support, and (b) generation of association rules that satisfy the given minimum confidence from the itemsets. If all frequent itemsets and their supports are known, association rule generation is straightforward. Hence, most researchers have concentrated on finding efficient methods for mining frequent itemsets.

The basic algorithms for mining frequent itemsets are Apriori, FP-growth, and Eclat. Apriori and its variations (Agrawal et al., 1993, Agrawal and Srikant, 1994) are based on the Apriori property, which states that every subset of a frequent itemset is also frequent, i.e., the support of an itemset never exceeds the supports of its subsets. Although this anti-monotone property helps significantly reduce the search space, Apriori-based algorithms are not efficient as they generate many redundant candidates, which increase the CPU and memory burden. Further, they have to scan the database multiple times. To overcome these issues, frequent pattern tree-based algorithms were proposed by Han and Pei (2000) and Han et al. (2004). The original database is compressed into a FP-tree or a similar tree structure. Using divide-and-conquer and depth-first search approaches, all large itemsets are mined from frequent 1-itemsets¹ without having to rescan the database. However, in interactive or incremental mining systems, where the users often change the minimum support and insert new transactions into the original database, FP-tree-inspired structures are unsuitable because the trees need to be rebuilt. Both the Apriori and FP-tree based methods work with a horizontal data format. Zaki proposed Eclat (Zaki, 2000) and DEclat (Zaki and Gouda, 2003) for mining with a vertical data format. These algorithms all show good performance for sparse databases with short itemsets, such as market databases. For dense databases, which produce long frequent itemsets, such as bio-sequences and telecommunication networks, the frequent itemset class can grow to be unwieldy even if the minimum support is large (Bayardo, 1998). A frequent itemset of length n produces 2ⁿ⁻¹ frequent non-empty, proper sub-itemsets. Hence, the generation of frequent itemsets not only has the large time complexity O(2^N) (where N is the number of items) but also produces many duplicates in the huge search space. Mining only maximal frequent itemsets is one of the solutions for overcoming the drawbacks mentioned above. Many algorithms have been proposed for mining such itemsets (Bayardo, 1998, Burdick et al., 2001). An itemset is maximal frequent if none of its proper supersets are frequent. The number of maximal itemsets is much smaller than that of all frequent itemsets (Zaki and Hsiao, 2005). Although all sub-itemsets of a maximal itemset are frequent, their actual supports are unknown. Further, since frequent itemsets can come from different maximal ones, it takes a lot of time to mine and delete the duplicates. Therefore, maximal frequent itemsets are unsuitable for frequent itemset and association rule generation.

A more suitable approach to overcome this difficulty is using the closures of itemsets, i.e., closed itemsets. The maximal itemset class is contained in the closed itemset class, which, in turn, is a subset of the itemset class. An extensive experimental evaluation conducted by Zaki and Hsiao (2005) showed that, for real databases, the number of frequent closed itemsets is about 10 times greater than that of maximal frequent itemsets, but about 100 times smaller than the cardinality of frequent itemsets. Hence, mining frequent closed itemsets has received the attention of many researchers (Pasquier et al., 1999, Pei et al., 2000, Wang et al., 2003, Singh et al., 2005). An itemset is closed iff² it is identical to its closure. This concept is similar to the concept lattice (Birkhoff, 1967, Wille, 1982, Wille, 1992, Davey and Priestley, 1994, Ganter and Wille, 1999) and has been recently applied (Boulicaut et al., 2003, Zaki, 2004, Pasquier et al., 2005). A generator (Pasquier et al., 1999, Szathmary et al., 2009) of an itemset is its minimal subset that has the same closure as its own. Generators are also called minimal generators (Zaki, 2004, Dong et al., 2005), key patterns (Bastide et al., 2000), and free-sets (Boulicaut et al., 2003). Although there are many definitions of closed itemsets and generators, they are equivalent (see Section 2.1).

Closed itemsets together with their lattice structure and generators, called $ℒ G_{A}$ , play an important role in both itemset and association rule mining. First, their cardinality is typically orders of magnitude much lower than that of all itemsets. Whenever the user changes the minimum support, all frequently closed itemsets and their generators can be quickly derived from $ℒ G_{A}$ . Second, two itemsets are equivalent if they have the same closure. In the study by Anh et al. (2012b), based on this equivalence relation, all itemsets were partitioned into disjoint equivalence classes. This decreases most duplication in the generation of all itemsets. Further, each class can be explored independently. In each class, a closed itemset is a maximum set, its generators are minimal subsets, and each remaining itemset has a unique representation through its closure and generators. Thus, the duplication in the generation of all frequent itemsets is completely removed. Hence, frequent closed itemsets together with their generators produce a lossless representation of all frequent itemsets. Third, many studies have used the generators of closed itemset for mining association rules (Balcazar, 2010, Bay et al., 2012, Anh et al., 2012a, Pasquier et al., 1999, Pasquier et al., 2005, Zaki, 2004, Tin and Anh, 2010a, Tin et al., 2010b). All rules can be obtained based on frequent closed itemsets and their generators. The lattice of frequent closed itemsets and generators with constraints is essential for the discovery of frequent itemsets and association rules with item constraints, especially when the minimum support and confidence thresholds and item constraints often change (Anh et al., 2011, Anh et al., 2012b, Anh et al., 2014, Hai et al., 2013, Hai et al., 2014).

The problem of mining frequent closed itemsets and generators is stated as follows: given a transaction database and a minimum support threshold, the task is to find all frequent closed itemsets together with their generators. The algorithms for mining closed itemsets can be divided into three approaches, namely generate-and-test, divide-and-conquer and hybrid. Many algorithms have been proposed for mining closed itemsets, including Close (Pasquier et al., 1999) (generate-and-test), Closet (Pei et al., 2000) and Closet+ (Wang et al., 2003) (divide-and-conquer), and Charm (Zaki and Hsiao, 2005) and CloseMiner (Singh et al., 2005) (hybrid). Algorithms for mining generators include Talky-G (Szathmary et al., 2009) and MinimalGenerator (Zaki, 2004). However, most of these algorithms discover frequent closed itemsets and generators separately. The exception is Close, which mines them simultaneously. However, its execution is computationally expensive. The present study proposes GENCLOSE, which has the following key features:

1)
Generators and frequent closed itemsets are simultaneously found using breadth-first search over an IOG-tree (Itemset-Object-set³-Generator tree).
2)
At each level, the generators are first mined using a necessary and sufficient condition to determine the class of (i+1)-generators from the class of i-generators (i≥1) based on the object-sets (or diffsets in practice).
3)
Three extension operators are proposed to extend itemsets toward their closures when mining generators.

The rest of this paper is organized as follows. Section 2 gives the background of closed sets, generators, and their definitions. Related work is also discussed in this section. Section 3 proposes some necessary and sufficient conditions for producing generators and three operators for extending generators to their closures. Based on these results, the GENCLOSE algorithm is constructed. In Section 4, GENCLOSE is compared to CharmLMG and DTouch, two well-known algorithms for finding closed itemsets and generators. The conclusion is given in Section 5. For better readability, some proofs and implemented techniques are given in the appendices.

Section snippets

Foundation of mining closed itemsets and their generators

Consider two non-empty sets: $O$ containing objects (or transactions) and $A$ containing all items (attributes) related to transactions o∈ $O$ . Let $ℛ$ be a binary relation in $O \times A$ . A triple $D = (O, A, ℛ)$ is called a transaction database or a binary database, or briefly a database. A set of items A⊆ $A$ and a set O⊆ $O$ are called an itemset and an object-set, respectively. Let $2^{O}$ and $2^{A}$ be the power sets of $O$ and $A$ . Two set functions of λ: $2^{O} \to 2^{A}$ and ρ: $2^{A} \to 2^{O}$ are determined as follows: ∀A⊆ $A$ , O⊆ $O$ , A≠∅, O≠∅, λ(O)={a

GENCLOSE: background and algorithm

Based on the foundation of mining closed itemsets and their generators given in Section 2, we propose GENCLOSE, which executes a breadth-first search over an IOG-tree to find generators. Simultaneously, the closed itemsets (their closures) are gradually explored. In each step, the algorithm tests an efficient necessary and sufficient condition in order to produce new generators from the generators mined in the previous step. Unlike Close, which needs to scan the database to compute the closed

Experimental study

Experiments were carried out on a personal computer with an Intel i5-2400 3-GHz CPU and 3.16 GB of RAM running under Linux and Cygwin. To test the correctness and performance of GENCLOSE, we compared it to CharmLMG and DTouch, two well-known algorithms for finding closed itemsets and generators. The source code of CharmLMG (includes Charm-L and MinimalGenerator) in C⁺⁺ is publicly available from the website of the author himself (http://www.cs.rpi.edu/~zaki). DTouch is a fast implementation of

Conclusions and future works

This study described the properties of closed sets and generators as well as their relations. The necessary and sufficient conditions to produce generators and operators to determine closed itemsets were derived. The GENCLOSE algorithm was proposed for simultaneously mining frequent closed itemsets and their generators. Using the IOG-tree framework, it uses the itemset object-set and generator space, which allow not only quick identification of new generators without accessing the database or

Acknowledgment

This work was funded by the National Foundation for Science and Technology Development (NAFOSTED) under Grant no. 102.05-2013.20.

References (37)

T. Hashem et al.
An efficient approach for mining cross-level closed itemsets and minimal association rules using closed itemset lattices
Expert Syst. Appl.
(2014)
N. Pasquier et al.
Efficient mining of association rules using closed item set lattices
Inf. Syst.
(1999)
R. Wille
Concept lattices and conceptual knowledge systems
Comput. Math. Appl.
(1992)
Agrawal, R., Imielinski, T., Swami, N., 1993. Mining association rules between sets of items in large databases. In:...
Agrawal, R., Srikant, R., 1994. Fast algorithms for mining association rules. In: Proceedings of the 20th International...
Anh, T., Hai, D., Tin, T., Bac, L., 2011. Efficient algorithms for mining frequent itemsets with constraint. In:...
Anh, T., Tin, T., Bac, L., 2012a. Structures of association rule set. In: Proceedings of ACIIDS 2012, LNAI 7197, Part...
Anh, T., Hai, D., Tin, T., Bac, L., 2012b. Mining frequent itemsets with dualistic constraints. In: Proceedings of...
T. Anh et al.
An approach for mining concurrently closed itemsets and generators
Adv. Comput. Methods Knowl. Eng.
(2013)
T. Anh et al.
An approach for mining association rules intersected with constraint itemsets
Adv. Intell. Syst. Comput.
(2014)

J.L. Balcazar

Redundancy, deduction schemes, and minimum-size base for association rules

Log. Methods Comput. Sci.

(2010)

Y. Bastide et al.

Mining frequent patterns with counting inference

SIGKDD Explor.

(2000)

V. Bay et al.

Mining most generalization association rules based on frequent closed itemset

Int. J. Innov. Comput., Inform. Control

(2012)

Bayardo, R.J., 1998. Efficiently mining long patterns from databases. In: Proceedings of the SIGMOD Conference 1998,...

G. Birkhoff

Lattice Theory

(1967)

J. Boulicaut et al.

Free-Sets: a condensed representation of boolean data for the approximation of frequency queries

Data Min. Knowl. Discov.

(2003)

Burdick, D., Calimlim, M., Gehrke, J., 2001. MAFIA: a maximal frequent itemset algorithm for transactional databases....

B.A. Davey et al.

Introduction to Lattices and Order

(1994)

Cited by (12)

FMaxCloHUSM: An efficient algorithm for mining frequent closed and maximal high utility sequences
2019, Engineering Applications of Artificial Intelligence
Mining all frequent high utility sequences ( $FHUS$ ) in quantitative sequential databases (QSDBs) is a generalization of the problem of mining all frequent sequences in non-quantitative sequence databases. In the last decade, the former problem has attracted the attention of many researchers because utility-based sequences are more informative and actionable for decision-making than frequent sequences. Although utility-based sequences have many real-life applications, their number is often very large, especially for low minimum utility thresholds and long sequences. It can thus be difficult for users to analyze them and mining utility-based sequences often requires much time and memory. To solve this problem, this paper proposes two concise representations of $FHUS$ , having a small cardinality that provides a concise summary of all $FHUS$ . Those representations are defined as two sets, $FCHUS$ and $FM a x HUS$ , of all frequent maximal and closed high utility sequences. To efficiently mine these concise representations, two width and depth pruning strategies are proposed for eliminating low utility sequences early and a novel local pruning strategy is proposed named $LPCHUS$ using a new extended measure on projected databases for eliminating non-closed and non-maximal high utility sequences early as well as their extensions. Based on these strategies and a novel data structure named SIDUL in vertical format, an algorithm named FMaxCloHUSM is designed for efficiently mining the sets of $FCHUS$ and $FM a x HUS$ , separately or simultaneously. To our best knowledge, this is the first algorithm for discovering these two concise representations. An experimental study conducted using both real-life and synthetic QSDBs shows that the proposed algorithm is efficient in terms of time and memory consumption, and that the developed strategies greatly reduce the search space.
An efficient dynamic superset bit-vector approach for mining frequent closed itemsets and their lattice structure
2017, Expert Systems with Applications
Citation Excerpt :
A variety of patterns have been discovered and explored in the pattern mining field. Frequent patterns (Ahmed, Tanbeer, Jeong, Lee, & Choi, 2012; Chen, Shu, Xia, & Deng, 2012; Lee & Wang, 2007; Liu et al., 2009; Tanbeer, Ahmed, Jeong, & Lee, 2009), maximal frequent patterns (Yun, Lee, & Ryu, 2014), weighted maximal frequent patterns (Lee, Yun, & Ryu, 2014; Yun & Lee, 2016), approximate maximal frequent patterns (Lee, Yun, Ryang, & Kim, 2016), closed frequent patterns (Le & Vo, 2015; Nori, Deypir, & Sadreddini, 2013; Tran, Truong, & Le, 2014; Yen, Lee, & Wang, 2014; Yun & Yoon, 2014), weighted sequential patterns (Yun, Pyun, & Yoon, 2015), correlated weight frequent patterns (Yun & Ryu, 2013) are some of the recently studied patterns. All of these patterns can be used to derive interesting association rules.
Fast discovery of association rules from millions of transactions in a variety of large databases has now become a major challenge in data mining domain. Frequent itemsets and frequent closed itemsets are key sources of mining association rules. Association rules can be mined efficiently from lattice of itemsets. Non-redundancy, accuracy, time efficiency and memory usage are the factors those need to be considered while developing algorithms in order to extract meaningful association rules. In this paper, we propose an efficient bit-vector approach that exploits dual properties, tidset and superset information of an itemset in order to mine frequent closed itemsets with their lattice structure. We introduce a new memory efficient data structure called dynamic superset bit-vector to establish the relationship among frequent closed itemsets in a lattice. The novelty of our approach is that it effectively uses dual data structures called a dynamic bit-vector and a dynamic superset bit-vector jointly in order to reduce the search space and eliminate the generators of non-closed itemsets. Our proposed approach efficiently builds up the subset-superset relationship among the closed itemsets of the lattice structure in a bottom-up manner. Extensive experiments using real-life datasets and pervasive performance comparison with the existing works prove the efficiency and scalability of our proposed approach.
A comprehensive review on updating concept lattices and its application in updating association rules
2021, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery
A Systematic Survey on High Utility Itemset Mining
2019, International Journal of Information Technology and Decision Making
A gossip based information fusion protocol for distributed frequent itemset mining
2018, Enterprise Information Systems
ICFDminer: An incremental algorithm of mining constant CFDs from dynamic databases
2018, ACM International Conference Proceeding Series

View all citing articles on Scopus

View full text