2.1 Basic association rule mining algorithms
Apriori, the first ARM algorithm, was proposed by Agrawal [7], and it successfully reduced the size of the search space with the downward closure (Apriori) property, which states that a \(k{\text{-itemset}}\) can be frequent only if all of its subsets are frequent. The Apriori algorithm is characterized by a feature called candidate generation, whereby candidate \(\left( {k + 1} \right){\text{-itemsets}}\) are generated iteratively by joining any two frequent \(k{\text{-itemsets}}\) that share a common prefix of length \(\left( {k - 1} \right)\). The support of each candidate itemset is then computed to determine whether the candidate is frequent. The algorithm terminates when no more frequent itemsets can be generated.
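The candidate-generation step can be conveyed with a short sketch (Python; the `apriori_gen` name and the sorted-tuple representation are illustrative assumptions, not the implementation of [7]):

```python
from itertools import combinations

def apriori_gen(frequent_k):
    """Generate (k+1)-candidates from frequent k-itemsets (sorted tuples)
    by joining pairs that share a common (k-1)-prefix, then pruning any
    candidate with an infrequent k-subset (downward closure)."""
    frequent_set = set(frequent_k)
    candidates = []
    for a, b in combinations(sorted(frequent_k), 2):
        if a[:-1] == b[:-1]:                  # common prefix of length k-1
            cand = a + (b[-1],)
            # downward closure check: every k-subset must be frequent
            if all(sub in frequent_set
                   for sub in combinations(cand, len(a))):
                candidates.append(cand)
    return candidates
```

For example, joining the frequent 2-itemsets {A,B}, {A,C}, and {B,C} yields the single candidate {A,B,C}; if {B,C} were infrequent, the candidate would be pruned by the subset check.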
Based on the standard Apriori algorithm, several improved variations were proposed. The performance-enhancing strategies include the hashing technique [
8], partitioning technique [
9], sampling approach [
10], dynamic counting [
11], and incremental mining [
12]. As previous studies demonstrated, these Apriori-based approaches achieved good performance when the dataset was sparse and the discovered patterns were short. However, such methods incur nontrivial costs from generating huge numbers of candidate itemsets and from extra scans over the dataset for support computation.
We then saw the emergence of new algorithms, such as FP (frequent pattern)-growth, that avoid candidate generation altogether [13]. First, an FP-tree, which retains the association information among itemsets, is constructed according to the frequencies of the 1-itemsets. Next, patterns of different lengths are generated recursively by concatenating suffix patterns, starting from the frequent 1-patterns down to the least frequent itemsets, with frequent patterns mined from a conditional FP-tree, i.e., a subtree consisting of the set of prefix paths in the FP-tree that co-occur with the suffix. In other words, this method involves only frequent pattern growth instead of Apriori-like generate-and-test. In this sense, it applies a partitioning-based divide-and-conquer strategy, and efficiency studies demonstrated that it substantially reduces search time. Subsequently, multiple algorithms were proposed as extensions of the FP-growth approach, such as generating frequent itemsets in a depth-first manner [
14], mining with devised hyperstructure [
15], pattern-growth mining by traversing FP-tree-like structure in both directions (top-down and bottom-up) [
16], and pattern-growth mining with tree structure in an array-based implementation form [
17,
18]. Recursive searches on the tree could result in enormous costs in certain cases; nevertheless, such methods did lay a solid foundation for the further application of tree structure in association rule mining algorithms.
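To make the construction step concrete, the following minimal sketch (Python; the class and function names are illustrative, not from [13]) builds an FP-tree by inserting each transaction with its infrequent items removed and the remainder sorted by descending global frequency. The header table, linking all nodes carrying the same item, is what conditional FP-trees are later assembled from:

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    """Insert each transaction, with infrequent items removed and the
    rest sorted by descending global frequency, into a prefix tree."""
    freq = Counter(i for t in transactions for i in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_support}
    root = Node(None, None)
    header = defaultdict(list)      # item -> its nodes, for bottom-up mining
    for t in transactions:
        items = sorted((i for i in set(t) if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for i in items:
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += 1
    return root, header
```

Shared prefixes of frequent items collapse into shared paths, which is why the tree is typically far smaller than the transaction list it summarizes.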
Apriori and FP-growth both adopt a horizontal data format for mining frequent itemsets. In contrast, Zaki proposed the Equivalence CLASS Transformation (Eclat) algorithm, which employs a vertical data format [19]. Eclat also utilizes Apriori-style candidate generation to produce \(\left( {k + 1} \right){\text{-itemset}}\) candidates. In Eclat, however, the support of a candidate can be computed by simply intersecting the sample ID sets of the two corresponding frequent \(k{\text{-itemsets}}\). More simply stated, the support of any itemset can be obtained directly from the vertical sample ID lists without any further computation. Thus, additional scans of the original dataset are avoided, again reducing search time.
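The vertical representation and intersection-based support computation can be sketched as follows (Python; the helper names are illustrative, with integer TIDs standing in for sample IDs):

```python
def vertical_format(transactions):
    """Convert horizontal transactions into a vertical item -> TID-set map."""
    tids = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tids.setdefault(item, set()).add(tid)
    return tids

def support(itemset, tids):
    """Support of an itemset = size of the intersection of its members'
    TID-sets; no rescan of the original data is needed."""
    it = iter(itemset)
    common = set(tids[next(it)])
    for item in it:
        common &= tids[item]
    return len(common)
```

Intersecting two k-itemsets' TID-sets gives the TID-set of their (k+1)-candidate directly, which is exactly the saving Eclat exploits.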
Generally speaking, finding all frequent itemsets of a given dataset can be regarded as a process consisting of search space traversal, itemset support computation, and search path pruning. One common traversal strategy is breadth-first search (BFS): in Apriori, for example, the frequent \(\left( {k + 1} \right){\text{-itemsets}}\) are not generated until all frequent \(k{\text{-itemsets}}\) have been discovered. Another common strategy is depth-first search (DFS): in FP-growth, for example, longer frequent patterns are generated recursively until no more can be produced. Common strategies for support computation include counting (e.g., Apriori, FP-growth) and intersection (e.g., Eclat). In sum, the methodologies adopted by the three basic association rule mining algorithms described above (Apriori [
7], FP-growth [
13] and Eclat [
19]) serve as landmarks in the development of association rule mining and constitute the basis for subsequent association algorithms.
2.2 Maximal frequent itemset mining and frequent closed itemset mining
According to the Apriori property, all \(\left( {2^k - 2} \right)\) proper nonempty subsets (i.e., excluding the itemset itself and \(\varphi\)) of a frequent \(k{\text{-itemset}}\) are also frequent. This characteristic results in massive, unnecessary redundancy among frequent itemsets. To limit the redundancy, two alternative concepts were advanced, namely maximal frequent itemset mining and frequent closed itemset mining [
20].
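Given the complete set of frequent itemsets with their supports, the two concepts can be stated operationally, as in this sketch (Python; `maximal_and_closed` is an illustrative helper, not from [20]): an itemset is maximal if no proper superset is frequent at all, and closed if no proper superset has the same support.

```python
def maximal_and_closed(freq):
    """freq: dict mapping frozenset itemset -> support.
    An itemset is closed if no proper superset has the same support,
    and maximal if no proper superset is frequent at all."""
    maximal, closed = set(), set()
    for s, sup in freq.items():
        supers = [t for t in freq if s < t]   # frequent proper supersets
        if not supers:
            maximal.add(s)
        if all(freq[t] < sup for t in supers):
            closed.add(s)
    return maximal, closed
```

Every maximal itemset is closed, but not vice versa, so the closed itemsets form the larger (yet still lossless) of the two condensed representations.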
Many algorithms were developed to mine these two categories of itemsets. For example, MaxMiner, the very first study on maximal frequent itemset mining, was proposed in 1998 [
21]. Based on Apriori, MaxMiner adopted a breadth-first search (BFS) strategy and reduced the search space by both superset and subset frequency pruning. As another efficient maximal frequent itemset mining method, MAFIA improved support counting efficiency by adopting vertical bitmaps to compress the transaction id list [
22].
For frequent closed itemset mining, numerous methods have been proposed since 1999, when A-Close, an Apriori-based frequent closed itemset mining approach, was reported [
23]. CLOSET explored frequent closed itemset mining based on FP-tree structure [
24]. Another typical frequent closed itemset mining approach is CHARM, which adopts a hybrid search strategy together with the diffsets technique (a compact form of \(tID\)-list information) and a hash-based “non-closed itemset disposal” approach to enhance both computation and memory efficiency [
25]. In addition, AFOPT-close presented a method which can adaptively use three different structures, including array, AFOPT-tree (FP-tree like) and buckets, to represent conditional databases according to their respective densities [
26]. To integrate previous effective approaches and some newly developed techniques, CLOSET+ was proposed [
27]. After thorough performance studies on diverse datasets, CLOSET+ was considered one of the most efficient methods of its time.
Previous studies have shown that algorithms of these two categories are usually more efficient than their predecessors. However, maximal frequent itemsets have a critical defect: the supports of their sub-itemsets may differ from their own. This, in turn, results in extra scans over the dataset for support computation and makes maximal frequent itemsets ultimately unfit for rule extraction. Frequent closed itemset mining does not encounter such problems, essentially because the support of any frequent itemset can be recovered exactly as the support of its smallest closed superset (its closure). Furthermore, frequent closed itemsets can be regarded as a compressed form of the complete set of frequent itemsets without information loss. Based on these features and properties, we can conclude that frequent closed itemset mining is more likely to play a vital role in the development of association rule mining.
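The lossless-compression property can be illustrated with a small sketch (Python; `support_from_closed` is a hypothetical helper): the support of any frequent itemset equals the largest support among its frequent closed supersets, so the complete frequent itemsets never need to be stored and the data never rescanned.

```python
def support_from_closed(itemset, closed):
    """closed: dict mapping each frequent closed itemset (frozenset)
    to its support. The support of any frequent itemset equals the
    maximum support among its closed supersets (i.e., the support of
    its closure); returns None if the itemset is not frequent."""
    sups = [sup for c, sup in closed.items() if itemset <= c]
    return max(sups) if sups else None
```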
In summary, frequent closed itemsets can provide analytical power equivalent to that of the complete set of frequent itemsets, but at a much smaller size. Numerous studies have verified the higher efficiency and better interpretability obtained by frequent closed itemset mining. However, most of the above-mentioned algorithms adopt a column-enumeration strategy, so when they are applied to high-dimensional datasets, the search space tends to expand exponentially with the number of features, making the computational cost prohibitive. It is therefore easy to see why most of the algorithms discussed thus far cannot be applied to high-dimensional datasets, again underscoring the need to develop algorithms suited to such data to keep pace with advancements in sequencing and computing technology.
2.3 Algorithms applicable to high-dimensional datasets
From previous sections, we can see that the applicability of the above-mentioned algorithms to high-dimensional datasets is limited. In this section, we will discuss approaches better able to meet this challenge.
Among them, approaches incorporating frequent closed itemset mining and row enumeration can serve as a possible solution. This idea was first explored in 2003 [
28]. Based on data in vertical format, CARPENTER constructs a row-enumeration tree and adopts a depth-first search (DFS) strategy to traverse it. Additionally, several pruning strategies are employed during the search process to cut off the branches incapable of generating frequent closed itemsets. Previous study [
28] has shown that CARPENTER gained better performance, compared to its rivals as CHARM and CLOSE+, when applied to high-dimensional microarray datasets [
27].
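The core idea of row enumeration can be conveyed by an exhaustive sketch (Python; illustrative only — CARPENTER traverses a pruned row-enumeration tree rather than enumerating all rowsets as done here). The intersection of the item-sets of any rowset of size k is an itemset supported by at least k rows, and every itemset obtained this way is closed:

```python
from itertools import combinations

def row_enumeration(rows, min_support):
    """rows: list of frozensets of items, one per sample (transposed view).
    Enumerate rowsets of size >= min_support; each rowset's intersection
    is a closed itemset, and the largest rowset size recorded for an
    itemset is its support."""
    found = {}
    for k in range(min_support, len(rows) + 1):
        for rowset in combinations(range(len(rows)), k):
            items = frozenset.intersection(*(rows[r] for r in rowset))
            if items:
                found[items] = max(found.get(items, 0), k)
    return found
```

Because high-dimensional biomedical datasets have far fewer rows than columns, enumerating (pruned) rowsets is dramatically cheaper than enumerating column subsets.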
Other methods adopting similar strategies were developed. For instance, RERII is similar to CARPENTER, but instead of searching for frequent itemsets starting from the whole original dataset, RERII explores frequent closed itemsets in the opposite direction, starting from the nodes that represent the complete rowsets [29]. This strategy has the potential to enhance overall performance by reducing the cost of searching short rowsets and \(1{\text{-item}}\) rowsets.
To make CARPENTER more adaptable to complex datasets, COBBLER integrated the strategies of both CARPENTER and CLOSET+ [
30]. Accordingly, COBBLER can dynamically switch between row-enumeration and column-enumeration to meet estimated cost conditions. Its efficiency has been verified in experiments over datasets with high dimensionality and a relatively large number of rows.
TD-CLOSE adopts a top-down row-enumeration search strategy that provides stronger pruning power than the bottom-up style adopted by CARPENTER [
31]. To guarantee closeness during the mining process, an additional closeness checking method was included in TD-CLOSE. Moreover, in 2009, an improved version of TD-CLOSE, called TTD-CLOSE, was proposed [
32]. With optimized pruning strategy and data structure, TTD-CLOSE obtained better performance than the original TD-CLOSE.
To extend the applications of frequent pattern mining, new classification methods based on ARM over high-dimensional datasets, such as FARMER and TOPKRGS, emerged [
33,
34]. With additional class information attached to the original datasets, both algorithms can extract classification rules in the form of
\(X \Rightarrow C\), where
\(C\) is a class label, and
\(X\) is a set of items. Based on the previous analysis, a rule extracted from a frequent closed itemset consisting of \(k\) items in total also implies the existence of \(\left( {2^k - 2} \right)\) other rules. To reduce rule redundancy, these two algorithms extract only “interesting” rule groups instead of all rules. Specifically, FARMER adopts the concept of a rule group, consisting of a unique upper-bound rule and a set of lower-bound rules, to cluster the complete set of rules, while TOPKRGS selects only the most significant top-k covering rule groups. In addition, FARMER reinforces its interestingness measures with the Chi-square statistic in addition to support and confidence, while TOPKRGS adopts a prefix-tree structure to speed up frequency computation and utilizes a dynamic minimum-confidence generation strategy to better fit different datasets.
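The \(2^k - 2\) figure comes from counting the nonempty proper subsets of the \(k\) items: each such subset can serve as the antecedent of a further rule in the same rule group, as this small sketch shows (Python; the helper name is illustrative):

```python
from itertools import combinations

def antecedent_subsets(itemset):
    """All nonempty proper subsets of a k-itemset: each can serve as the
    antecedent X' of a derived rule X' => C, giving 2^k - 2 of them."""
    items = sorted(itemset)
    k = len(items)
    return [frozenset(c)
            for r in range(1, k)              # sizes 1 .. k-1
            for c in combinations(items, r)]
```

Rule-group mining reports only a representative (e.g., the upper-bound rule) instead of materializing this exponential set.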
To enhance mining efficiency, HDminer employs effective search space partitioning and pruning strategies. HDminer gradually narrows down the search space by pruning the false-valued cells based on a space-partition tree, instead of accumulating the true-valued cells as the FP-tree- or enumeration-tree-based methods do. Because there are fewer false-valued cells than true-valued cells, HDminer works much more efficiently than those methods [35]. HDminer shows its superiority especially on synthetic data and dense microarray data.
To summarize, previous studies have verified the relatively high efficiency of row-enumeration algorithms for mining frequent closed itemset over high-dimensional datasets. However, with the advancement of biomedical data acquisition techniques, the volume of data has grown larger and larger, and the row size of a certain dataset may, therefore, become as large as the column size. In this case, methods such as COBBLER or TD-CLOSE, as described above, may still have trouble handling such large datasets. Consequently, instead of sequential algorithms, increased attention has focused on parallel and distributed algorithms. Actually, parallel association rule mining algorithms were proposed quite early in the 1990s [
36,
37]. However, because the effectiveness of these algorithms was limited by the complexity of workload balancing, fault tolerance, and data distribution strategies, as well as by interconnection costs and the limited hardware capacity of the time, their wide application was suppressed. With the rise of cloud and distributed computing techniques, however, parallelized association rule mining algorithms have been revived and given the opportunity to show their power. In particular, as the most widely recognized large-scale data analysis technique, Hadoop has been broadly utilized in modern biomedical studies [
38,
39]. Characterized by mapper and reducer functions [
40], the Hadoop MapReduce framework is especially well suited to processing gigabytes, or even terabytes, of data. Moreover, by hiding the details of the underlying controls, Hadoop enables users to concentrate solely on algorithm design. All of these features make Hadoop a promising candidate to propel the development of ARM in the “big data” era.
Typical examples of adapting Apriori on Hadoop MapReduce include SPC (Single Pass Counting), FPC (Fixed Pass Combined-Counting) and DPC (Dynamic Pass Counting) [
41]. These algorithms share the common procedure of distributing data to different mappers and counting supports in parallel, but they differ in candidate generation. SPC generates frequent itemsets of only a single length per MapReduce phase, whereas FPC and DPC generate frequent itemsets of several different lengths per phase. In addition, as the names suggest, the number of lengths combined per phase is a fixed parameter in FPC, while in DPC it is determined dynamically from the number of candidate itemsets generated at each MapReduce phase. Other Hadoop-based Apriori algorithms, which work in a similar manner but with different forms of implementation, were also proposed [
42,
43].
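The per-phase counting that these algorithms share can be sketched as plain mapper/reducer functions (Python; the signatures are illustrative — a real Hadoop job would express the same logic through the MapReduce API):

```python
from collections import defaultdict

def mapper(transaction, candidates):
    """Emit (candidate, 1) for each candidate itemset contained in the
    transaction -- the per-split counting step of an SPC-style phase."""
    items = set(transaction)
    return [(c, 1) for c in candidates if set(c) <= items]

def reducer(pairs, min_support):
    """Sum the emitted counts per candidate and keep the frequent ones."""
    counts = defaultdict(int)
    for cand, n in pairs:
        counts[cand] += n
    return {c: n for c, n in counts.items() if n >= min_support}
```

Each MapReduce phase thus replaces one full dataset scan of sequential Apriori, with the shuffle stage grouping counts by candidate.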
To address the inability of traditional association rule mining algorithms to meet the efficiency and scalability demands of mining very large datasets, FP-growth, as one example, was parallelized on the Hadoop framework using the MapReduce model, better meeting the requirements of big data mining and efficiently extracting frequent itemsets and association rules from large datasets [44]. MRFP-Growth (MapReduce Frequent Pattern Growth) was also implemented to solve the problem of discovering frequent patterns in massive datasets, and its efficiency and performance were shown to be improved over other mining algorithms [
45]. Implementations of Eclat on MapReduce were also proposed, such as Dist-Eclat, which focuses on speed, and its optimized version BigFIM, which adopts a hybrid approach incorporating both Apriori and Eclat, making BigFIM better suited to very large datasets [46]. Experiments on real datasets have demonstrated their scalability. A MapReduce algorithm for mining closed frequent itemsets has been implemented as well [
47].
Another noteworthy approach is PARMA [
48]. It applies parallel mining algorithms to randomly selected subsets of the original large dataset. Owing to the random-sample property, its mined results can be regarded as an approximation of the exact results over the whole dataset. The approximation quality was verified by both statistical analysis and real-world application.
Based on CARPENTER, a new algorithm called PaMPa-HD was developed. This algorithm also adopts a depth-first search process, but the process is broken up into independent subprocesses, each applying a centralized version of CARPENTER to autonomously evaluate a subtree of the search space. The final closed itemsets of each subprocess can then be extracted to compute the complete closed itemset result [
49]. Since the subprocesses are independent, they can be executed in parallel by means of a distributed computing platform such as Hadoop.
To achieve compressed storage and avoid building conditional pattern bases, FiDoop was put forward. FiDoop utilizes a frequent itemset ultrametric tree. In FiDoop, the mappers independently decompose itemsets, the reducers perform combination operations by constructing small ultrametric trees, and the actual mining of these trees is performed separately, which speeds up mining performance for high-dimensional dataset analysis [
50]. Extensive experiments using real-world celestial spectral data indicate that FiDoop is scalable and efficient.
In addition to the huge computational cost, the set of patterns derived from high-dimensional datasets is typically enormous, and this growth makes the patterns difficult to use effectively. Therefore, a new methodology aimed at mining approximate or representative patterns, instead of the full set, appeared as a solution. The most typical approach, known as Pattern-Fusion, is based on a novel concept called the core pattern. Pattern-Fusion is able to discover approximate colossal patterns by fusing core patterns; it improves on random seed selection by choosing seed patterns that are large and distant from one another [
51].
Recently, Graphics Processing Units (GPUs) have emerged as one of the most widely used parallel hardware platforms for solving large, complex scientific problems. An approach that benefits from the massively parallel power of the GPU, using a large number of threads to evaluate association rules, was proposed [52]. A new algorithm called MWBSO-MEGPU was then proposed; this method combines GPU and cluster computing to improve a Bees Swarm Optimization (BSO) metaheuristic. Several tests have been carried out to evaluate this approach, and the results reveal that MWBSO-MEGPU outperforms other HPC-based ARM approaches in terms of speedup when exploring the Webdocs instance [
53].