Selecting the best measures to discover quantitative association rules
Introduction
Hybrid artificial intelligent systems are rapidly gaining relevance in the scientific community due to the ability shown to deal with real-life problems [1], [13], [14]. These systems combine the use of both extracted knowledge and raw data to solve problems.
High volume of data can be stored nowadays; therefore, the use of efficient computational techniques has become a task of the utmost importance. In this context, the discovery of association rules (AR) – and particularly of quantitative association rules (QAR) in this work – is a popular methodology that allows the discovery of significant and apparently hidden relations among variables that form databases [3], [4], [27], [28].
The AR extraction process consists in using a non-supervised strategy to explore data properties. The main goal pursuit is, then, to find groups of attributes appearing frequently together in a dataset, so to provide comprehensive rules able to explain the existing relations among them.
A review of the literature reveals that there exist many algorithms to find AR. Most of them are based on the methods proposed by Agrawal et al. such as AIS [2], Apriori [3] or SETM [26].
Nonetheless, there is another big group of techniques to extract AR that are based on evolutionary algorithms (EA). EA are search algorithms that generate solutions for optimization problems using techniques inspired in natural evolution [18], [23], in which a population of abstract representations (chromosomes) of candidate solutions (individuals) evolves toward better solutions. EA can be used to discover AR, since they offer several advantages for knowledge extraction and for rule induction processes [7].
The algorithms that discover AR are normally assessed by means of certain interestingness measures that are able to evaluate the quality of a rule. From all of them, support and confidence highlight although lift, gain, certainty factor or leverage are also indicators that provide useful information about the extracted rules.
A review on AR learning based on the use of EA applied to boolean, categorical, quantitative and fuzzy variables has been described in [16]. However, as this work is focused on quantitative variables only the works using this kind of data are reviewed in this section. Table 1 summarizes the measures used for both evaluation and optimization in several works recently published. From the observation of this table, one conclusion can be easily drawn: There is no uniformity on the selection of measures to assess the algorithms' performance.
For instance, an EA called EARMGA was used in [45] to obtain QAR. The confidence was the only objective to be optimized in the fitness function. To achieve this goal, the authors avoided the specification of the actual minimum support, which can be considered the main contribution of the work.
The combination of confidence and support as only quality measures can be found in several works. Hence, the work introduced in [43] proposes an approach to discover QAR by clustering items of a dataset and projecting the clusters into the domains of the quantitative attributes to form meaningful intervals. Also, the algorithm called QuantMiner [40] proposed a genetic algorithm to mine QAR and optimize support and confidence, by using a fitness function based on the gain measure proposed in [19].
The extraction of QAR has also been applied to the data streams field. A classifier, whose main novelty lied on its adaptability to on-line gathered data was presented in [36]. By contrast, a multi-objective approach was proposed in [39]. The algorithm did not consider the minimum support and confidence and applied the FP-tree algorithm [25]. The fitness function maximized both support and confidence of the rule. Finally, some works such as [9] have proposed the use of an extended set of operators to mine general association rules and have evaluated the proposal in terms of confidence and support.
Additionally to support and confidence, the authors of the work introduced in [33] used the number of recovered instances to evaluate their approach, called GENAR. GENAR is an EA-based approach capable of obtaining an undetermined number of quantitative attributes in the antecedent of the rule. The same quality measures plus the comprehensibility and the amplitude of the intervals forming the rule were used to evaluate the GAR algorithm [34] (and in its extension [37]). The comprehensibility measure [22] is defined as the logarithm of the number of attributes in the consequent divided by the logarithm of the number of attributes in the rule. The amplitude measure is defined as the addition of the amplitudes for each interval of the attributes which belong to the rule divided by the number of attributes. The authors proposed another EA but, this time, it was necessary to select which attributes formed the antecedent and which one the consequent. Recently, a comparative analysis of the effectiveness in QAR extraction has been presented [7], in which the algorithms GENAR [33], GAR [34] and EARMGA [45] were applied to two different datasets showing their efficiency in terms of coverage and confidence. These five features (support, confidence, recovered, comprehensibility and amplitude) were also evaluated on a multi-objective Pareto-based EA called MODENAR [6]. The same authors proposed an optimization metaheuristic based on rough particle swarm techniques to mine QAR [5]. The fitness function was composed of four different objectives in both works: Support, confidence, comprehensibility of the rule (to be maximized) and the amplitude of the intervals that forms the rule (to be minimized).
Alternatively, the support and confidence have been combined with the interest to form fitness functions in some works [27], [28]. Their main particularity lies on the use of genetic algorithms to mine fuzzy association rules. The authors in [29] went one step further and used, in addition to the three measures aforementioned, the amplitude of the intervals as well as the comprehensibility of the rule to form the fitness function.
Finally, the authors in [15] proposed a fast and scalable multi-objective GA for mining AR from large datasets using parallel processing and a homogeneous dedicated network of workstations. The confidence, comprehensibility and interest were the measures maximized.
There is no unanimity in choosing the set of quality measures to be optimized, thus it becomes essential to propose a methodology to automatically select a subset of them whose optimization leads to the optimization of the entire set. Therefore, this work is focused on finding relations among different quality measures in order to determine which measures must be optimized in the fitness function. This way, it is expected that better rules can be extracted, regarding the whole set of measures and not only those included in the fitness function. To fulfill this task, this subset is generated according to a principal component analysis (PCA). The QARGA algorithm [31] has been used to check the new fitness function composed of the selected measures versus the original fitness function based on a weighting scheme that involved several evaluation measures such as support, confidence, number of attributes and amplitude of intervals of the attributes belonging to the rules. In particular, datasets from the public Bilkent University Function Approximation (BUFA) repository [24] have been used. Likewise, four different real-world datasets have been analyzed, specifically, datasets from biological, meteorological and seismological nature.
The remainder of the paper is as follows. Section 2 introduces the foundations underlying QAR. It also explores the most used measures found in the literature as well as some of their inherent drawbacks. Section 3 provides the statistical analysis conducted to select the target measures and a brief description of the QARGA's main features. The methodology introduced in previous Section is applied to a wide variety of datasets in order to determine the fitness function in Section 4. The results obtained by QARGA using both original and new fitness functions along with statistical tests can also be found in this section. Finally, Section 5 summarizes the achievements reached in this work and the conclusions drawn.
Section snippets
Quantitative association rules
This section provides a brief description on QAR including some definitions. In addition, some measures of interest proposed in the literature and some of their flaws are presented.
Methodology
This section presents the methodology based on PCA in order to determine a fitness function, which simultaneously optimizes the maximum number of quality measures possible. Also, a brief description of QARGA and its fitness function is provided.
Experimental results
The results obtained by the application of the QARGA algorithm with both original and new fitness functions to the datasets described in Section 4.2 are presented in this section. The goal of this experimentation is to show that better rules could be obtained when the measures selected with the help of PCA are considered in the fitness function.
First, a summary of the main parameters of configuration for QARGA can be found in Section 4.1. Section 4.2 provides a detailed description of all used
Conclusions
A study based on the PCA method has been proposed to obtain the set of measures to be included in a fitness function to discover QAR. In particular, the support of the rule, confidence, gain and accuracy are the measures that best summarize all the considered measures. Real-world climatological datasets, biological datasets and public datasets retrieved from the BUFA repository have been used to test the quality of the rules discovered by QARGA using a new fitness function that includes the set
Acknowledgments
The financial support from the Spanish Ministry of Science and Technology under project TIN2011-28956-C02 is acknowledged.
Maria Martínez-Ballesteros received the M.Sc. degree in Computer Engineering and the Ph.D. degree in Computer Science from the University of Seville, Spain, in 2012. Since 2009 she has been with the Department of Computer Science, University of Seville, where she is currently Assistant Professor. Her primary areas of interest are data mining, machine learning techniques, association rules and evolutionary computation.
References (45)
- et al.
Hybrid learning machines
Neurocomputing
(2009) - et al.
MODENARmulti-objective differential evolution algorithm for mining numeric association rules
Applied Soft Computing
(2008) - et al.
An algorithm to mine general association rules from tabular data
Information Sciences
(2009) - et al.
A genome-wide transcriptional analysis of the mitotic cell cycle
Molecular Cell
(1998) - et al.
Hybrid intelligent algorithms and applications
Information Sciences
(2010) - et al.
New trends and applications on hybrid artificial intelligence systems
Neurocomputing
(2012) - et al.
Multi-objective rule mining using genetic algorithms
Information Science
(2004) - et al.
Genetic algorithm based framework for mining fuzzy association rules
Fuzzy Sets and Systems
(2005) - et al.
Evolutionary association rules for total ozone content modeling from satellite observations
Chemometrics and Intelligent Laboratory Systems
(2011) - et al.
Pattern recognition to forecast seismic time series
Expert Systems with Applications
(2010)
An evolutionary algorithm to discover quantitative association rules from huge databases without the need for an a priori discretization
Expert Systems with Applications
Multi objective association rule mining with genetic algorithm without specifying minimum support and minimum confidence
Expert Systems with Applications
A model of inexact reasoning in medicine
Mathematical Biosciences
Genetic algorithm-based strategy for identifying association rules without specifying actual minimum support
Expert Systems with Applications
An efficient genetic algorithm for automated mining of both positive and negative quantitative association rules
Soft Computing
Rough particle swarm optimization and its applications in data mining
Soft Computing
Analysis of the effectiveness of the genetic algorithms based on extraction of association rules
Fundamenta Informaticae
Keela software tool to assess evolutionary algorithms for data mining problems
Soft Computing
Cited by (32)
Scalability achievements for enumerative biclustering with online partitioning: Case studies involving mixed-attribute datasets
2021, Engineering Applications of Artificial IntelligenceCitation Excerpt :Besides, a bicluster may have a high internal consistency but not be relevant for some user’s applications. In the literature, there are many general metrics that can help the user to filter a biclustering solution (Henriques and Madeira, 2018; Horta and Campello, 2014; Kuznetsov and Makhalova, 2018; Lee et al., 2015; Martínez-Ballesteros et al., 2014; Zaki et al., 2014; Zimmermann, 2015), but the user can also incorporate domain knowledge to the filter (Henriques and Madeira, 2016). Moreover, a biclustering solution can have many biclusters with large overlap between them, which makes it even harder to pick up the ones that matter.
A survey of evolutionary computation for association rule mining
2020, Information SciencesDiscovering generalized design knowledge using a multi-objective evolutionary algorithm with generalization operators
2020, Expert Systems with ApplicationsCitation Excerpt :The major distinctions that can be made on different EAs for rule learning are based on the formulation of the objective functions, and how individual rules are encoded in chromosomes. The desired characteristics of extracted rules that can be used to define the fitness function include high statistical significance, low complexity, and high level of “interestingness”, which may be defined using various metrics (Lenca, Vaillant, Meyer, & Lallich, 2007; Martínez-Ballesteros, Martínez-Álvarez, Troncoso, & Riquelme, 2014; Sokolova & Lapalme, 2009; Tan, Kumar, & Srivastava, 2004). In classical ARM, support and confidence are typically used as measures of statistical significance and rule strength, respectively (Agrawal et al., 1993; Han et al., 2004).
Machine learning techniques to discover genes with potential prognosis role in Alzheimer's disease using different biological sources
2017, Information FusionCitation Excerpt :In this work, we have used the support, confidence, leverage, accuracy and gain measures to optimize and evaluate the quality of the QAR obtained by GarNet. The description and the mathematical definition can be found in [27]. Methods based on QAR have not been used to find gene associations in AD, although the technique has been used to analyze gene expression data [28] and other AD features [29].
Obtaining optimal quality measures for quantitative association rules
2016, NeurocomputingCitation Excerpt :Several datasets have been retrieved from the public BUFA repository [11]. In particular, the thirty-five public datasets from BUFA repository used in [17]. Note that Buying, Country, College, Education, Read and Usnews College have been preprocessed using K-means Imputation method proposed in [9] (available in the KEEL tool [6]) in order to deal with missing values.
Fundamentals of Data Science: Theory and Practice
2023, Fundamentals of Data Science: Theory and Practice
Maria Martínez-Ballesteros received the M.Sc. degree in Computer Engineering and the Ph.D. degree in Computer Science from the University of Seville, Spain, in 2012. Since 2009 she has been with the Department of Computer Science, University of Seville, where she is currently Assistant Professor. Her primary areas of interest are data mining, machine learning techniques, association rules and evolutionary computation.
Francisco Martínez-Álvarez received the M.Sc. degree in Telecommunications Engineering from the University of Seville, and the Ph.D. degree in Computer Engineering from the Pablo de Olavide University. He has been with the Department of Computer Science at the Pablo de Olavide University since 2007, where he is currently an Assistant Professor. His primary areas of interest are time series analysis, data mining, and evolutionary computation.
Alicia Troncoso was born in Carmona, Spain, in 1974. She received the Ph.D. degree in Computer Science from the University of Seville, Spain, in 2005. From 2002 to 2005, she was with the Department of Computer Science, University of Seville. Presently, she is an Associate Professor at the Pablo de Olavide University of Seville. Her primary areas of interest are time series analysis, control and forecasting, and optimization techniques.
Jose C. Riquelme received the M.Sc. degree in Mathematics and the Ph.D. degree in Computer Science from the University of Seville, Spain. Since 1987 he has been with the Department of Computer Science, University of Seville, where he is currently Full Professor. His primary areas of interest are data mining, machine learning techniques, and evolutionary computation.