Selecting the best measures to discover quantitative association rules

doi:10.1016/j.neucom.2013.01.056

Neurocomputing

Volume 126, 27 February 2014, Pages 3-14

https://doi.org/10.1016/j.neucom.2013.01.056

Abstract

The majority of the existing techniques to mine association rules typically use the support and the confidence to evaluate the quality of the rules obtained. However, these two measures may not be sufficient to properly assess rule quality due to some inherent drawbacks. A review of the literature reveals that there exist many measures to evaluate the quality of the rules, but that the simultaneous optimization of all measures is complex and might lead to poor results. In this work, a principal component analysis is applied to a set of measures that evaluate the quality of quantitative association rules. From this analysis, a reduced subset of measures has been selected to be included in the fitness function in order to obtain better values for the whole set of quality measures, and not only for those included in the fitness function. This is a general-purpose methodology and can, therefore, be applied to the fitness function of any algorithm. To validate whether better results are obtained when using the fitness function composed of the subset of measures proposed here, the existing QARGA algorithm has been applied to a wide variety of datasets. Finally, a comparative analysis with the results obtained by applying QARGA with the original fitness function is provided, showing a remarkable improvement when the new one is used.

Introduction

Hybrid artificial intelligent systems are rapidly gaining relevance in the scientific community due to their demonstrated ability to deal with real-life problems [1], [13], [14]. These systems combine the use of both extracted knowledge and raw data to solve problems.

High volumes of data can be stored nowadays; therefore, the use of efficient computational techniques has become a task of the utmost importance. In this context, the discovery of association rules (AR) – and particularly of quantitative association rules (QAR) in this work – is a popular methodology that allows the discovery of significant and apparently hidden relations among the variables that form databases [3], [4], [27], [28].

The AR extraction process consists in using a non-supervised strategy to explore data properties. The main goal pursued is, then, to find groups of attributes that frequently appear together in a dataset, so as to provide comprehensible rules able to explain the existing relations among them.

A review of the literature reveals that there exist many algorithms to find AR. Most of them are based on the methods proposed by Agrawal et al. such as AIS [2], Apriori [3] or SETM [26].

Nonetheless, there is another large group of techniques to extract AR that are based on evolutionary algorithms (EA). EA are search algorithms that generate solutions for optimization problems using techniques inspired by natural evolution [18], [23], in which a population of abstract representations (chromosomes) of candidate solutions (individuals) evolves toward better solutions. EA can be used to discover AR, since they offer several advantages for knowledge extraction and for rule induction processes [7].

The algorithms that discover AR are normally assessed by means of certain interestingness measures that are able to evaluate the quality of a rule. Among them, support and confidence stand out, although lift, gain, certainty factor and leverage are also indicators that provide useful information about the extracted rules.
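To make these measures concrete, the following is a minimal sketch of how they could be computed for one quantitative rule over a small in-memory dataset. The helper names (`covers`, `rule_measures`), the interval-based rule representation and the toy data are illustrative assumptions, not part of the original work. Support, confidence, lift and leverage follow their usual textbook definitions; gain is taken here as confidence minus the support of the consequent, and the certainty factor follows the usual piecewise form, both of which are assumptions about the exact variants intended.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]      # closed interval [low, high]
Condition = Dict[str, Interval]     # attribute name -> interval (hypothetical encoding)
Row = Dict[str, float]              # one record of the dataset


def covers(condition: Condition, row: Row) -> bool:
    """True when every attribute of the condition lies inside its interval."""
    return all(low <= row[attr] <= high for attr, (low, high) in condition.items())


def rule_measures(rows: List[Row], antecedent: Condition, consequent: Condition) -> Dict[str, float]:
    """Quality measures of the rule antecedent => consequent (rows must be non-empty)."""
    n = len(rows)
    p_a = sum(covers(antecedent, r) for r in rows) / n   # support of the antecedent
    p_c = sum(covers(consequent, r) for r in rows) / n   # support of the consequent
    p_ac = sum(covers(antecedent, r) and covers(consequent, r) for r in rows) / n
    confidence = p_ac / p_a if p_a > 0 else 0.0

    # Certainty factor, piecewise on whether the rule raises or lowers belief in C.
    if confidence > p_c:
        certainty = (confidence - p_c) / (1.0 - p_c)
    elif confidence < p_c:
        certainty = (confidence - p_c) / p_c
    else:
        certainty = 0.0

    return {
        "support": p_ac,
        "confidence": confidence,
        "lift": confidence / p_c if p_c > 0 else 0.0,
        "leverage": p_ac - p_a * p_c,
        "gain": confidence - p_c,            # assumed variant of the gain measure
        "certainty_factor": certainty,
    }


# Toy dataset (invented values) and the rule x in [1, 3] => y in [10, 20].
rows = [
    {"x": 1, "y": 10},
    {"x": 2, "y": 20},
    {"x": 3, "y": 30},
    {"x": 8, "y": 15},
]
measures = rule_measures(rows, {"x": (1, 3)}, {"y": (10, 20)})
print(measures)
```

In this toy case the rule has reasonable support and confidence, yet its lift is below one and its leverage is negative, which illustrates why support and confidence alone may give an incomplete picture of a rule.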

A review of AR learning based on the use of EA applied to boolean, categorical, quantitative and fuzzy variables has been described in [16]. However, as this work is focused on quantitative variables, only the works using this kind of data are reviewed in this section. Table 1 summarizes the measures used for both evaluation and optimization in several recently published works. From the observation of this table, one conclusion can be easily drawn: there is no uniformity in the selection of measures to assess the algorithms' performance.

For instance, an EA called EARMGA was used in [45] to obtain QAR. The confidence was the only objective to be optimized in the fitness function. To achieve this goal, the authors avoided the specification of the actual minimum support, which can be considered the main contribution of the work.

The combination of confidence and support as the only quality measures can be found in several works. For example, the work introduced in [43] proposes an approach to discover QAR by clustering items of a dataset and projecting the clusters into the domains of the quantitative attributes to form meaningful intervals. Also, the algorithm called QuantMiner [40] proposed a genetic algorithm to mine QAR and optimize support and confidence, by using a fitness function based on the gain measure proposed in [19].

The extraction of QAR has also been applied to the data streams field. A classifier whose main novelty lay in its adaptability to on-line gathered data was presented in [36]. By contrast, a multi-objective approach was proposed in [39]. The algorithm did not consider the minimum support and confidence and applied the FP-tree algorithm [25]. The fitness function maximized both support and confidence of the rule. Finally, some works such as [9] have proposed the use of an extended set of operators to mine general association rules and have evaluated the proposal in terms of confidence and support.

In addition to support and confidence, the authors of the work introduced in [33] used the number of recovered instances to evaluate their approach, called GENAR. GENAR is an EA-based approach capable of obtaining an undetermined number of quantitative attributes in the antecedent of the rule. The same quality measures plus the comprehensibility and the amplitude of the intervals forming the rule were used to evaluate the GAR algorithm [34] (and its extension [37]). The comprehensibility measure [22] is defined as the logarithm of the number of attributes in the consequent divided by the logarithm of the number of attributes in the rule. The amplitude measure is defined as the sum of the amplitudes of the intervals of the attributes that belong to the rule, divided by the number of attributes. The authors proposed another EA but, this time, it was necessary to select which attributes formed the antecedent and which one the consequent. Recently, a comparative analysis of the effectiveness in QAR extraction has been presented [7], in which the algorithms GENAR [33], GAR [34] and EARMGA [45] were applied to two different datasets, showing their efficiency in terms of coverage and confidence. These five features (support, confidence, recovered, comprehensibility and amplitude) were also evaluated on a multi-objective Pareto-based EA called MODENAR [6]. The same authors proposed an optimization metaheuristic based on rough particle swarm techniques to mine QAR [5]. The fitness function was composed of four different objectives in both works: support, confidence and comprehensibility of the rule (to be maximized) and the amplitude of the intervals that form the rule (to be minimized).
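The two structural measures described above can be sketched in a few lines, following the verbal definitions given in this paragraph. The function names and the `offset` parameter are hypothetical: with `offset=0` the code matches the definition as worded here, while a positive offset is an assumed convenience that avoids a logarithm of one when the consequent has a single attribute.

```python
import math
from typing import Dict, Tuple


def comprehensibility(n_consequent: int, n_rule: int, offset: int = 0) -> float:
    """log(attributes in consequent) / log(attributes in rule).

    `offset` (an assumption, not from the text) is added inside both logarithms.
    Returns 0.0 when the denominator would be log(1) = 0.
    """
    denominator = math.log(n_rule + offset)
    if denominator == 0:
        return 0.0
    return math.log(n_consequent + offset) / denominator


def amplitude(intervals: Dict[str, Tuple[float, float]]) -> float:
    """Sum of the interval widths of the rule's attributes divided by their number."""
    return sum(high - low for low, high in intervals.values()) / len(intervals)


# Hypothetical rule with two attributes: x in [1, 3] and y in [10, 20].
rule_intervals = {"x": (1.0, 3.0), "y": (10.0, 20.0)}
mean_width = amplitude(rule_intervals)
ratio = comprehensibility(n_consequent=2, n_rule=4)
print(mean_width, ratio)
```

Note that the amplitude is computed on raw interval widths, so attributes measured on very different scales would need normalizing before the values are comparable; whether the cited works normalize is not stated here.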

Alternatively, the support and confidence have been combined with the interest to form fitness functions in some works [27], [28]. Their main particularity lies in the use of genetic algorithms to mine fuzzy association rules. The authors in [29] went one step further and used, in addition to the three aforementioned measures, the amplitude of the intervals as well as the comprehensibility of the rule to form the fitness function.

Finally, the authors in [15] proposed a fast and scalable multi-objective GA for mining AR from large datasets using parallel processing and a homogeneous dedicated network of workstations. The confidence, comprehensibility and interest were the measures maximized.

There is no unanimity in choosing the set of quality measures to be optimized; thus, it becomes essential to propose a methodology to automatically select a subset of them whose optimization leads to the optimization of the entire set. Therefore, this work is focused on finding relations among different quality measures in order to determine which measures must be optimized in the fitness function. This way, it is expected that better rules can be extracted, regarding the whole set of measures and not only those included in the fitness function. To fulfill this task, this subset is generated according to a principal component analysis (PCA). The QARGA algorithm [31] has been used to compare the new fitness function, composed of the selected measures, with the original fitness function, which is based on a weighting scheme involving several evaluation measures such as support, confidence, number of attributes and amplitude of the intervals of the attributes belonging to the rules. In particular, datasets from the public Bilkent University Function Approximation (BUFA) repository [24] have been used. Likewise, four different real-world datasets of a biological, meteorological and seismological nature have been analyzed.
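As a rough illustration of what a fitness function based on a weighting scheme looks like, the sketch below combines precomputed measure values linearly. It is a generic example under stated assumptions, not QARGA's actual formula: the function name, the weights and the convention of penalizing a measure through a negative weight are all hypothetical.

```python
from typing import Dict


def weighted_fitness(measures: Dict[str, float], weights: Dict[str, float]) -> float:
    """Linear combination of the measures named in `weights`.

    A positive weight rewards a measure and a negative weight penalizes it.
    Raises KeyError if a weighted measure is missing from `measures`.
    """
    return sum(weight * measures[name] for name, weight in weights.items())


# Hypothetical rule measures and weights (invented values).
rule = {"support": 0.5, "confidence": 0.8, "amplitude": 0.2}
weights = {"support": 1.0, "confidence": 2.0, "amplitude": -1.0}
score = weighted_fitness(rule, weights)
print(score)
```

Changing which measures appear in `weights` is then the only step needed to swap one set of optimized measures for another, which is the kind of substitution this work evaluates.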

The remainder of the paper is organized as follows. Section 2 introduces the foundations underlying QAR. It also explores the most used measures found in the literature as well as some of their inherent drawbacks. Section 3 provides the statistical analysis conducted to select the target measures and a brief description of QARGA's main features. In Section 4, the methodology introduced in the previous section is applied to a wide variety of datasets in order to determine the fitness function. The results obtained by QARGA using both the original and the new fitness functions, along with statistical tests, can also be found in that section. Finally, Section 5 summarizes the achievements reached in this work and the conclusions drawn.

Section snippets

Quantitative association rules

This section provides a brief description of QAR, including some definitions. In addition, some measures of interest proposed in the literature and some of their flaws are presented.

Methodology

This section presents the methodology based on PCA in order to determine a fitness function, which simultaneously optimizes the maximum number of quality measures possible. Also, a brief description of QARGA and its fitness function is provided.
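The general idea of using PCA to pick representative measures can be sketched as follows. This is only an illustration under strong assumptions: the selection rule used here (for each leading principal component, keep the not-yet-chosen measure with the largest absolute loading) is one simple heuristic and is not necessarily the criterion applied in this work. All function names and the toy values are hypothetical, and the eigenvectors are obtained with plain power iteration and deflation so that the example needs only the standard library.

```python
import math
import random
from typing import List, Tuple


def standardize(rows: List[List[float]]) -> List[List[float]]:
    """Column-wise z-scores; assumes every column has non-zero variance."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) for j in range(m)]
    return [[(r[j] - means[j]) / stds[j] for j in range(m)] for r in rows]


def correlation_matrix(z: List[List[float]]) -> List[List[float]]:
    """Correlation matrix of already standardized data."""
    n, m = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(m)] for i in range(m)]


def top_component(mat: List[List[float]], iters: int = 500, seed: int = 0) -> Tuple[float, List[float]]:
    """Dominant eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in mat]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(a * b for a, b in zip(row, v)) for row in mat]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    eigval = sum(v[i] * sum(mat[i][j] * v[j] for j in range(len(v))) for i in range(len(v)))
    return eigval, v


def select_measures(rows: List[List[float]], names: List[str], k: int) -> List[str]:
    """Pick k measures (k <= len(names)), one per leading principal component."""
    mat = correlation_matrix(standardize(rows))
    chosen: List[str] = []
    for _ in range(k):
        eigval, v = top_component(mat)
        by_loading = sorted(range(len(names)), key=lambda j: -abs(v[j]))
        pick = next(j for j in by_loading if names[j] not in chosen)
        chosen.append(names[pick])
        # Deflation: remove the component just found before looking for the next one.
        mat = [[mat[i][j] - eigval * v[i] * v[j] for j in range(len(v))] for i in range(len(v))]
    return chosen


# Each row holds the measure values of one rule (invented numbers).
names = ["support", "confidence", "lift"]
rules = [[1, 2, 1], [2, 4, -1], [3, 6, 1], [4, 8, -1], [5, 10, 1], [6, 12, -1]]
print(select_measures(rules, names, k=2))
```

In the toy table the first two columns are perfectly correlated, so a single one of them is enough to represent both, and the second pick falls on the remaining, weakly related column. That is the intuition behind optimizing a reduced subset of measures in place of the full set.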

Experimental results

The results obtained by the application of the QARGA algorithm with both original and new fitness functions to the datasets described in Section 4.2 are presented in this section. The goal of this experimentation is to show that better rules could be obtained when the measures selected with the help of PCA are considered in the fitness function.

First, a summary of the main parameters of configuration for QARGA can be found in Section 4.1. Section 4.2 provides a detailed description of all used

Conclusions

A study based on the PCA method has been proposed to obtain the set of measures to be included in a fitness function to discover QAR. In particular, the support of the rule, confidence, gain and accuracy are the measures that best summarize all the considered measures. Real-world climatological datasets, biological datasets and public datasets retrieved from the BUFA repository have been used to test the quality of the rules discovered by QARGA using a new fitness function that includes the set

Acknowledgments

The financial support from the Spanish Ministry of Science and Technology under project TIN2011-28956-C02 is acknowledged.

Maria Martínez-Ballesteros received the M.Sc. degree in Computer Engineering and the Ph.D. degree in Computer Science from the University of Seville, Spain, in 2012. Since 2009 she has been with the Department of Computer Science, University of Seville, where she is currently Assistant Professor. Her primary areas of interest are data mining, machine learning techniques, association rules and evolutionary computation.

References (45)

A. Abraham et al., Hybrid learning machines, Neurocomputing (2009)

B. Alatas et al., MODENAR: multi-objective differential evolution algorithm for mining numeric association rules, Applied Soft Computing (2008)

S. Ayubi et al., An algorithm to mine general association rules from tabular data, Information Sciences (2009)

R.J. Cho et al., A genome-wide transcriptional analysis of the mitotic cell cycle, Molecular Cell (1998)

E. Corchado et al., Hybrid intelligent algorithms and applications, Information Sciences (2010)

E. Corchado et al., New trends and applications on hybrid artificial intelligence systems, Neurocomputing (2012)

A. Ghosh et al., Multi-objective rule mining using genetic algorithms, Information Sciences (2004)

M. Kaya et al., Genetic algorithm based framework for mining fuzzy association rules, Fuzzy Sets and Systems (2005)

M. Martínez-Ballesteros et al., Evolutionary association rules for total ozone content modeling from satellite observations, Chemometrics and Intelligent Laboratory Systems (2011)

A. Morales-Esteban et al., Pattern recognition to forecast seismic time series, Expert Systems with Applications (2010)

V. Pachón Álvarez et al., An evolutionary algorithm to discover quantitative association rules from huge databases without the need for an a priori discretization, Expert Systems with Applications (2012)

H.R. Qodmanan et al., Multi objective association rule mining with genetic algorithm without specifying minimum support and minimum confidence, Expert Systems with Applications (2011)

E.H. Shortliffe et al., A model of inexact reasoning in medicine, Mathematical Biosciences (1975)

X. Yan et al., Genetic algorithm-based strategy for identifying association rules without specifying actual minimum support, Expert Systems with Applications (2009)

R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: Proceedings...

R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases, in: Proceedings of the...

B. Alatas et al., An efficient genetic algorithm for automated mining of both positive and negative quantitative association rules, Soft Computing (2006)

B. Alatas et al., Rough particle swarm optimization and its applications in data mining, Soft Computing (2008)

J. Alcalá-Fdez et al., Analysis of the effectiveness of the genetic algorithms based on extraction of association rules, Fundamenta Informaticae (2010)

J. Alcalá-Fdez et al., KEEL: a software tool to assess evolutionary algorithms for data mining problems, Soft Computing (2009)

S. Brin, R. Motwani, C. Silverstein, Beyond market baskets: generalizing association rules to correlations, in:...

S. Brin, R. Motwani, J.D. Ullman, S. Tsur, Dynamic itemset counting and implication rules for market basket data, in:...


Francisco Martínez-Álvarez received the M.Sc. degree in Telecommunications Engineering from the University of Seville, and the Ph.D. degree in Computer Engineering from the Pablo de Olavide University. He has been with the Department of Computer Science at the Pablo de Olavide University since 2007, where he is currently an Assistant Professor. His primary areas of interest are time series analysis, data mining, and evolutionary computation.

Alicia Troncoso was born in Carmona, Spain, in 1974. She received the Ph.D. degree in Computer Science from the University of Seville, Spain, in 2005. From 2002 to 2005, she was with the Department of Computer Science, University of Seville. Presently, she is an Associate Professor at the Pablo de Olavide University of Seville. Her primary areas of interest are time series analysis, control and forecasting, and optimization techniques.

Jose C. Riquelme received the M.Sc. degree in Mathematics and the Ph.D. degree in Computer Science from the University of Seville, Spain. Since 1987 he has been with the Department of Computer Science, University of Seville, where he is currently Full Professor. His primary areas of interest are data mining, machine learning techniques, and evolutionary computation.
