Weighted association rule mining via a graph based connectivity model
Introduction
Association rule mining was introduced by Agrawal et al. [1] and is widely used to derive meaningful rules. It aims to extract interesting correlations, frequent patterns and associations among sets of items in transaction databases. Although the original motivation for association rules was to analyze supermarket transaction data, they have since been applied across a wide range of application areas, including bioinformatics, text mining, web usage mining, telecommunications, medical disease diagnosis and education.
However, classical rule mining techniques such as Apriori and its variants are vulnerable to the rule explosion problem. Rule explosion occurs because the number of possible combinations of rule terms grows exponentially with the number of items in the dataset. This places a burden on the decision maker, who has to sift through a large number of rules in order to find those that are of interest. One simple way of dealing with this problem is to set the rule support threshold high enough to produce a more manageable rule base, but this has the major drawback of omitting rare rules that occur with low support but high confidence. For example, consider the rule {SD15_sold = high, MD15_bought = very low} → {fraudster}, which was uncovered in an online auction environment, where SD15_sold refers to the standard deviation of prices of items sold during the last 15 days of trading and MD15_bought refers to the median price of items bought during the same period. This rule states that fraudsters are associated with a high degree of price fluctuation in items sold in the last 15 days of trading while at the same time having very low or no buying activity over the same period. Fraudsters are rare in an online auction environment, but when fraud does occur it leads to financial loss for the buyer, as the cash is sent online in advance. Knowledge of such fraud patterns enables auction houses to monitor such traders closely and minimize the possibility that they could perpetrate fraud.
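The tension described above can be made concrete with a small sketch. The toy transaction database below is invented for illustration: the fraud rule holds in only 10% of transactions (low support), so any support threshold above 0.1 would prune it, yet it is never contradicted (confidence 1.0).

```python
# Hypothetical toy data: a rare rule with low support but full confidence.
transactions = (
    [{"SD15_sold=high", "MD15_bought=very_low", "fraudster"}] * 2
    + [{"SD15_sold=low", "MD15_bought=high"}] * 18
)

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """support(antecedent ∪ consequent) / support(antecedent)."""
    return support(antecedent | consequent, db) / support(antecedent, db)

antecedent = {"SD15_sold=high", "MD15_bought=very_low"}
consequent = {"fraudster"}
print(support(antecedent | consequent, transactions))   # 0.1
print(confidence(antecedent, consequent, transactions))  # 1.0
```

A minimum support threshold of, say, 0.2 would discard this rule outright, regardless of its perfect confidence.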
There are two basic approaches that have been used by decision makers to assist association rule mining algorithms in producing interesting rules. The first method uses rule constraints such as aggregation constraints [14], [15]. For example, a user interested in mining dairy products may specify that all rules containing frequent itemsets involving dairy products whose total value exceeds $100 be returned. The use of such constraints can be effective in situations where prior domain-specific knowledge is available in advance of the mining process.
Another very different approach to filtering interesting rules is to attach weights to items [5], [18], [23], [25]. High weights are attached to items of high importance, such as high-profit items in a market basket scenario. When such domain-specific knowledge is available, it provides an accurate representation of the domain. However, many application domains exist where such knowledge is either unavailable or impractical to obtain. Furthermore, even if the weights are known in advance, the very specification of these weights constrains the rules generated to encapsulate only known patterns, thus inhibiting new insights from being uncovered.
Recent research has shown that it is possible to deduce the relative importance of items based on their interactions with each other [9]. The weight assignment process is underpinned by a “Valency model” that was proposed by Koh et al. [9]. The model considers two factors: purity and connectivity. The purity of an item is determined by the number of items that it is associated with over the entire transaction database, whereas connectivity represents the strength of the interactions between items. The Valency scheme does not assume any prior knowledge over and above what is needed by the classical Apriori approach. The weighting scheme produced by the Valency model was able to distinguish high value rules from those with lesser value. However, the major limitation of this scheme is that it only takes into account the immediate locality when assigning weights to items. While immediate locality is a good starting point, we believe that the field of interaction needs to extend well beyond the immediate neighborhood in order to capture the chain of interactions between items in the transaction graph before an item’s weight can truly be determined. Our results in Section 7 show that this is indeed the case when we compare our extended version of the Valency model with the original version.
In this paper we generalize and extend the Valency model in a number of different ways. Firstly, we expand the field of interaction of an item to include non-neighboring items. The weight of any given item is recursively defined as a linear function of the weights of all items that it interacts with. This forms a system of linear equations, which is solved by Gaussian elimination. Secondly, we redefine the manner in which Purity, one of the key elements of the original Valency model, is computed. Thirdly, we subject the new model to extensive experimentation with both real-world and synthetic data. The use of synthetic data allows us to control key parameters in the data, which in turn enables us to demonstrate key properties of the extended model.
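The recursive weight definition can be sketched as a small linear system. In this hypothetical example (the interaction matrix A and the base contributions b are invented for illustration, not taken from the paper), each item's weight is a local base value plus a weighted sum of the weights of the items it interacts with, i.e. w = b + Aw, which rearranges to (I − A)w = b:

```python
import numpy as np

# Hypothetical 4-item system: w[i] = b[i] + sum_j A[i, j] * w[j],
# rearranged to (I - A) @ w = b. A[i, j] is an assumed interaction
# strength; every row sum is < 1, so (I - A) is invertible.
A = np.array([
    [0.0, 0.3, 0.1, 0.0],
    [0.3, 0.0, 0.2, 0.0],
    [0.1, 0.2, 0.0, 0.4],
    [0.0, 0.0, 0.4, 0.0],
])
b = np.array([0.5, 0.4, 0.6, 0.2])  # assumed local (purity-style) contributions

# np.linalg.solve performs LU factorization, i.e. Gaussian elimination
# with partial pivoting, as named in the text.
w = np.linalg.solve(np.eye(4) - A, b)
print(w)
```

Because each row sum of A is below 1, the system has a unique solution; items with stronger or more numerous interactions receive higher weights through the propagation.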
The rules produced by our extended Valency model are evaluated by a number of different schemes. Firstly, we assess the impact of item weighting on rule quality by conducting a comparison with a standard rule generator that does not use item weighting. Secondly, we use Principal Components Analysis to evaluate the rules produced by our extended Valency model. The use of this evaluation scheme was motivated by the fact that none of the popularly used interest measures, such as Confidence and Lift, was able to distinguish rules containing highly weighted items from those containing lower weighted ones. Thirdly, we conduct two case studies on the Zoo and Soybean datasets and analyze the characteristics of the top ranked cliques of items generated by each of the models.
The rest of the paper is organized as follows. In the next section, we look at previous work in the area of weighted association rule mining. In Section 3 we give a formal definition of the weighted association rule mining problem. Section 4 presents an overview of the basic Valency model. We present our extended Valency model in Section 5. Section 6 discusses the evaluation scheme which we used to assess the performance of our extended Valency model. The results of our empirical study are presented in Section 7. Finally we summarize our research contributions in Section 8 and outline directions for future work.
Background
The classical association rule mining scheme while having outstanding success in many different application domains fails in certain important situations. This is primarily due to its strict adherence to the support and confidence framework. As such, traditional Apriori-like approaches do not deal satisfactorily with the rare items problem [10]. Items which are rare but co-occur together with high confidence levels are unlikely to reach the minimum support threshold and are therefore pruned
The Weighted Association Rule Mining (WARM) problem
In weighted association rule mining a weight w(i) is assigned to each item i, reflecting the relative importance of an item over other items that it is associated with. The weighted support of an item i is ws(i) = w(i) × s(i), where s(i) denotes the ordinary support of i. As in traditional association rule mining, a weighted support threshold and a confidence threshold are used to measure the strength of the association rules produced. The weight of a k-itemset, X, is given by the average weight of its items: w(X) = (1/k) Σ_{i∈X} w(i). Here a k-itemset, X, is considered an interesting
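A minimal sketch of these quantities, using the common WARM formulation in which an itemset's weight is the average of its item weights and the weighted support is that weight multiplied by the ordinary support; the weights and transactions below are invented:

```python
# Assumed item weights and a toy transaction database.
weights = {"a": 0.9, "b": 0.6, "c": 0.2}
db = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"a"}]

def itemset_weight(X):
    """w(X): average weight of the items in X."""
    return sum(weights[i] for i in X) / len(X)

def support(X):
    """s(X): fraction of transactions containing X."""
    return sum(X <= t for t in db) / len(db)

def weighted_support(X):
    """ws(X) = w(X) * s(X)."""
    return itemset_weight(X) * support(X)

print(weighted_support({"a", "b"}))  # 0.75 * 0.5 = 0.375
```

An itemset is then retained when its weighted support clears the chosen threshold, so heavily weighted but less frequent itemsets can survive where plain support pruning would discard them.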
The Valency model and its extensions
The Valency model, as proposed by Koh et al. is based on the intuitive notion that an item should be weighted based on the strength of its connections to other items as well as the number of items that it is connected with. Two items are said to be connected if they have occurred together in at least one transaction. Items that appear often together when compared to their individual support have a high degree of connectivity and are thus weighted higher. Given two items i and k that co-occur
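The co-occurrence counting that underlies connectivity can be sketched as follows. Since the exact connectivity formula is truncated above, the ratio used here (a pair's co-occurrence count over an item's own count) is an illustrative stand-in, and the transaction data is invented:

```python
from collections import Counter
from itertools import combinations

# Toy transaction database (invented).
db = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]

# Count individual item occurrences and pairwise co-occurrences.
item_count = Counter(i for t in db for i in t)
pair_count = Counter(
    frozenset(p) for t in db for p in combinations(sorted(t), 2)
)

def connectivity(i, k):
    """How often i co-occurs with k, relative to i's own frequency."""
    return pair_count[frozenset((i, k))] / item_count[i]

print(connectivity("a", "b"))  # "a" appears 4 times, 3 of them with "b"
```

Pairs that appear together nearly as often as they appear individually score close to 1, matching the intuition above that such items are strongly connected.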
The extended valency weighting scheme: a neighborhood based weight propagation scheme
In this section we start by discussing an alternative counting scheme for determining purity. We show that the counting scheme employed by the original Valency model tends to devalue the effect of purity in a high-dimensional, high-density data environment.
Next we propose a more generalized scheme for determining item weight. We show that the model reduces to a linear system that can be solved efficiently using standard linear solution methods. We also show that the linear model has a number of desirable
Rule base evaluation methodology
The quality of association rule bases is judged using a variety of different metrics that have been proposed in the literature. In our first set of experiments we use standard metrics such as null invariance measures to evaluate the quality of the rules produced by the extended Valency model. In the rest of our experimentation we concentrate on a PCA-based measure proposed by Koh et al. that quantifies rules by using the eigen feature vectors to
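As a rough sketch of a PCA-style rule evaluation (not the exact measure of Koh et al., whose definition is truncated above), each rule can be encoded as a binary item-membership vector and projected onto the leading eigenvector of the covariance matrix; all data below are invented:

```python
import numpy as np

items = ["a", "b", "c", "d"]
rules = [{"a", "b"}, {"a", "c"}, {"b", "c", "d"}, {"a", "b", "d"}]

# Binary item-membership matrix: one row per rule, one column per item.
M = np.array([[float(i in r) for i in items] for r in rules])

centered = M - M.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

pc1 = eigvecs[:, -1]        # leading principal component
scores = centered @ pc1     # projection of each rule onto PC1
print(scores)
```

Rules whose projections lie far from the origin along the leading components account for more of the variance in the rule base, which is the kind of separation the interest measures mentioned above fail to provide.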
Empirical study
Our empirical study is divided into three main parts. Firstly, we compare the extended Valency model with the original Valency model. We make use of PCA rule quantification as described in the previous section for this purpose. We also conduct two detailed case studies on the Zoo and Soybean datasets to contrast the weight rankings produced by the two schemes and to study the implications of widening the field of interaction (as done with the Extended Valency model) on item ranking.
Secondly, we
Conclusions and future work
This research has demonstrated the viability of inferring item weights given the nature of interactions between the items. We significantly enhanced the Valency model and proved formally that it has a number of desirable properties. Our experimentation showed that enhancements such as expanding the field of interaction to include item neighborhoods contributed to an increase in the quality of the rules produced when compared to the basic Valency model. Although our aim in enhancing the Valency
References (26)
- A framework for mining interesting high utility patterns with a strong frequency affinity, Information Sciences (2011)
- Centrality in social networks: conceptual clarification, Social Networks (1979)
- Novel alarm correlation analysis system based on association rules mining in telecommunication networks, Information Sciences (2010)
- Efficient mining of weighted interesting patterns with a strong weight and/or support affinity, Information Sciences (2007)
- R. Agrawal, T. Imielinski, A.N. Swami, Mining association rules between sets of items in large databases, in: P....
- Mining association rules in very large clustered domains, Information Systems (2007)
- Label propagation and quadratic criterion
- Mining association rules with weighted items
- Finding interesting association rules without support pruning, IEEE Transactions on Knowledge and Data Engineering (2001)
- Automatic item weight generation for pattern mining and its application, International Journal of Data Warehousing and Data Mining (2011)
- Valency based weighted association rule mining
- Finding sporadic rules using Apriori-inverse
- Mining weighted association rules, Intelligent Data Analysis