Weighted association rule mining via a graph based connectivity model
Introduction
Association rule mining was introduced by Agrawal et al. [1] and is widely used to derive meaningful rules. It aims to extract interesting correlations, frequent patterns and associations among sets of items in transaction databases. Although the original motivation for association rules was to analyze supermarket transaction data, they have since been applied across a wide range of application areas, including bioinformatics, text mining, web usage mining, telecommunications, medical disease diagnosis and education.
However, classical rule mining techniques such as Apriori and its variants are vulnerable to the rule explosion problem. Rule explosion occurs because the number of possible combinations of rule terms grows exponentially with the number of items in the dataset. This places a burden on the decision maker, who has to sift through a large number of rules in order to find those that are of interest. One simple way of dealing with this problem is to set the rule support threshold high enough to produce a more manageable rule base, but this has the major drawback of omitting rare rules that occur with low support but high confidence. For example, consider the rule {SD15_sold = high, MD15_bought = very low} → {fraudster}, which was uncovered in an online auction environment, where SD15_sold refers to the standard deviation of prices of items sold during the last 15 days of trading and MD15_bought refers to the median price of items bought during the same period. This rule states that fraudsters are associated with a high degree of price fluctuation in items sold in the last 15 days of trading while at the same time having very low or no buying activity over the same period. Fraudsters are rare in an online auction environment, but when fraud does occur it leads to financial loss for the buyer, as the cash is sent online in advance. Knowledge of such fraud patterns enables auction houses to monitor such traders closely and minimize the possibility that they could perpetrate fraud.
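The tension described above can be made concrete with a small sketch. The toy transaction database below is invented for illustration: the fraud rule holds in only 10% of transactions (low support), so any support threshold above 0.1 would prune it, yet it is never contradicted (confidence 1.0).

```python
# Hypothetical toy data: a rare rule with low support but full confidence.
transactions = (
    [{"SD15_sold=high", "MD15_bought=very_low", "fraudster"}] * 2
    + [{"SD15_sold=low", "MD15_bought=high"}] * 18
)

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """support(antecedent ∪ consequent) / support(antecedent)."""
    return support(antecedent | consequent, db) / support(antecedent, db)

antecedent = {"SD15_sold=high", "MD15_bought=very_low"}
consequent = {"fraudster"}
print(support(antecedent | consequent, transactions))   # 0.1
print(confidence(antecedent, consequent, transactions))  # 1.0
```

A minimum support threshold of, say, 0.2 would discard this rule outright, regardless of its perfect confidence.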
There are two basic approaches that have been used by decision makers to assist association rule mining algorithms in producing interesting rules. The first method uses rule constraints such as aggregation constraints [14], [15]. For example, a user interested in mining dairy products may specify that all rules containing frequent itemsets involving dairy products whose total value exceeds $100 be returned. The use of such constraints can be effective in situations where prior domain-specific knowledge is available in advance of the mining process.
Another very different approach to filtering interesting rules is to attach weights to items [5], [18], [23], [25]. High weights are attached to items of high importance, such as high-profit items in a market basket scenario. When such domain-specific knowledge is available, it provides an accurate representation of the domain. However, many application domains exist where such knowledge is either unavailable or impractical to obtain. Furthermore, even if the weights are known in advance, the very specification of these weights constrains the rules generated to encapsulate only known patterns, thus inhibiting new insights from being uncovered.
Recent research has shown that it is possible to deduce the relative importance of items based on their interactions with each other [9]. The weight assignment process is underpinned by a “Valency model” that was proposed by Koh et al. [9]. The model considers two factors: purity and connectivity. The purity of an item is determined by the number of items that it is associated with over the entire transaction database, whereas connectivity represents the strength of the interactions between items. The Valency scheme does not assume any prior knowledge over and above what is needed by the classical Apriori approach. The weighting scheme produced by the Valency model was able to distinguish high value rules from those with lesser value. However, the major limitation of this scheme is that it only takes into account the immediate locality when assigning weights to items. While immediate locality is a good starting point, we believe that the field of interaction needs to extend well beyond the immediate neighborhood in order to capture the chain of interactions between items in the transaction graph before an item’s weight can truly be determined. Our results in Section 7 show that this is indeed the case when we compare our extended version of the Valency model with the original version.
In this paper we generalize and extend the Valency model in a number of different ways. Firstly, we expand the field of interaction of an item to include non-neighboring items. The weight of any given item is recursively defined as a linear function of the weights of all items that it interacts with. This forms a system of linear equations, which is solved by Gaussian elimination. Secondly, we redefine the manner in which Purity, one of the key elements of the original Valency model, is computed. Thirdly, we subject the new model to extensive experimentation with both real-world and synthetic data. The use of synthetic data allows us to control key parameters in the data, which in turn enables us to demonstrate key properties of the extended model.
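The recursive weight definition can be sketched as a small linear system. In this hypothetical example (the interaction matrix A and the base contributions b are invented for illustration, not taken from the paper), each item's weight is a local base value plus a weighted sum of the weights of the items it interacts with, i.e. w = b + Aw, which rearranges to (I − A)w = b:

```python
import numpy as np

# Hypothetical 4-item system: w[i] = b[i] + sum_j A[i, j] * w[j],
# rearranged to (I - A) @ w = b. A[i, j] is an assumed interaction
# strength; every row sum is < 1, so (I - A) is invertible.
A = np.array([
    [0.0, 0.3, 0.1, 0.0],
    [0.3, 0.0, 0.2, 0.0],
    [0.1, 0.2, 0.0, 0.4],
    [0.0, 0.0, 0.4, 0.0],
])
b = np.array([0.5, 0.4, 0.6, 0.2])  # assumed local (purity-style) contributions

# np.linalg.solve performs LU factorization, i.e. Gaussian elimination
# with partial pivoting, as named in the text.
w = np.linalg.solve(np.eye(4) - A, b)
print(w)
```

Because each row sum of A is below 1, the system has a unique solution; items with stronger or more numerous interactions receive higher weights through the propagation.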
The rules produced by our extended Valency model are evaluated by a number of different schemes. Firstly, we assess the impact of item weighting on rule quality by conducting a comparison with a standard rule generator that does not use item weighting. Secondly, we use Principal Components Analysis to evaluate the rules produced by our extended Valency model. The use of this evaluation scheme was motivated by the fact that none of the popularly used interest measures, such as Confidence and Lift, was able to distinguish rules containing highly weighted items from those containing lower weighted ones. Thirdly, we conduct two case studies on the Zoo and Soybean datasets and analyze the characteristics of the top ranked cliques of items generated by each of the models.
The rest of the paper is organized as follows. In the next section, we look at previous work in the area of weighted association rule mining. In Section 3 we give a formal definition of the weighted association rule mining problem. Section 4 presents an overview of the basic Valency model. We present our extended Valency model in Section 5. Section 6 discusses the evaluation scheme which we used to assess the performance of our extended Valency model. The results of our empirical study are presented in Section 7. Finally we summarize our research contributions in Section 8 and outline directions for future work.
Background
The classical association rule mining scheme while having outstanding success in many different application domains fails in certain important situations. This is primarily due to its strict adherence to the support and confidence framework. As such, traditional Apriori-like approaches do not deal satisfactorily with the rare items problem [10]. Items which are rare but co-occur together with high confidence levels are unlikely to reach the minimum support threshold and are therefore pruned
The Weighted Association Rule Mining (WARM) problem
In weighted association rule mining a weight w(i) is assigned to each item i, reflecting the relative importance of an item over other items that it is associated with. The weighted support of an item i is ws(i) = w(i) × s(i), where s(i) denotes the ordinary support of i. As in traditional association rule mining, a weighted support threshold and a confidence threshold are used to measure the strength of the association rules produced. The weight of a k-itemset, X, is given by the average weight of its items: w(X) = (1/k) Σ_{i∈X} w(i). Here a k-itemset, X, is considered an interesting
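A minimal sketch of these quantities, using the common WARM formulation in which an itemset's weight is the average of its item weights and the weighted support is that weight multiplied by the ordinary support; the weights and transactions below are invented:

```python
# Assumed item weights and a toy transaction database.
weights = {"a": 0.9, "b": 0.6, "c": 0.2}
db = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"a"}]

def itemset_weight(X):
    """w(X): average weight of the items in X."""
    return sum(weights[i] for i in X) / len(X)

def support(X):
    """s(X): fraction of transactions containing X."""
    return sum(X <= t for t in db) / len(db)

def weighted_support(X):
    """ws(X) = w(X) * s(X)."""
    return itemset_weight(X) * support(X)

print(weighted_support({"a", "b"}))  # 0.75 * 0.5 = 0.375
```

An itemset is then retained when its weighted support clears the chosen threshold, so heavily weighted but less frequent itemsets can survive where plain support pruning would discard them.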
The Valency model and its extensions
The Valency model, as proposed by Koh et al. is based on the intuitive notion that an item should be weighted based on the strength of its connections to other items as well as the number of items that it is connected with. Two items are said to be connected if they have occurred together in at least one transaction. Items that appear often together when compared to their individual support have a high degree of connectivity and are thus weighted higher. Given two items i and k that co-occur
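The co-occurrence counting that underlies connectivity can be sketched as follows. Since the exact connectivity formula is truncated above, the ratio used here (a pair's co-occurrence count over an item's own count) is an illustrative stand-in, and the transaction data is invented:

```python
from collections import Counter
from itertools import combinations

# Toy transaction database (invented).
db = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]

# Count individual item occurrences and pairwise co-occurrences.
item_count = Counter(i for t in db for i in t)
pair_count = Counter(
    frozenset(p) for t in db for p in combinations(sorted(t), 2)
)

def connectivity(i, k):
    """How often i co-occurs with k, relative to i's own frequency."""
    return pair_count[frozenset((i, k))] / item_count[i]

print(connectivity("a", "b"))  # "a" appears 4 times, 3 of them with "b"
```

Pairs that appear together nearly as often as they appear individually score close to 1, matching the intuition above that such items are strongly connected.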
The extended valency weighting scheme: a neighborhood based weight propagation scheme
In this section we start by discussing an alternative counting scheme for determining purity. We show that the counting scheme employed by the original Valency model tends to devalue the effect of purity in a high-dimensional, high-density data environment.
Next we propose a more generalized scheme for determining item weight. We show that the model reduces to a linear system that can be solved efficiently using standard linear solution methods. We also show that the linear model has a number of desirable
Rule base evaluation methodology
The quality of association rule bases is judged using a variety of different metrics that have been proposed in the literature. In our first set of experiments we use standard metrics such as null invariance measures to evaluate the quality of the rules produced by the extended Valency model. In the rest of our experimentation we concentrate on a PCA-based measure proposed by Koh et al. that quantifies rules by using the eigen feature vectors to
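As a rough sketch of a PCA-style rule evaluation (not the exact measure of Koh et al., whose definition is truncated above), each rule can be encoded as a binary item-membership vector and projected onto the leading eigenvector of the covariance matrix; all data below are invented:

```python
import numpy as np

items = ["a", "b", "c", "d"]
rules = [{"a", "b"}, {"a", "c"}, {"b", "c", "d"}, {"a", "b", "d"}]

# Binary item-membership matrix: one row per rule, one column per item.
M = np.array([[float(i in r) for i in items] for r in rules])

centered = M - M.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

pc1 = eigvecs[:, -1]        # leading principal component
scores = centered @ pc1     # projection of each rule onto PC1
print(scores)
```

Rules whose projections lie far from the origin along the leading components account for more of the variance in the rule base, which is the kind of separation the interest measures mentioned above fail to provide.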
Empirical study
Our empirical study is divided into three main parts. Firstly, we compare the extended Valency model with the original Valency model. We make use of PCA rule quantification as described in the previous section for this purpose. We also conduct two detailed case studies on the Zoo and Soybean datasets to contrast the weight rankings produced by the two schemes and to study the implications of widening the field of interaction (as done with the Extended Valency model) on item ranking.
Secondly, we
Conclusions and future work
This research has demonstrated the viability of inferring item weights given the nature of interactions between the items. We significantly enhanced the Valency model and proved formally that it has a number of desirable properties. Our experimentation showed that enhancements such as expanding the field of interaction to include item neighborhoods contributed to an increase in the quality of the rules produced when compared to the basic Valency model. Although our aim in enhancing the Valency
References (26)
- A framework for mining interesting high utility patterns with a strong frequency affinity, Information Sciences (2011)
- Centrality in social networks: conceptual clarification, Social Networks (1979)
- Novel alarm correlation analysis system based on association rules mining in telecommunication networks, Information Sciences (2010)
- Efficient mining of weighted interesting patterns with a strong weight and/or support affinity, Information Sciences (2007)
- R. Agrawal, T. Imielinski, A.N. Swami, Mining association rules between sets of items in large databases, in: P....
- Mining association rules in very large clustered domains, Information Systems (2007)
- Label propagation and quadratic criterion
- Mining association rules with weighted items
- Finding interesting association rules without support pruning, IEEE Transactions on Knowledge and Data Engineering (2001)
- Automatic item weight generation for pattern mining and its application, International Journal of Data Warehousing and Data Mining (2011)
- Valency based weighted association rule mining
- Finding sporadic rules using Apriori-inverse
- Mining weighted association rules, Intelligent Data Analysis