An efficient algorithm for mining high utility itemsets with negative item values in large databases

doi:10.1016/j.amc.2009.05.066

Applied Mathematics and Computation

Volume 215, Issue 2, 15 September 2009, Pages 767-778

https://doi.org/10.1016/j.amc.2009.05.066 Get rights and content

Abstract

Utility itemsets typically consist of items with different values such as utilities, and the aim of utility mining is to identify the itemsets with highest utilities. In the past studies on utility mining, the values of utility itemsets were considered as positive. In some applications, however, an itemset may be associated with negative item values. Hence, discovery of high utility itemsets with negative item values is important for mining interesting patterns like association rules. In this paper, we propose a novel method, namely HUINIV (High Utility Itemsets with Negative Item Values)-Mine, for efficiently and effectively mining high utility itemsets from large databases with consideration of negative item values. To the best of our knowledge, this is the first work that considers the concept of negative item values in utility mining. The novel contribution of HUINIV-Mine is that it can effectively identify high utility itemsets by generating fewer high transaction-weighted utilization itemsets such that the execution time can be reduced substantially in mining the high utility itemsets. In this way, the process of discovering all high utility itemsets with consideration of negative item values can be accomplished effectively with less requirements on memory space and CPU I/O. This meets the critical requirements of temporal and spatial efficiency for mining high utility itemsets with negative item values. Through experimental evaluation, it is shown that HUINIV-Mine outperforms other methods substantially by generating much less candidate itemsets under different experimental conditions.

Introduction

Mining of association rules in large databases is a well studied technique in the field of data mining with typical methods like Apriori [1], [2]. The problem surrounding association rules mining can be decomposed into two steps. The first step involves finding all frequent itemsets (or large itemsets) in a database. Once the frequent itemsets are found, generating association rules is straightforward and can be accomplished in linear time.

Most methods in finding frequent itemsets are designed for traditional databases. However, the frequency of an itemset may not be a sufficient indicator of significance, because frequency reflects only the number of transactions in the database that contain that itemset. It does not reveal the utility of an itemset, which can be measured in terms of cost, profit, or other expressions of user preference. On the other hand, frequent itemsets may only contribute a small portion of the overall profit, whereas non-frequent itemsets may contribute a large portion of the profit. In reality, a retail business may be most interested in identifying its most valuable customers (customers who contribute a major fraction of the profits to the company). Hence, frequency is not sufficient to answer questions such as whether an itemset is highly profitable, or whether an itemset has a strong impact.

Utility mining is thus useful in a wide range of practical applications and was recently studied in [7], [14], [20], [22]. However, a retail business may sale item with negative value. For example, many super markets may promote certain items to attract customers. In this scenario customers may buy specific items and then receive free goods. Free goods result in negative value for super markets. However, supermarkets may earn higher profits from other items that are cross-promoted with these free items. This practice is common. For example, if a customer bought three of item A, he would then receive one free item B as a promotion from the supermarket. Suppose the supermarket gets five dollars of profit from each unit of item A sold, and loses two dollars for each unit of item B given away. Although giving away a unit of item B results in a loss of two dollars for supermarkets, they could possibly earn 15 dollars from the three units of item A that are cross-promoted with item B. The supermarket thus may have a net gain of 13 dollars from this promotion. This example demonstrates why we propose the concept of mining for negative item values in utility mining. This also motivates our research in developing a new scheme for finding high utility itemsets with negative item values (HUINIV) from large databases.

Recently, a utility mining model has been defined in [22]. Utility is a measure of how “useful” (i.e. “profitable”) an itemset is. The definition of the utility of an itemset X, u(X), states that it is equal to the sum of the utilities of X of all the transactions containing X. The goal of utility mining is to identify high utility itemsets, which drive a large portion of the total utility. Traditional association rules of mining models assume that the utility of each item is always 1 and that the quantity of sales is either 0 or 1; thus it is only a special case of utility mining in which the utility or the quantity of sales of each item can be any number. If u(X) is greater than a specified utility threshold, X is a high utility itemset; otherwise, it is a low utility itemset. Table 1 is an example of utility mining in a transaction database. The number associated with each transaction in Table 1a is the sales volume of each item, and the utility of each item is listed in Table 1b. For example, u({A, D}) = (1 × 5 + 2 × 6) + (3 × 5+5 × 6) + (3 × 5 + 1 × 6) = 83. {A, D} is a high utility itemset if the utility threshold is set at 80.

However, a high utility itemset may consist of low utility items. Another possibility is to adopt the level-wise searching schema that exists in fast algorithms, such as Apriori [3]. This algorithm does not apply to the utility mining model. For example, u(A) = 55 < 80, A is a low utility item, but its superset {A, D} is a high utility itemset. If Apriori is used to find high utility itemsets, all combinations of all items must be generated. Moreover, in order to discover a long pattern, the number of candidates is prohibitively large. The cost in terms of either computation time or memory is intolerable, regardless of the method utilized. The challenge of utility mining is not only in restricting the size of the candidate set but also in simplifying the computation used to calculate its utility. Another challenge of utility mining is finding high utility itemsets with negative item values from large databases.

In this paper, we explore the issue of efficiently mining high utility itemsets with negative item values in large databases. We propose an algorithm named HUINIV (High Utility Itemsets with Negative Item Values)-Mine that can discover high utility itemsets with negative item values from large databases both efficiently and effectively. The underlying idea behind the HUINIV-Mine algorithm is based on the principle of the Two-Phase algorithm [14] and augments with negative item value for mining high utility itemsets efficiently. The novel contribution of HUINIV-Mine is that it can efficiently identify the utility of itemsets in large database so that the execution time for producing high utility itemsets with negative item values can be substantially reduced. That is, HUINIV-Mine can discover high utility itemsets with negative item values using limited memory and comparatively less computation time by the candidate itemsets filter method. In this way, the process of discovering all high utility itemsets in which all transactions are negative can be achieved effectively with limited memory, less candidate itemsets, and CPU I/O. This meets the critical requirements of time and spatial efficiency for mining large databases. Through experimental evaluation, HUINIV-Mine is shown to produce fewer candidate itemsets in the process of finding high utility itemsets with negative item values, so it outperforms other methods in terms of efficiency. We found that the average improvement of HUINIV-Mine compared to the MEU algorithm is about 99.2%. Moreover, it also achieves high scalability in dealing with large databases. To the best of our knowledge, this is the first work to propose a negative item concept in utility mining and the first work on mining high utility itemsets with negative item values from large database.

The rest of this paper is organized as follows: Section 2 provides an overview of related work. Section 3 describes the proposed approach, HUINIV-Mine, for finding the high utility itemsets with negative item values. In Section 4, we describe our experimental results for evaluating the proposed method. The conclusion of the paper is provided in Section 5.

Section snippets

Related work

Apriori [3], DHP [16], Partition [13], and Anti-Skew Algorithms [17] have been proposed to find frequent itemsets using association rules mining. Many important applications have established the need for incremental mining. This is due to the increasing use of record-based databases to which data are being added continuously. Many algorithms like FUP [8], FUP₂ [9] and UWEP [4], [5] have been proposed to find frequent itemsets in incremental databases. The FUP algorithm updates the association

Proposed method: HUINIV-Mine

In this section, we present the HUINIV-Mine method. Section 3.1 describes the basic concept of HUINIV-Mine. Section 3.2 gives an example of mining temporal high utility itemsets. The procedure of the HUINIV-Mine algorithm is provided in Section 3.3.

Experimental evaluation

To evaluate the performance of HUINIV-Mine, we conducted experiments using synthetic datasets generated via a randomized dataset generator provided by IBM Quest [3]. However, the IBM Quest data generator only generates quantities of 0 or 1 for each item in a transaction. In order to fit databases into the scenario of utility mining, we randomly generate the quantity of each item in each transaction, ranging from 1 to 5; much like the model used in [14], [20]. Utility tables are also

Conclusions

In this paper, we addressed the problem of discovering high utility itemsets with negative item values in large databases, i.e., the itemsets containing negative item values that are larger than threshold in large databases. We propose a new approach, namely HUINIV-Mine, which can identify high utility itemsets with negative item values in large databases both efficiently and effectively. The novel contribution of HUINIV-Mine is that it can effectively identify high utility itemsets with

Acknowledgement

This research was supported by National Science Council, Taiwan, ROC under Grant No. NSC 97-2631-H-006-001 and NSC 96-2221-E-006-143-MY3.

References (22)

R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in Proceedings...
R. Agrawal et al.
Fast discovery of association rules
R. Agrawal, R. Srikant, Mining sequential patterns, in: Proceedings of the 11th International Conference on Data...
N.F. Ayn, A.U. Tansel, E. Arun, An efficient algorithm to update large itemsets with early pruning, Technical Report...
N.F. Ayn, A.U. Tansel, E. Arun, An efficient algorithm to update large itemsets with early pruning, in: Proceedings of...
C. Bettini, X.S. Wang, S. Jajodia, Testing complex temporal relationships involving multiple granularities and its...
R. Chan, Q. Yang, Y. Shen, Mining high utility itemsets, in: Proceedings of IEEE ICDM, Florida,...
D. Cheung, J. Han, V. Ng, C.Y. Wong, Maintenance of discovered association rules in large databases: an incremental...
D. Cheung, S.D. Lee, B. Kao, A general incremental technique for updating discovered association rules, in: Proceedings...
Y. Chi, H. Wang, P.S. Yu, R. Richard, Muntz: moment: maintaining closed frequent itemsets over a stream sliding window,...

G. Das, K.I. Lin, H. Mannila, G. Renganathan, P. Smyth, Rule discovery from time series, in: Proceedings of the 4th ACM...

Cited by (77)

An efficient method for mining High-Utility itemsets from unstable negative profit databases
2024, Expert Systems with Applications
The study of High-Utility Itemset Mining (HUIM) and Frequent Itemset Mining (FIM) is crucial since it explains consumer behavior and offers actionable advice to improve business results. HUIM algorithms have been successfully established to identify high-utility itemsets, including those with negative utilities. The problem with these approaches is that they presume incorrectly that items with negative utility across transactions would always be losses. Products with positive profitability may seem negative when combined with other items to increase sales or reduce inventory. Using strict upper-bound approaches, this paper presents strategies for making database scanning more efficient and reducing the number of prospective candidates. We also prove that it is correct to use the proposed upper-bounds for pruning on several types of items in the database. Based on all the proposed solutions, we develop a novel algorithm to solve this problem efficiently. To demonstrate their efficiency, the algorithms are tested against states-of-art HUIM algorithm on diverse datasets with regard to size and characteristics with unstable negative profits.
Mining high-utility sequences with positive and negative values
2023, Information Sciences
Sequence pattern discovery is a fundamental topic in the domain of data mining. It has been widely used to solve various problems (e.g., behavior pattern discovery, gene pattern discovery in bioinformatics, user click pattern mining, etc.). High-utility sequence mining as a novel hot issue is more challenging and has generally attracted plenty of attention. Our paper focuses on mining high-utility sequences in a more complicated environment with high efficiency. Most of the previous methods for utility mining aim to find high-utility sequences suitable for items with positive values, but most real-world situations contain items with both positive and negative values. Several algorithms have been applied to the above sophisticated situation and can be used as our comparing algorithms. In this paper, we introduce the FHUSN (Fast mining High Utility Sequences with Negative item) algorithm to mine high-utility sequences in situations with or without negative utility values. FHUSN utilizes the new utility array to store data. Several new pruning strategies that apply to situations with or without negative values have been used to reduce search space. Experiments are carried out on several benchmark datasets, and experimental results illustrate that our method has better performance.
HLHUI: An improved version of local high utility itemset mining
2023, Procedia Computer Science
High utility itemsets (HUIs) have been emerged to address the main problems of frequent itemset mining, namely considering the same importance for all items of the dataset and ignoring the occurrence numbers of items within transactions during the mining process. Local and peak HUIs were defined to mine the itemsets which are useful and high utility during specific periods of time. In this paper, using some adopted definitions and strategies of HMiner [19], an improved version of LHUI method [33], called HLHUI (Hminer-based Local HUI mining), is introduced that mines local HUIs using a utility-list-based approach. Performance evaluations of the proposed method show that it can efficiently find useful itemsets.
EHMIN: Efficient approach of list based high-utility pattern mining with negative unit profits
2022, Expert Systems with Applications
High-utility pattern mining is an important sub-literature in the data mining literature. This literature discusses the discovery of useful pattern information from large databases by considering not only supports of patterns but also profits and quantities of items. This literature has the potential to be applied to various problems in the real world, so many methods for the improvement of the algorithm performance have been studied. Moreover, there have also been attempts to extend the flexibility of this literature. The traditional approaches in this literature considered the positive unit profits of items in a given database only. However, this literature can take extended flexibility into account by considering negative as well as positive unit profits of the items. In this paper, we suggest an efficient approach for mining high-utility patterns with negative unit profits. Moreover, the experimental performance tests, which are performed on various real and synthetic datasets in this paper, show that the proposed algorithm has a better performance than the state-of-the-art methods in this literature in terms of the runtime, memory usage, and scalability.
TKN: An efficient approach for discovering top-k high utility itemsets with positive or negative profits
2022, Information Sciences
Citation Excerpt :
Yet, these algorithms can not handle items with negative utilities. In 2009, HUINIV-Mine [5] was proposed based on the Two-Phase paradigm to process items with negative unit profits. This method presented a modified twu pruning property to diminish the itemsets tree and boost the mining process.
Top-k high utility itemsets (HUIs) mining permits discovering the required number of patterns - k, without having an optimal minimum utility threshold (i.e., minimum profit). Multiple top-k HUIs mining algorithms have been introduced with interesting results. However, these algorithms focus mainly on mining patterns from positive profit datasets, while few preliminary studies can handle datasets with negative profits. Moreover, conventional top-k HUI mining algorithms, that are meant for exploring positive profit datasets, perform poorly when mining top-k HUIs on highly dense and large datasets. In this paper, we propose TKN (efficiently mining Top-K HUIs with positive or Negative profits) which employs generalized and adaptive techniques to mine both positive and negative profit datasets effectively. The proposed approach adopts transactional projection and merging mechanisms to decrease the dataset traversing cost. Furthermore, several pruning and threshold elevating ideas are utilized to significantly narrow the exploration space. To highlight the reliability of the devised TKN, a series of extensive comparisons were conducted using two versions of six real datasets. The obtained results reveal that TKN is clearly superior in finding the required number of patterns, whether on positive or negative profit datasets, compared to the current cutting-edge competitors.
Mining Top-k High On-shelf Utility Itemsets Using Novel Threshold Raising Strategies
2024, ACM Transactions on Knowledge Discovery from Data

View all citing articles on Scopus

View full text