Towards personalized recommendation by two-step modified Apriori data mining algorithm

https://doi.org/10.1016/j.eswa.2007.08.048Get rights and content

Abstract

In this paper a new method towards automatic personalized recommendation based on the behavior of a single user in accordance with all other users in web-based information systems is introduced. The proposal applies a modified version of the well-known Apriori data mining algorithm to the log files of a web site (primarily, an e-commerce or an e-learning site) to help the users to the selection of the best user-tailored links. The paper mainly analyzes the process of discovering association rules in this kind of big repositories and of transforming them into user-adapted recommendations by the two-step modified Apriori technique, which may be described as follows. A first pass of the modified Apriori algorithm verifies the existence of association rules in order to obtain a new repository of transactions that reflect the observed rules. A second pass of the proposed Apriori mechanism aims in discovering the rules that are really inter-associated. This way the behavior of a user is not determined by “what he does” but by “how he does”. Furthermore, an efficient implementation has been performed to obtain results in real-time. As soon as a user closes his session in the web system, all data are recalculated to take the recent interaction into account for the next recommendations. Early results have shown that it is possible to run this model in web sites of medium size.

Introduction

In recent years, companies have concentrated on understanding the needs and expectations of their customers and grouping the existing and potential customers into classes with the purpose of improving the efficiency of their marketing strategies and increasing their market share (Saglam, Salman, Sayin, & Türkay, 2006). Personalization has become a reality and is possible by using efficient methods of data mining and knowledge discovery (Kim & Cho, 2007). To date, a variety of recommendation techniques has been developed (Cho, Kim, & Kim, 2002). Through analyzing user related information, it is possible to make a more accurate analysis of customer’s interest or preference. In most cases, recommendation can be classified according to (1) whether customers for whom we want recommendations are all customers or selective customers, (2) whether the objective of recommendation is to predict how much a particular customer will like a particular product, or to identify a list of products that will be of interest to a given customer (top-N recommendation problem), and (3) whether the recommendation is accomplished at a specific time or persistently.

The abundance of large data collections and the need to extract hidden knowledge within them has triggered the development of algorithms to detect unknown patterns in data sets (Han & Kamber, 2001). A paper (Ozmutlu, Spink, & Ozmutlu, 2002) reports results from a study using Poisson sampling to develop a sampling strategy to demonstrate how sample sets selected by Poisson sampling statistically effectively represent the characteristics of the entire data set. Moreover, clustering analysis is a data mining technique developed for the purpose of identifying groups of entities that are similar to each other with respect to certain similarity measures. In the past, different ways to discover groups using clustering techniques have been proposed (Schafer, Konstan, & Riedl, 2001). Very often, they are based on different definitions of similarity measure to represent the closeness between users. Users can also be grouped based on the transactions they perform (Wang, Lim, & Hwang, 2006). In Perkowitz and Etzioni (2000) a cluster mining algorithm – an unsupervised algorithm for efficiently identifying a small set of high-quality (and possibly overlapping) clusters with limited coverage – is introduced.

Nevertheless, the existing researches could not afford to give a formal way for capturing individual customer’s preference or associations among products through web usage mining. Given a set of transactions where each transaction is a set of items (itemset), an association rule implies the form X  Y, where X and Y are itemsets; X and Y are called the body and the head, respectively. The support for the association rule X  Y is the percentage of transactions that contain both itemset X and Y among all transactions. The confidence for the rule X  Y is the percentage of transactions that contain itemset Y among transaction that contain itemset X. The support represents the usefulness of the discovered rule and the confidence represents certainty of the rule. Association rule mining is the discovery of all association rules that are above a user-specified minimum support minsup and minimum confidence minconf (Tseng & Lin, 2007). Apriori algorithm is one of the prevalent techniques used to find association rules (Agrawal et al., 1993, Agrawal and Srikant, 1994). Apriori operates in two phases. In the first phase, all itemsets with minimum support (frequent itemsets) are generated. This phase utilizes the downward closure property of support. In other words, if an itemset of size k is a frequent itemset, then all the itemsets below (k  1) size must also be frequent itemsets. Using this property, candidate itemsets of size k are generated from the set of frequent itemsets of size (k  1) by imposing the constraint that all subsets of size (k  1) of any candidate itemset must be present in the set of frequent itemsets of size (k  1). The second phase of the algorithm generates rules from the set of all frequent itemsets.

Association rule mining, as originally proposed in Agrawal et al. (1993) with its Apriori algorithm, has developed into an active research area. Association rule discovery and classification are analogous tasks in data mining, with the exception that classification main aim is the prediction of class labels, while association rule mining discovers associations between attribute values in a data set (Thabtah, Cowling, & Hammoud, 2006). Many additional algorithms have been proposed for association rule mining (e.g. Pujari, 2001, Lin and Kedem, 2002). End users of association rule mining tools encounter several well-known problems in practice. First, the algorithms do not always return the results in a reasonable time. A fuzzy mining algorithm based on the AprioriTid approach to find fuzzy association rules from given quantitative transactions has been proposed for reduced time complexity (Hong, Kuo, & Wang, 2004). Further, the association rules sets are sometimes very large. In Palshikar, Kale, and Apte (2007) a concept called a heavy itemset is proposed to compactly represent the association rules. An algorithm named as BitTableFI (Dong & Han, 2007) has significant difference from the Apriori and all other algorithms extended from Apriori. It compresses the database into BitTable, and with the special data structure, candidate itemsets generation and support count can be performed quickly. Also, mining association rules with multiple minimum supports is an important generalization of the association rule mining problem. Instead of setting a single minimum support threshold for all items, (Liu, Hsu, & Ma, 1999) allow users to specify multiple minimum supports to reflect the natures of the items, and an Apriori-based algorithm, named MSapriori, is developed to mine all frequent itemsets. In a recent paper (Hu & Chen, 2006), the same problem is suited but with two additional improvements. Another approach is the cluster-based association rule (CBAR) method (Tsay & Chiang, 2005), aimed to create cluster tables by scanning the database once, and then clustering the transaction records to the kth cluster table, where the length of a record is k. Moreover, the large itemsets are generated by contrasts with the partial cluster tables. This not only prunes considerable amounts of data reducing the time needed to perform data scans and requiring less contrast, but also ensures the correctness of the mined results. In Lee, Hong, and Lin (2005) another point of view about defining the minimum supports of itemsets when items have different minimum supports is provided. The maximum constraint is used, and then a simple algorithm based on the Apriori approach to find the large-itemsets and association rules under this constraint is introduced. Another interesting proposal is to utilize methods and techniques from Information Retrieval (IR) in order to assist data mining functions (Kouris, Makris, & Tsakalidis, 2005).

In this paper a new method towards automatic personalized recommendation based on the behavior of a single user in accordance with all other users in web-based information systems is introduced. The proposal applies a modified version of the well-known Apriori data mining algorithm to the log files of a web site to guide the users to the selection of the best user-tailored links. The paper mainly analyzes the process of discovering association rules in this kind of big repositories and of transforming them into user-adapted recommendations.

Section snippets

The Apriori2 approach

As stated before, the discovery of association rules between items of a transaction set has been undertaken by many researchers. However, when defining user behavior patterns for one service (like a supermarket, an e-learning site, or simply any website) we should not only base on the analysis of the items composing their transactions. For instance, the behavior of a user in a website can not be well-measured by the pages he visits (what); we also need to know the way the user visits these

Conclusions

In this paper a new method towards automatic personalized recommendation based on the behavior of a single user in accordance with all other users in web-based information systems has been introduced.

The proposal introduces a modified version of the well-known Apriori data mining algorithm to the log files of a web site to guide the users towards the selection of the best user-tailored links. The paper mainly analyzes the process of discovering association rules in this kind of big repositories

Acknowledgement

This work is supported in part by the Spanish Junta de Comunidades de Castilla-La Mancha PAI06-0093 grant.

References (22)

Cited by (84)

  • Corrosion main control factors and corrosion degree prediction charts in H<inf>2</inf>S and CO<inf>2</inf> coexisting associated gas pipelines

    2022, Materials Chemistry and Physics
    Citation Excerpt :

    After generating frequent itemsets, the rules that meet the confidence threshold are extracted by confidence based pruning technology. Combined with Pearson correlation coefficient method and Apriori algorithm, the main control factors of corrosion and mining and the correlation of main control factors of corrosion were determined, as shown in Fig. 5 [14,33]. High-temperature and high-pressure dynamic reactor experiment was used to verify the association rules of main corrosion control factors.

  • Automatic CV processing for scientific research using data mining algorithm

    2020, Journal of King Saud University - Computer and Information Sciences
  • CD-CARS: Cross-domain context-aware recommender systems

    2019, Expert Systems with Applications
    Citation Excerpt :

    In order to alleviate this problem, some techniques could be used. For example, using association rule mining (Lazcorreta, Botella, and Fernandez-Caballero, 2008; Soysal, 2015) to discover usage patterns between different domains and contexts (e.g. we could infer that users who like to read romance books on weekdays also like to watch romance movies on weekdays). Thus, we could make enhancement of the category preferences tensor by using association rules to infer other item categories preferred by the users according to the possible contexts.

View all citing articles on Scopus
View full text