Towards personalized recommendation by two-step modified Apriori data mining algorithm

doi:10.1016/j.eswa.2007.08.048

Expert Systems with Applications

Volume 35, Issue 3, October 2008, Pages 1422-1429

https://doi.org/10.1016/j.eswa.2007.08.048

Abstract

In this paper a new method towards automatic personalized recommendation is introduced, based on the behavior of a single user in relation to all other users of a web-based information system. The proposal applies a modified version of the well-known Apriori data mining algorithm to the log files of a web site (primarily an e-commerce or e-learning site) to help users select the best user-tailored links. The paper mainly analyzes the process of discovering association rules in large repositories of this kind and of transforming them into user-adapted recommendations by the two-step modified Apriori technique, which may be described as follows. A first pass of the modified Apriori algorithm verifies the existence of association rules in order to obtain a new repository of transactions that reflect the observed rules. A second pass of the proposed Apriori mechanism aims at discovering the rules that are really inter-associated. In this way the behavior of a user is determined not by “what he does” but by “how he does it”. Furthermore, an efficient implementation has been developed to obtain results in real time. As soon as a user closes his session in the web system, all data are recalculated to take the recent interaction into account for the next recommendations. Early results have shown that it is possible to run this model on web sites of medium size.

Introduction

In recent years, companies have concentrated on understanding the needs and expectations of their customers and on grouping existing and potential customers into classes, with the purpose of improving the efficiency of their marketing strategies and increasing their market share (Saglam, Salman, Sayin, & Türkay, 2006). Personalization has become a reality and is possible by using efficient methods of data mining and knowledge discovery (Kim & Cho, 2007). To date, a variety of recommendation techniques has been developed (Cho, Kim, & Kim, 2002). By analyzing user-related information, it is possible to assess a customer’s interests or preferences more accurately. In most cases, recommendation can be classified according to (1) whether the customers for whom we want recommendations are all customers or selected customers, (2) whether the objective of recommendation is to predict how much a particular customer will like a particular product, or to identify a list of products that will be of interest to a given customer (top-N recommendation problem), and (3) whether the recommendation is accomplished at a specific time or persistently.

The abundance of large data collections and the need to extract hidden knowledge from them have triggered the development of algorithms to detect unknown patterns in data sets (Han & Kamber, 2001). Ozmutlu, Spink, and Ozmutlu (2002) report results from a study that uses Poisson sampling to develop a sampling strategy and to demonstrate that sample sets selected in this way statistically represent the characteristics of the entire data set. Moreover, clustering analysis is a data mining technique developed for the purpose of identifying groups of entities that are similar to each other with respect to certain similarity measures. In the past, different ways to discover groups using clustering techniques have been proposed (Schafer, Konstan, & Riedl, 2001). Very often, they are based on different definitions of the similarity measure used to represent the closeness between users. Users can also be grouped based on the transactions they perform (Wang, Lim, & Hwang, 2006). In Perkowitz and Etzioni (2000) a cluster mining algorithm – an unsupervised algorithm for efficiently identifying a small set of high-quality (and possibly overlapping) clusters with limited coverage – is introduced.

Nevertheless, existing research has not provided a formal way of capturing an individual customer’s preferences, or associations among products, through web usage mining. Given a set of transactions, where each transaction is a set of items (an itemset), an association rule is an implication of the form X ⇒ Y, where X and Y are itemsets; X and Y are called the body and the head, respectively. The support of the association rule X ⇒ Y is the percentage of all transactions that contain both itemset X and itemset Y. The confidence of the rule X ⇒ Y is the percentage of transactions containing itemset Y among the transactions that contain itemset X. Support represents the usefulness of the discovered rule and confidence represents its certainty. Association rule mining is the discovery of all association rules whose support and confidence are at or above a user-specified minimum support minsup and minimum confidence minconf (Tseng & Lin, 2007).

The Apriori algorithm is one of the prevalent techniques used to find association rules (Agrawal et al., 1993; Agrawal & Srikant, 1994). Apriori operates in two phases. In the first phase, all itemsets with minimum support (frequent itemsets) are generated. This phase utilizes the downward closure property of support: if an itemset of size k is frequent, then all of its subsets, of size (k − 1) and below, must also be frequent. Using this property, candidate itemsets of size k are generated from the set of frequent itemsets of size (k − 1) by imposing the constraint that all subsets of size (k − 1) of any candidate itemset must be present in the set of frequent itemsets of size (k − 1). The second phase of the algorithm generates rules from the set of all frequent itemsets.
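The definitions of support and confidence and the two phases of classic Apriori can be sketched in a few lines of Python. This is a minimal illustration of the textbook algorithm only, not the two-step modified version proposed in the paper; the function names and the page-visit transactions are hypothetical, and support is recomputed by a full scan for clarity rather than speed.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(body, head, transactions):
    """support(body and head together) / support(body) for the rule body => head."""
    both = frozenset(body) | frozenset(head)
    return support(both, transactions) / support(body, transactions)

def frequent_itemsets(transactions, minsup):
    """Phase 1: level-wise search for all itemsets with support >= minsup."""
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items
             if support([i], transactions) >= minsup}
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s, transactions)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into candidates of size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (downward closure): every (k-1)-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        level = {c for c in candidates
                 if support(c, transactions) >= minsup}
    return frequent

def generate_rules(frequent, minconf):
    """Phase 2: derive rules body => head from the frequent itemsets."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for body in combinations(sorted(itemset), r):
                body = frozenset(body)
                # body is frequent by downward closure, so the lookup is safe.
                conf = sup / frequent[body]
                if conf >= minconf:
                    rules.append((body, itemset - body, sup, conf))
    return rules

# Hypothetical sessions: each transaction is the set of pages one user visited.
transactions = [
    frozenset({"home", "catalog", "cart"}),
    frozenset({"home", "catalog"}),
    frozenset({"home", "cart"}),
    frozenset({"catalog", "cart"}),
    frozenset({"home", "catalog", "cart"}),
]
frequent = frequent_itemsets(transactions, minsup=0.5)
rules = generate_rules(frequent, minconf=0.7)
```

In this toy data each single page appears in four of five sessions and each pair in three, so the three-page itemset survives the prune step as a candidate but is rejected because its support (two of five) falls below minsup.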

Association rule mining, as originally proposed in Agrawal et al. (1993) with its Apriori algorithm, has developed into an active research area. Association rule discovery and classification are analogous tasks in data mining, with the exception that the main aim of classification is the prediction of class labels, while association rule mining discovers associations between attribute values in a data set (Thabtah, Cowling, & Hammoud, 2006). Many additional algorithms have been proposed for association rule mining (e.g. Pujari, 2001; Lin & Kedem, 2002). End users of association rule mining tools encounter several well-known problems in practice. First, the algorithms do not always return results in a reasonable time. A fuzzy mining algorithm based on the AprioriTid approach, which finds fuzzy association rules from given quantitative transactions, has been proposed to reduce time complexity (Hong, Kuo, & Wang, 2004). Further, the resulting sets of association rules are sometimes very large. In Palshikar, Kale, and Apte (2007) a concept called a heavy itemset is proposed to represent the association rules compactly. An algorithm named BitTableFI (Dong & Han, 2007) differs significantly from Apriori and from all other algorithms that extend it. It compresses the database into a BitTable, and with this special data structure candidate itemset generation and support counting can be performed quickly. Also, mining association rules with multiple minimum supports is an important generalization of the association rule mining problem. Instead of setting a single minimum support threshold for all items, Liu, Hsu, and Ma (1999) allow users to specify multiple minimum supports to reflect the nature of the items, and an Apriori-based algorithm, named MSapriori, is developed to mine all frequent itemsets. In a recent paper (Hu & Chen, 2006), the same problem is addressed with two additional improvements.

Another approach is the cluster-based association rule (CBAR) method (Tsay & Chiang, 2005), which creates cluster tables by scanning the database once and then assigning each transaction record to the kth cluster table, where k is the length of the record. The large itemsets are then generated by comparison against the partial cluster tables. This not only prunes considerable amounts of data, reducing the time needed to perform data scans and requiring fewer comparisons, but also ensures the correctness of the mined results. Lee, Hong, and Lin (2005) provide another point of view on defining the minimum supports of itemsets when items have different minimum supports. The maximum constraint is used, and a simple algorithm based on the Apriori approach is introduced to find the large itemsets and association rules under this constraint. Another interesting proposal is to utilize methods and techniques from Information Retrieval (IR) in order to assist data mining functions (Kouris, Makris, & Tsakalidis, 2005).
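Two of the ideas surveyed above are easy to illustrate in Python. The sketch below is not any of the published algorithms. It only shows (a) how an itemset's threshold can be derived from per-item minimum supports, under the assumption that MSapriori takes the lowest item threshold while the maximum constraint takes the highest, and (b) how partitioning records by length, as described for CBAR, lets support counting skip records too short to contain a candidate. All names and values are hypothetical.

```python
from collections import defaultdict

def itemset_threshold(itemset, item_minsup, constraint="min"):
    """Minimum support required of `itemset`, given per-item minimum supports.

    constraint="min": lowest item threshold (assumed MSapriori-style rule).
    constraint="max": highest item threshold (the maximum constraint).
    """
    thresholds = [item_minsup[i] for i in itemset]
    return min(thresholds) if constraint == "min" else max(thresholds)

def cluster_tables(transactions):
    """One pass over the data: put each record in the table for its length."""
    tables = defaultdict(list)
    for t in transactions:
        tables[len(t)].append(frozenset(t))
    return dict(tables)

def count_support(candidate, tables):
    """Count records containing `candidate`, skipping tables of shorter records."""
    candidate = frozenset(candidate)
    return sum(
        candidate <= record
        for length, records in tables.items()
        if length >= len(candidate)  # shorter records cannot contain it
        for record in records
    )

# Hypothetical per-item thresholds: a rare item gets a lower minimum support.
item_minsup = {"bread": 0.30, "caviar": 0.05}
low = itemset_threshold({"bread", "caviar"}, item_minsup, "min")
high = itemset_threshold({"bread", "caviar"}, item_minsup, "max")

tables = cluster_tables([{"a"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c"}])
pair_count = count_support({"a", "b"}, tables)
```

The choice of constraint matters for rare items: under the lowest-threshold rule an itemset containing a rare item is kept at that item's low threshold, whereas under the maximum constraint it must meet the threshold of its most demanding item.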

The Apriori2 approach

As stated before, the discovery of association rules between items of a transaction set has been undertaken by many researchers. However, when defining user behavior patterns for a service (such as a supermarket, an e-learning site, or simply any website), we should not rely only on the analysis of the items composing the users' transactions. For instance, the behavior of a user in a website cannot be measured well by the pages he visits (what) alone; we also need to know the way the user visits these

Conclusions

The proposal applies a modified version of the well-known Apriori data mining algorithm to the log files of a web site to guide users towards the selection of the best user-tailored links. The paper mainly analyzes the process of discovering association rules in large repositories of this kind

Acknowledgement

This work is supported in part by the Spanish Junta de Comunidades de Castilla-La Mancha PAI06-0093 grant.

References (22)

Y.H. Cho et al. (2002). A personalized recommender system based on web usage mining and decision tree induction. Expert Systems with Applications.
J. Dong et al. (2007). BitTableFI: An efficient mining frequent itemsets algorithm. Knowledge-Based Systems.
T.P. Hong et al. (2004). A fuzzy AprioriTid mining algorithm with reduced computational time. Applied Soft Computing.
Y.H. Hu et al. (2006). Mining association rules with multiple minimum supports: A new mining algorithm and a support tuning mechanism. Decision Support Systems.
K.J. Kim et al. (2007). Personalized mining of web documents using link structures and fuzzy concept networks. Applied Soft Computing.
I.N. Kouris et al. (2005). Using Information Retrieval techniques for supporting data mining. Data & Knowledge Engineering.
Y.C. Lee et al. (2005). Mining association rules with multiple minimum supports using maximum constraints. International Journal of Approximate Reasoning.
H.C. Ozmutlu et al. (2002). Analysis of large data logs: An application of Poisson sampling on excite web queries. Information Processing and Management.
G.K. Palshikar et al. (2007). Association rules mining using heavy itemsets. Data & Knowledge Engineering.
M. Perkowitz et al. (2000). Towards adaptive Web sites: Conceptual framework and case study. Artificial Intelligence.
B. Saglam et al. (2006). A mixed-integer programming approach to the clustering problem with an application in customer segmentation. European Journal of Operational Research.
