Abstract

Discovering access patterns from web log data is a typical sequential pattern mining application, and many access pattern mining algorithms have been proposed. In this paper, we propose an improved version of the Gap-BIDE algorithm to extract user access patterns from web log data. Compared with the original Gap-BIDE algorithm, the proposed algorithm adds a process of getting a large event set: before generating candidate patterns, it finds the frequent events by discarding the infrequent events that do not occur often enough within the access periods. In the experiment, we compare previous access pattern mining algorithms with the proposed one, which shows that our approach is very efficient in discovering access patterns in large databases.

1. Introduction

The web has become an important channel for conducting business transactions and e-commerce. It also provides a convenient means for us to communicate with each other worldwide. With the rapid development of web technology, the web has become an important and preferred platform for distributing and acquiring information. The data collected automatically by web servers and application servers represent the navigational behavior of web users, and such data are called web log data.

Web mining is a technology for discovering and extracting useful information from web log data. Because of the tremendous growth of information sources, the increasing interest of various research communities, and the recent interest in e-commerce, the area of web mining has become vast and more interesting. It deals with data related to the web, such as data hidden in web contents, data presented on web pages, and data stored on web servers. Based on the kinds of data, there are three categories of web mining: web content mining, web structure mining, and web usage mining [1]. Web usage data include the data from web server access logs, proxy server logs, and browser logs; such data are also known as web access patterns. Web usage mining tries to discover the access patterns from web log files. Web access tracking can be defined as web page history [2]; the mining task is a process of extracting interesting patterns from web access logs. There are many techniques for mining web usage data, including statistical analysis [3], association rules [4], sequential patterns [5–7], classification [8–10], and clustering [11–13]. Access pattern mining is a popular application of sequential pattern mining, which extracts frequent subsequences from a sequence database [14]. Further, discovering access patterns is an important challenge in the field of web mining, and a popular application of access pattern mining is obtaining useful information about web users’ behavior.

Many studies have been proposed on access pattern mining for finding valuable knowledge from web log data, such as the AprioriAll algorithm [15, 16] and the GSP (generalized sequential pattern) algorithm [17]. All of the above algorithms mine sequential patterns using a candidate generate-and-test paradigm and maintain a candidate set of already mined patterns during the mining process. When the data set is huge, they generate a lot of candidate patterns; in other words, the GSP algorithm needs much memory when the data set is large. The BIDE algorithm [18] mines frequent patterns without keeping a candidate pattern set, and therefore it needs less space during the mining task. However, the above algorithms focus on finding patterns whose events are adjacent, and they may miss some hidden relationships among noncontiguous patterns, so a gap constraint should be considered. In [19], the authors proposed an improved BIDE algorithm (Gap-BIDE) for mining closed sequential patterns with a gap constraint, which considers patterns that are not only adjacent but also noncontiguous; the Gap-BIDE algorithm was applied to web mining in [20]. In our previous work [21], we improved the Gap-BIDE algorithm by discarding infrequent events before generating frequent candidate events, applied the improved algorithm to access pattern mining, and discussed the effect of the gap parameter values. In this paper, we run the improved algorithm and compare its efficiency with previous access pattern mining algorithms, such as the GSP algorithm.

The rest of this paper is organized as follows. Section 2 presents our algorithm and compares it with the original algorithm. Section 3 describes the process of discovering access patterns, namely, preprocessing, pattern discovery, and result analysis, and discusses the efficiency of the proposed approach in terms of access pattern mining. In Section 4, we present an extensive performance study. Finally, we conclude this study in Section 5.

2. Algorithm of Improved Gap-BIDE

2.1. Gap-BIDE Algorithm

The Gap-BIDE algorithm was presented in [19], and it inherits the same design philosophy as the BIDE algorithm. It shares the same merit, that is, it does not need to maintain a candidate pattern set, which saves space, and it can find hidden relationships among the patterns that conform to the gap constraint.

The algorithm first finds the set of all length-1 frequent patterns, and it then mines the gap-constrained closed sequential patterns with each such pattern P as the prefix. In this process, it first scans the backward spaces of the prefix pattern P and uses the gap-constrained backscan pruning method to prune the search space; it then scans the forward spaces of prefix P and uses the gap-constrained pattern closure checking scheme to check whether or not pattern P is closed; finally, it scans each forward space of all appearances of pattern P, finds the set of all locally frequent items F, uses each item in F to extend P, and mines the gap-constrained closed sequential patterns for each new prefix by calling the subroutine again.

In the algorithm, the forward space is defined as follows: given an appearance of pattern P under gap constraint [M, N], described by the triple (sid, beginPos, endPos), the forward space of the appearance is the part of sequence sid within the range [endPos + M + 1, endPos + N + 1] ∩ [endPos, L), where L is the length of sequence sid. The definition of the forward space (FS) is introduced for getting frequent subsequence patterns. We can get the sequence support of every subsequence by scanning the forward spaces of the appearances of a prefix pattern. The subsequences whose supports are greater than or equal to the minimal support threshold MinSup become the frequent subsequence patterns of the prefix pattern.

The definition of the backward space (BS) is also important: given an appearance of pattern P under gap constraint [M, N], described by the triple (sid, beginPos, endPos), the backward space of the appearance is the part of sequence sid within the range [beginPos − N − 1, beginPos − M − 1] ∩ [0, beginPos).
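To make the two definitions concrete, the following is a minimal Python sketch (our own illustration, not the authors' implementation) that extracts the forward and backward space of one appearance. The names sequences (a mapping from sid to a list of events), appearance, M, and N are assumptions based on the notation above.

def forward_space(sequences, appearance, M, N):
    # appearance = (sid, begin_pos, end_pos) of one occurrence of the pattern
    sid, _begin, end = appearance
    seq = sequences[sid]
    lo = end + M + 1                      # earliest position allowed by the gap
    hi = min(end + N + 1, len(seq) - 1)   # latest position, bounded by the sequence end
    return seq[lo:hi + 1] if lo <= hi else []

def backward_space(sequences, appearance, M, N):
    sid, begin, _end = appearance
    seq = sequences[sid]
    lo = max(begin - N - 1, 0)            # earliest position, bounded by the sequence start
    hi = begin - M - 1                    # latest position allowed by the gap
    return seq[lo:hi + 1] if lo <= hi else []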

The performance study in [19] shows that Gap-BIDE is both runtime and space efficient in mining frequent closed sequences with gap constraints.

2.2. Improved Gap-BIDE Algorithm

Although the Gap-BIDE algorithm is advanced among sequential pattern mining algorithms, a lot of unnecessary work is still done during the mining task, such as generating candidate patterns for infrequent events in the original data set. To avoid this unnecessary memory use, an improved algorithm is proposed. Our algorithm is designed based on the Gap-BIDE algorithm; the main idea is to discard infrequent events before generating frequent candidate events, and we call this process getting a large event set.

Algorithm 1 is the main algorithm. Algorithm 2 is a subroutine of Algorithm 1; it presents the process of getting a large event set. A large event set (LES) is an event set that contains the events satisfying a user-specified minimum support threshold. The events in the LES represent the transactions or objects with a large proportion in the entire data set. In this paper, a web log file is the data set, and one web page is defined as an event; thus, the LES denotes the set of web pages that are accessed by web users with enough frequency in a period of time. In the mining process, generating sequences through the LES reduces the amount of test data, which improves the efficiency and accuracy of the mining task. After the large event set is obtained, sequence data containing only large events are generated. The algorithm then scans the generated database, finds the set of all length-1 frequent patterns, and calls Algorithm 3 iteratively. Algorithm 3, patternGrowth(P), is the other subroutine of Algorithm 1; it presents the process of mining the gap-constrained closed sequential patterns with pattern P as the prefix.

ALGORITHM: gap-Bide (SDB, t_session, min_sup_les,
min_sup, M, N)
INPUT: (1) SDB: an input sequence database with time, (2)
t_session: the time of a user session, (3) min_sup_les: the
minimum support threshold for getting the large event set, (4)
min_sup: the minimum support threshold for getting closed
sequential patterns, (5) M and N: the parameters of a gap
constraint.
OUTPUT: the set of gap-constrained closed sequential patterns.
(1) call getLargeEventSet (SDB, t_session, min_sup_les);
(2) select the sequences from the input database containing only events in LES
(3) find the set of length-1 frequent sequential patterns, F1;
(4) for each pattern P in F1
(5) call patternGrowth(P);
(6) return

ALGORITHM: getLargeEventSet (SDB, t_session,
min_sup_les)
INPUT: (1) SDB: an input sequence database with time, (2)
t_session: the time of a user session, (3) min_sup_les: the
minimum support threshold for getting the large event set.
OUTPUT: LES: the large event set.
(7) scan the sequence database; find all candidate events [e1,
e2, …, en]
(8) group the sequences by IP address and t_session; find all
sessions [s1, s2, …, sm]
(9) for each candidate event e in the sessions
(10) calculate the support of e
(11) if (support of e ≥ min_sup_les)
(12) output event e to LES
(13) return

ALGORITHM: patternGrowth (P)
INPUT: (1) P: a prefix sequence pattern.
OUTPUT: the set of gap-constrained closed sequential
patterns with prefix P.
(14) backward_check(P, needPruning, hasBackwardExtension);
(15) if (needPruning)
(16) return;
(17) forward_check(P, hasForwardExtension);
(18) if !(hasBackwardExtension || hasForwardExtension)
(19) output pattern P;
(20) search each forward space of all appearances of P, and
 find the set of all locally frequent items, F;
(21) for each item e in F
(22) build new pattern P′ = P + e;
(23) call patternGrowth(P′);
(24) return.
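As a concrete illustration of the getLargeEventSet step in Algorithm 2, the following Python sketch (a simplified version under our own assumptions, not the paper's implementation) counts event supports over user sessions and keeps the events that reach min_sup_les.

from collections import defaultdict

def get_large_event_set(sessions, min_sup_les):
    # sessions: list of user sessions, each a list of accessed events (page codes)
    support = defaultdict(int)
    for session in sessions:
        for event in set(session):          # count an event once per session
            support[event] += 1
    # keep only the events whose support reaches the threshold
    return {event for event, sup in support.items() if sup >= min_sup_les}

# Hypothetical usage: with min_sup_les = 3, only events that appear in at
# least three of the four sessions are kept.
sessions = [[6, 7, 2], [6, 7], [6, 1, 7], [3, 6]]
print(get_large_event_set(sessions, 3))     # {6, 7}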

An important definition for generating the LES is the user session. A user session is the activity that a user with a unique IP address performs on a web page during a specified period of time; it can be used to identify continuous accesses and to compute user visit statistics. The specified period of time is determined via a cookie, also known as a web cookie or HTTP cookie, which can be set by the server with or without an expiration date and modified by the web designer; in this work it is set to a default value of 600 seconds. Within the expiration period, the access of the web user is effective.
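One common way of realizing this definition is sketched below (our own illustration, not the paper's code): records are assumed to be (ip, timestamp, url) tuples with timestamps in seconds, and a user's session is closed once the gap between two consecutive requests exceeds the 600-second cookie lifetime.

from collections import defaultdict

def split_sessions(records, timeout=600):
    by_ip = defaultdict(list)
    for ip, ts, url in records:
        by_ip[ip].append((ts, url))

    sessions = []
    for ip, visits in by_ip.items():
        visits.sort()                        # order one user's requests by time
        current, last_ts = [], None
        for ts, url in visits:
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append((ip, current))   # the previous session expired
                current = []
            current.append(url)
            last_ts = ts
        if current:
            sessions.append((ip, current))
    return sessions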

3. Discovery of Access Patterns

In this section, the process of the mining task is discussed.

3.1. Data Preprocessing

Web log files reside on web servers and record the activities of clients who access the web server via a web browser. Traditionally, there have been many types of web log files, including error logs, access logs, and referrer logs. In this paper, the data in the web access log are defined as the raw data. The web access log records all requests that are processed by the web server. The data in the log file contain some missing-value data and irrelevant attributes, so they cannot be directly used for the mining task. In this section, we describe the process of data cleaning and attribute selection to remove unwanted data (a minimal cleaning sketch follows this list).
(1) Data cleaning: removing irrelevant data.
(a) Remove the records with URLs of jpg, png, gif, js, css, and so on, which are automatically generated when a web page is requested.
(b) Remove the records with error status codes that start with the digits 4 or 5. These records are caused by request or server errors, for example, the HTTP client errors 400 Bad Request and 404 Not Found and the HTTP server errors 500 Internal Server Error and 505 HTTP Version Not Supported.
(c) Discard missing-value data that are caused by a web page breaking while loading.
(2) Attribute selection: removing the irrelevant attributes. There are many attributes in one record of a web log file. In this paper, we need the attributes of IP Address, Time, and URL; thus, the rest of the attributes, such as method, status, and size, are discarded.
(3) Transform URLs into code numbers.
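The cleaning and attribute-selection rules above can be sketched as a simple record filter; the field names ip, time, url, and status used here are our own assumptions about the parsed log layout, not the paper's actual data schema.

STATIC_SUFFIXES = ('.jpg', '.png', '.gif', '.js', '.css')

def clean_record(record):
    # record: dict parsed from one access-log line
    url = record.get('url', '')
    status = str(record.get('status', ''))
    if not url or not record.get('ip') or not record.get('time'):
        return None                                  # (c) missing-value data
    if url.lower().endswith(STATIC_SUFFIXES):
        return None                                  # (a) auto-requested resources
    if status.startswith(('4', '5')):
        return None                                  # (b) client/server errors
    return {'ip': record['ip'], 'time': record['time'], 'url': url}   # (2) keep three attributes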

It is difficult to distinguish the requested URLs of web log data among thousands of records, while there are typically only dozens of kinds of web pages in those records, so the URLs can be transformed into code numbers for simplicity. For example, for web log data that come from the server of the website http://www.vtsns.edu.rs/, 31 different kinds of web pages have been accessed. We transform their URLs into code numbers, such as galerija.php → 1, nenastavno_osoblje.php → 15, and rezultati_ispita.php → 21.
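A small sketch of this transformation assigns code numbers to distinct URLs in order of first appearance; the resulting numbering is illustrative and need not match the paper's actual mapping.

def encode_urls(urls):
    codes = {}
    for url in urls:
        codes.setdefault(url, len(codes) + 1)    # first-seen URL gets the next number
    return codes

# e.g. encode_urls(['galerija.php', 'upis_prva.php', 'galerija.php'])
# -> {'galerija.php': 1, 'upis_prva.php': 2}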

We choose a set of data from a web log file as example data. After data preprocessing, we get the clean data shown in Table 1.

3.2. Process of Discovering Access Patterns

In this section, we present the process of discovering access patterns with an example.

After data preprocessing, we apply the algorithm to the web log data. To generate the LES, the data in Table 1 are first sorted by the attributes of IP Address and Time; here, the time of a user session is defined as one hour for simplicity. These data are then grouped by one hour for each web user; the sorted data are shown in Table 2.

Then, we calculate the support of each event. For example, one of the events occurs three times, namely, in “82.117.202.158” at time 2, in “82.208.207.41” at time 2, and in “82.208.255.125” at time 2. After calculating the support of each event, the candidate event set is obtained as shown in Table 3.

Finally, a user-specified minimum support threshold (MinSup) must be defined. MinSup denotes a kind of abstraction level, that is, a degree of generalization. Choosing MinSup is important: if it is low, we get detailed events; if it is high, we get only general events. In this example, MinSup is defined as 75%; in other words, if a web page is accessed by at least 75% of the web users, then this web page is denoted as a large event. After the process of getting the large event set, the LES is obtained as shown in Table 4.

After the LES is obtained, the infrequent events are removed from Table 2, and the remaining events are transformed into a set of tuples (sequence identifier, sequence). We define the IP Address as the sequence identifier and the events accessed in a session as a sequence. The resulting sequence set is shown in Table 5.
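As an illustration, building the (sequence identifier, sequence) tuples of Table 5 from the sorted sessions can be sketched as follows; this is our own code, with the IP address as the identifier and the infrequent events dropped.

def build_sequence_db(sessions, les):
    # sessions: list of (ip, [page codes]); les: the large event set from Table 4
    seq_db = []
    for ip, pages in sessions:
        seq = [p for p in pages if p in les]     # keep only large events
        if seq:
            seq_db.append((ip, seq))
    return seq_db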

Then, we call the original Gap-BIDE procedure to find the frequent sequential patterns and prune them. Here, the gap is defined as g(M, N), where M is the value of the minimum gap and N is the value of the maximum gap. A pattern under g(M, N) can be expressed as P[M, N]. This notation is similar to the description of timing constraints with mingap and maxgap: under g(M, N), any two consecutive events of a pattern must be separated by at least M and at most N other events in the sequence.
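The gap constraint can be checked as in the following minimal sketch, which counts the sequence support of a pattern under g(M, N); it is a plain recursive search written for illustration, not the pruning-based enumeration that Gap-BIDE actually performs.

def matches_with_gap(sequence, pattern, M, N, start=0, idx=0):
    # start: first position where pattern[idx] may be matched
    if idx == len(pattern):
        return True
    if idx == 0:
        lo, hi = start, len(sequence)                        # first event may appear anywhere
    else:
        lo, hi = start + M, min(len(sequence), start + N + 1)  # M..N intervening events
    for pos in range(lo, hi):
        if sequence[pos] == pattern[idx] and \
           matches_with_gap(sequence, pattern, M, N, pos + 1, idx + 1):
            return True
    return False

def gap_support(seq_db, pattern, M, N):
    # number of sequences in the database containing the pattern under g(M, N)
    return sum(1 for _sid, seq in seq_db if matches_with_gap(seq, pattern, M, N))

For example, gap_support(seq_db, [6, 7], M, N) would give the support of the pattern [6, 7] discussed below for a chosen gap constraint.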

After calling our improved algorithm, we get the closed patterns as shown in Table 6.

Useful information can be found in the experimental result. The relationships of web pages are known easily, and user behavior information is shown directly. Each number in the output sequential patterns represents a web page requested by a web user. For example, the numbers 6 and 7 represent the web pages ispit_raspored_god.php and upis_prva.php, respectively. The closed sequential pattern [6, 7] shown in Table 6 means that 75% (3 out of 4 user sessions) of the web users who access web page upis_prva.php tend to visit web page ispit_raspored_god.php first. According to the relationship between these two web pages, the design of the web pages can be improved; for example, the web designer can add a hyperlink into web page ispit_raspored_god.php that points to web page upis_prva.php. This approach can be applied in many areas. For instance, in an electronic shopping cart, when customers complete their shopping, hyperlinks can be added to the final web page that point to related web pages according to the mining result of the purchase history. When web users watch a movie, hyperlinks that point to web pages of related movies on the site can be presented.

4. Experimental Result and Analysis

4.1. Effect of the Parameter in the Process of Getting a Large Event Set

The process of getting a large event set aims at extracting the events that satisfy a user-defined minimum support of the large event set. It discards the infrequent events to reduce the size of the experimental database, which reduces the search space and time while maintaining the accuracy of the whole mining task. To evaluate the effect of this parameter, we compare the numbers of large events obtained with different values of the minimum support of the large event set (MSLE). In this experiment, the experimental data record the access information of the website http://www.vtsns.edu.rs/, which is an institution's official website. The number of original records in the web log file is 5999, and after data preprocessing there are 269 user sessions in the records. The experimental result is shown in Figure 1. We can see that the smaller the minimum support is, the larger and more detailed the obtained LES becomes. There always exists a value of the minimum support beyond which the number of large events does not change, or changes very little; this value is selected as the minimum support in the experiments.
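A minimal sketch of this sensitivity check, reusing the get_large_event_set sketch given in Section 2.2 (the threshold values below are illustrative only):

def large_event_counts(sessions, thresholds):
    # number of large events obtained for each candidate value of MSLE
    return {t: len(get_large_event_set(sessions, t)) for t in thresholds}

# e.g. large_event_counts(sessions, range(5, 55, 5)) traces how the size of
# the LES shrinks as the minimum support of the large event set grows.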

4.2. Comparing with Original Gap-BIDE Algorithm

In this section, we compare our algorithm with the original Gap-BIDE algorithm [19]. The experimental data come from Internet Information Server (IIS) logs for msnbc.com and news-related portions of msn.com for the entire day of September 28, 1999. Each sequence in the dataset corresponds to the page views of a user during that twenty-four-hour period, and each event in a sequence corresponds to a user's request for a page. There are 989,818 anonymous user sessions, and we choose the test data from them by simple random sampling without replacement. In the experiment, we define the minimum support threshold of the large event set as 20, the minimum support of closed sequential patterns as 10, and a fixed value of the gap. We implemented the experiment on a 2.40-GHz Pentium PC machine with 4.00 GB main memory and ran the algorithm in Python 2.7 with JDK 1.6.0. The experimental result is shown in Figure 2; it shows that, when our proposed algorithm is applied, the cost of time is less than that of the original Gap-BIDE algorithm.

4.3. Comparing with GSP Algorithm

The previous experiment has shown that our proposed algorithm is more effective than the original Gap-BIDE algorithm when the algorithms are applied to discovering access patterns. In this section, we want to show that our proposed algorithm is also more effective than a previous access pattern mining algorithm. To validate this, we compare our algorithm with the GSP algorithm proposed in [17] in an experiment. The experimental data again come from the Internet Information Server (IIS) logs for msnbc.com and news-related portions of msn.com for the entire day of September 28, 1999, and we choose the test data by simple random sampling without replacement. In the experiment, we define the minimum support of closed sequential patterns as 10, and the experimental result is shown in Figure 3. It shows that, when our proposed algorithm is applied to a large database, the cost of time is less than that of the GSP algorithm.

5. Conclusion

In this paper, we presented the application of an improved Gap-BIDE algorithm for discovering closed sequential patterns in web log data. We improved the algorithm by discarding all infrequent events before generating the frequent candidate events. In the process of data preprocessing, we removed the irrelevant attributes, transformed URLs into code numbers for simplicity, and removed the missing-value data to improve the quality of the data. To obtain experimental data for the mining task, we transformed the web log data into sequences based on the time constraint; the value of time is determined by the expiration date of the cookies. As a result, we obtained web access patterns that express the order in which web pages were accessed, based on the Gap-BIDE algorithm. Compared with the previous web mining approaches, the proposed approach achieves the best performance owing to the process of getting a large event set of the sequences: it reduces the sequences to get more effective and accurate results. We performed experiments to compare our algorithm with previous algorithms; they show that our algorithm uses less time than the original Gap-BIDE algorithm and costs less time than the GSP algorithm in discovering access patterns in a large database. In future work, we will try to find a more efficient algorithm for mining closed gap-constrained sequential patterns and a more efficient way of transforming web log files into sequence patterns.

Acknowledgment

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (no. 2012-0000478).