
2002 | Book

Advances in Knowledge Discovery and Data Mining

6th Pacific-Asia Conference, PAKDD 2002 Taipei, Taiwan, May 6–8, 2002 Proceedings

Edited by: Ming-Syan Chen, Philip S. Yu, Bing Liu

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

Knowledge discovery and data mining have become areas of growing significance because of the recent increasing demand for KDD techniques, including those used in machine learning, databases, statistics, knowledge acquisition, data visualization, and high performance computing. In view of this, and following the success of the five previous PAKDD conferences, the sixth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2002) aimed to provide a forum for the sharing of original research results, innovative ideas, state-of-the-art developments, and implementation experiences in knowledge discovery and data mining among researchers in academic and industrial organizations. Much work went into preparing a program of high quality. We received 128 submissions. Every paper was reviewed by 3 program committee members; 32 were selected as regular papers and 20 as short papers, representing a 25% acceptance rate for regular papers. The PAKDD 2002 program was further enhanced by two keynote speeches, delivered by Vipin Kumar from the Univ. of Minnesota and Rajeev Rastogi from AT&T. In addition, PAKDD 2002 was complemented by three tutorials: XML and data mining (by Kyuseok Shim and Surajit Chaudhuri), mining customer data across various customer touchpoints at e-commerce sites (by Jaideep Srivastava), and data clustering analysis, from simple groupings to scalable clustering with constraints (by Osmar Zaïane and Andrew Foss).

Table of Contents

Frontmatter

Industrial Papers (Invited)

Network Data Mining and Analysis: The NEMESIS Project

Modern communication networks generate large amounts of operational data, including traffic and utilization statistics and alarm/fault data at various levels of detail. These massive collections of network-management data can grow in the order of several Terabytes per year, and typically hide “knowledge” that is crucial to some of the key tasks involved in effectively managing a communication network (e.g., capacity planning and traffic engineering). In this short paper, we provide an overview of some of our recent and ongoing work in the context of the NEMESIS project at Bell Laboratories that aims to develop novel data warehousing and mining technology for the effective storage, exploration, and analysis of massive network-management data sets. We first give some highlights of our work on Model-Based Semantic Compression (MBSC), a novel data-compression framework that takes advantage of attribute semantics and data-mining models to perform lossy compression of massive network-data tables. We discuss the architecture and some of the key algorithms underlying SPARTAN, a model-based semantic compression system that exploits predictive data correlations and prescribed error tolerances for individual attributes to construct concise and accurate Classification and Regression Tree (CaRT) models for entire columns of a table. We also summarize some of our ongoing work on warehousing and analyzing network-fault data and discuss our vision of how data-mining techniques can be employed to help automate and improve fault-management in modern communication networks. More specifically, we describe the two key components of modern fault-management architectures, namely the event-correlation and the root-cause analysis engines, and propose the use of mining ideas for the automated inference and maintenance of the models that lie at the core of these components based on warehoused network data.

Minos Garofalakis, Rajeev Rastogi
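
The MBSC idea lends itself to a compact illustration. The following sketch is ours, not the SPARTAN system: it fits a small CaRT-style decision tree to predict one column of a synthetic table from the others and stores explicitly only the rows where the model errs, which is the core of semantic compression (here with a zero error tolerance on the compressed column).

```python
# Minimal sketch of model-based semantic compression: predict one column
# from the others with a small decision tree, store only the exceptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Hypothetical network-management table: two predictor columns and a
# target column that is strongly (but not perfectly) correlated with them.
X = rng.integers(0, 4, size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 3).astype(int)
y[rng.choice(1000, size=30, replace=False)] ^= 1   # a few exceptions

model = DecisionTreeClassifier(max_depth=3).fit(X, y)
outliers = np.flatnonzero(model.predict(X) != y)   # rows kept verbatim
print(f"stored {len(outliers)} of {len(y)} values explicitly")

# Lossless reconstruction = model predictions patched with the outliers.
restored = model.predict(X)
restored[outliers] = y[outliers]
assert (restored == y).all()
```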
Privacy Preserving Data Mining: Challenges and Opportunities

The goal of privacy preserving data mining is to develop accurate models without access to precise information in individual data records, thus finessing the conflict between privacy and data mining. In this talk, I will give an introduction to the techniques underlying privacy preserving data mining, and then discuss several application domains. In particular, recent events have led to an increased interest in applying data mining toward security related problems, leading to interesting technical challenges at the intersection of privacy, security and data mining.

Ramakrishnan Srikant

Survey Papers (Invited)

A Case for Analytical Customer Relationship Management

The Internet has emerged as a low cost, low latency and high bandwidth customer communication channel. Its interactive nature provides an organization the ability to enter into a close, personalized dialog with individual customers. The simultaneous maturation of data management technologies like data warehousing, and data mining, have created the ideal environment for making customer relationship management (CRM) a much more systematic effort than it has been in the past. In this paper we described how data analytics can be used to make various CRM functions like customer segmentation, communication targeting, retention, and loyalty much more effective. We briefly describe the key technologies needed to implement analytical CRM, and the organizational issues that must be carefully handled to make CRM a reality. Our goal is to illustrate problems that exist with current CRM efforts, and how using data analytics techniques can address them. Our hope is to get the data mining community interested in this important application domain.

Jaideep Srivastava, Jau-Hwang Wang, Ee-Peng Lim, San-Yih Hwang
On Data Clustering Analysis: Scalability, Constraints, and Validation

Clustering is the problem of grouping data based on similarity. While this problem has attracted the attention of many researchers for many years, we are witnessing a resurgence of interest in new clustering techniques. In this paper we discuss some very recent clustering approaches and recount our experience with some of these algorithms. We also present the problem of clustering in the presence of constraints and discuss the issue of clustering validation.

Osmar R. Zaïane, Andrew Foss, Chi-Hoon Lee, Weinan Wang

Association Rules (I)

Discovering Numeric Association Rules via Evolutionary Algorithm

Association rules are one of the most used tools to discover relationships among attributes in a database. Nowadays, there are many efficient techniques to obtain these rules, although most of them require that the values of the attributes be discrete. To solve this problem, these techniques discretize the numeric attributes, but this implies a loss of information. In a general way, these techniques work in two phases: in the first one they try to find the sets of attributes that appear with a determined frequency in the database (frequent itemsets), and in the second one, they extract the association rules from these sets. In this paper we present a technique to find the frequent itemsets in numeric databases without needing to discretize the attributes. We use an evolutionary algorithm to find the intervals of each attribute that conform a frequent itemset. The evaluation function itself decides the amplitude of these intervals. Finally, we evaluate the tool with synthetic and real databases to check the efficiency of our algorithm.

Jacinto Mata, José-Luis Alvarez, José-Cristobal Riquelme
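
As a rough sketch of the interval-search idea above (ours; the weighting is an assumption, not the authors' evaluation function), a fitness for a candidate set of intervals can reward support while penalizing interval amplitude, so the evolutionary search favors tight, frequent intervals:

```python
# Illustrative fitness for an interval itemset over numeric attributes:
# reward support, penalize wide intervals. The penalty weight is an
# assumption for illustration, not the paper's exact function.
import numpy as np

def fitness(data, intervals, amplitude_weight=0.5):
    """data: (n, d) array; intervals: one (lo, hi) pair per attribute."""
    inside = np.ones(len(data), dtype=bool)
    for j, (lo, hi) in enumerate(intervals):
        inside &= (data[:, j] >= lo) & (data[:, j] <= hi)
    support = inside.mean()
    spans = data.max(axis=0) - data.min(axis=0)
    amplitude = np.mean([(hi - lo) / s for (lo, hi), s in zip(intervals, spans)])
    return support - amplitude_weight * amplitude

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 2))
print(fitness(data, [(-1.0, 1.0), (-0.5, 0.5)]))
```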
Efficient Rule Retrieval and Postponed Restrict Operations for Association Rule Mining

Knowledge discovery in databases is a complex, iterative, and highly interactive process. When mining for association rules, typically interactivity is largely smothered by the execution times of the rule generation algorithms. Our approach is to accept a single, possibly expensive run, but all subsequent mining queries are supposed to be answered interactively by accessing a sophisticated rule cache. However there are two critical aspects. First, access to the cache must be efficient and comfortable. Therefore we enrich the basic association mining framework by descriptions of items through application dependent attributes. Furthermore we extend current mining query languages to deal with these attributes through ∃ and ∀ quantifiers. Second, the cache must be prepared to answer a broad variety of queries without rerunning the mining algorithm. A main contribution of this paper is that we show how to postpone restrict operations on the transactions from rule generation to rule retrieval from the cache. That is, without actually rerunning the algorithm, we efficiently construct those rules from the cache that would have been generated if the mining algorithm were run on only a subset of the transactions. In addition we describe how we implemented our ideas on a conventional relational database system. We evaluate our prototype concerning response times in a pilot application at DaimlerChrysler. It turns out to satisfy easily the demands of interactive data mining.

Jochen Hipp, Christoph Mangold, Ulrich Güntzer, Gholamreza Nakhaeizadeh
Association Rule Mining on Remotely Sensed Images Using P-trees

Association Rule Mining, originally proposed for market basket data, has potential applications in many areas. Remote Sensed Imagery (RSI) data is one of the promising application areas. Extracting interesting patterns and rules from datasets composed of images and associated ground data can be of importance in precision agriculture, community planning, resource discovery, and other areas. However, in most cases the image data sizes are too large to be mined in a reasonable amount of time using existing algorithms. In this paper, we propose an approach to derive association rules on RSI data using the Peano Count Tree (P-tree) structure. The P-tree structure, proposed in our previous work, provides a lossless and compressed representation of image data. Based on P-trees, an efficient association rule mining algorithm, P-ARM, with fast support calculation and significant pruning techniques is introduced to improve the efficiency of the rule mining process. The P-ARM algorithm is implemented and compared with the FP-growth and Apriori algorithms. Experimental results show that our algorithm is superior for association rule mining on RSI spatial data.

Qin Ding, Qiang Ding, William Perrizo
On the Efficiency of Association-Rule Mining Algorithms

In this paper, we first focus our attention on the question of how much space remains for performance improvement over current association rule mining algorithms. Our strategy is to compare their performance against an “Oracle algorithm” that knows in advance the identities of all frequent itemsets in the database and only needs to gather their actual supports to complete the mining process. Our experimental results show that current mining algorithms do not perform uniformly well with respect to the Oracle for all database characteristics and support thresholds. In many cases there is a substantial gap between the Oracle’s performance and that of the current mining algorithms. Second, we present a new mining algorithm, called ARMOR, that is constructed by making minimal changes to the Oracle algorithm. ARMOR consistently performs within a factor of two of the Oracle on both real and synthetic datasets over practical ranges of support specifications.

Vikram Pudi, Jayant R. Haritsa

Classification (I)

A Function-Based Classifier Learning Scheme Using Genetic Programming

Classification is an important research topic in knowledge discovery and data mining. Many different classifiers have been developed in recent years. In this paper, we propose an effective scheme for learning multicategory classifiers based on genetic programming. For a k-class classification problem, a training strategy called adaptive incremental learning and a new fitness function are used to generate k discriminant functions. We require the discriminant functions to map the domains of training data into a specified interval, so that data are assigned to one of the classes by the values of the functions. Furthermore, a Z-value measure is developed for resolving conflicts. The experimental results show that the proposed GP-based classification learning approach is effective and achieves high classification accuracy.

Jung-Yi Lin, Been-Chian Chien, Tzung-Pei Hong
SNNB: A Selective Neighborhood Based Naïve Bayes for Lazy Learning

Naïve Bayes is a probability-based classification method built on the assumption that attributes are conditionally mutually independent given the class label. Much research has focused on improving the accuracy of Naïve Bayes via eager learning. In this paper, we propose a novel lazy learning algorithm, Selective Neighbourhood based Naïve Bayes (SNNB). SNNB computes different distance neighborhoods of the new input object, lazily learns multiple Naïve Bayes classifiers, and uses the classifier with the highest estimated accuracy to make the decision. The results of our experiments on 26 datasets show that the proposed SNNB algorithm is efficient and that it outperforms Naïve Bayes as well as the state-of-the-art classification methods NBTree, CBA, and C4.5 in terms of accuracy.

Zhipeng Xie, Wynne Hsu, Zongtian Liu, Mong Li Lee
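
A minimal rendering of the lazy SNNB procedure (our simplification: Gaussian naive Bayes on numeric data, whereas the chapter works with categorical attributes; the candidate neighborhood sizes and the in-sample accuracy estimate are illustrative):

```python
# SNNB sketch: fit a naive Bayes model on each of several neighborhoods
# of the query point and answer with the one estimated to be most accurate.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def snnb_predict(X, y, query, sizes=(20, 50, 100)):
    order = np.argsort(np.linalg.norm(X - query, axis=1))
    best_acc, best_pred = -1.0, None
    for k in sizes:
        idx = order[:k]
        if len(np.unique(y[idx])) < 2:       # need both classes to fit
            continue
        nb = GaussianNB().fit(X[idx], y[idx])
        acc = nb.score(X[idx], y[idx])       # crude accuracy estimate
        if acc > best_acc:
            best_acc, best_pred = acc, nb.predict([query])[0]
    return best_pred

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(snnb_predict(X, y, np.array([0.3, -0.1])))
```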
A Method to Boost Naïve Bayesian Classifiers

In this paper, we introduce a new method to improve the performance of combining boosting and naïve Bayesian learning. Instead of combining them directly, which has been shown to be unsatisfactory for improving performance, we select the training samples dynamically by the bootstrap method for the construction of naïve Bayesian classifiers, and hence generate very different, unstable base classifiers for boosting. In addition, we devise a modification of the boosting weight adjustment in order to minimize the overlapping errors of the constituent classifiers. We conducted a series of experiments, which show that the new method not only performs much better than naïve Bayesian classifiers or directly boosted naïve Bayesian ones, but also reaches optimal performance much more quickly than boosting stumps or boosting decision trees incorporated with naïve Bayesian learning.

Lili Diao, Keyun Hu, Yuchang Lu, Chunyi Shi
Toward Bayesian Classifiers with Accurate Probabilities

In most data mining applications, accurate ranking and probability estimation are essential. However, many traditional classifiers aim at a high classification accuracy (or low error rate) only, even though they also produce probability estimates. Does high predictive accuracy imply a better ranking and probability estimation? Is there any better evaluation method for those classifiers than the classification accuracy, for the purpose of data mining applications? The answer is the area under the ROC (Receiver Operating Characteristics) curve, or simply AUC. We show that AUC provides a more discriminating evaluation for the ranking and probability estimation than the accuracy does. Further, we show that classifiers constructed to maximise the AUC score produce not only higher AUC values, but also higher classification accuracies. Our results are based on experimental comparison between error-based and AUC-based learning algorithms for TAN (Tree-Augmented Naive Bayes).

Charles X. Ling, Huajie Zhang
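
AUC itself is straightforward to compute from ranks; a worked sketch of the rank-sum formulation (the probability that a random positive is scored above a random negative):

```python
# AUC via the Mann-Whitney rank-sum statistic; average ranks handle ties.
import numpy as np
from scipy.stats import rankdata

def auc(scores, labels):
    r = rankdata(scores)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (r[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

labels = np.array([1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.7, 0.6, 0.4, 0.8, 0.3])
print(auc(scores, labels))   # 1.0: every positive outranks every negative
```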

Interestingness

Pruning Redundant Association Rules Using Maximum Entropy Principle

Data mining algorithms produce huge sets of rules, practically impossible to analyze manually. It is thus important to develop methods for removing redundant rules from those sets. We present a solution to the problem using the Maximum Entropy approach. The problem of efficiency of Maximum Entropy computations is addressed by using closed form solutions for the most frequent cases. Analytical and experimental evaluation of the proposed technique indicates that it efficiently produces small sets of interesting association rules.

Szymon Jaroszewicz, Dan A. Simovici
A Confidence-Lift Support Specification for Interesting Associations Mining

Recently, the weakness of the canonical support-confidence framework for associations mining has been widely studied in the literature. One of the difficulties in applying association rule mining to real world applications is the setting of the support constraint. A high support constraint avoids the combinatorial explosion in discovering frequent itemsets, but at the expense of missing interesting patterns of low support. Instead of seeking a way to set the appropriate support constraint, all current approaches leave the users in charge of the support setting, which puts the users in a dilemma. This paper is an effort to answer this long-standing open question. Based on the notions of confidence and lift measures, we propose an automatic support specification for mining high-confidence and positive-lift associations without consulting the users. Experimental results show that this specification is good at discovering low-support but high-confidence, positive-lift associations, and is effective in reducing spurious frequent itemsets.

Wen-Yang Lin, Ming-Cheng Tseng, Ja-Hwung Su
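
The two measures behind the specification are easy to state concretely; the thresholds below are illustrative, not the paper's derived settings:

```python
# Confidence and lift of a rule A -> B over a list of transactions; a
# rule is kept only if confidence is high and lift exceeds 1 (positive).
def rule_stats(transactions, A, B):
    n = len(transactions)
    supp_A = sum(A <= t for t in transactions) / n
    supp_B = sum(B <= t for t in transactions) / n
    supp_AB = sum(A | B <= t for t in transactions) / n
    conf = supp_AB / supp_A if supp_A else 0.0
    lift = conf / supp_B if supp_B else 0.0
    return conf, lift

txns = [{"milk", "bread"}, {"milk", "bread"}, {"bread", "milk"},
        {"eggs"}, {"eggs", "tea"}]
conf, lift = rule_stats(txns, {"milk"}, {"bread"})
print(conf, lift, conf >= 0.7 and lift > 1.0)   # conf=1.0, lift~1.67 -> keep
```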
Concise Representation of Frequent Patterns Based on Generalized Disjunction-Free Generators

Frequent patterns are often used for solving data mining problems. They are applied, e.g., in the discovery of association rules, episode rules, sequential patterns, and clusters. Nevertheless, the number of frequent itemsets is usually huge. In this paper, we briefly overview four recently proposed lossless representations of frequent itemsets and offer a new lossless one based on generalized disjunction-free generators. We prove on a theoretical basis that the new representation is more concise than three of the four preceding representations. In practice, it is much more concise than the fourth representation too. An algorithm determining the new representation is proposed.

Marzena Kryszkiewicz, Marcin Gajek
Mining Interesting Association Rules: A Data Mining Language

Mining association rules discovers customer purchasing behaviors from a transaction database so that the quality of business decisions can be improved. However, the transaction database can be very large. It is very time consuming to find all the association rules in a large database, and users may be interested in only some of the information. Hence, a data mining language needs to be provided so that users can query from a large database of customer transactions only the knowledge that interests them. In this paper, a data mining language is presented. With the data mining language, users can specify the items of interest and the criteria of the association rules to be discovered. Efficient data mining techniques are also proposed to extract the association rules according to the user requirements.

Show-Jane Yen, Yue-Shi Lee
The Lorenz Dominance Order as a Measure of Interestingness in KDD

Ranking summaries generated from databases is useful within the context of descriptive data mining tasks where a single data set can be generalized in many different ways and to many levels of granularity. Our approach to generating summaries is based upon a data structure, associated with an attribute, called a domain generalization graph (DGG). A DGG for an attribute is a directed graph where each node represents a domain of values created by partitioning the original domain for the attribute, and each edge represents a generalization relation between these domains. Given a set of DGGs associated with a set of attributes, a generalization space can be defined as all possible combinations of domains, where one domain is selected from each DGG for each combination. This generalization space describes, then, all possible summaries consistent with the DGGs that can be generated from the selected attributes. When the number of attributes to be generalized is large or the DGGs associated with the attributes are complex, the generalization space can be very large, resulting in the generation of many summaries. The number of summaries can easily exceed the capabilities of a domain expert to identify interesting results. In this paper, we show that the Lorenz dominance order can be used to rank the summaries prior to presentation to the domain expert. The Lorenz dominance order defines a partial order on the summaries, in most cases, and in some cases, defines a total order. The rank order of the summaries represents an objective evaluation of their relative interestingness and provides the domain expert with a starting point for further subjective evaluation of the summaries.

Robert J. Hilderman
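
A small sketch of the dominance test itself (our reading of the standard Lorenz-curve definition; which direction counts as more interesting is the chapter's concern):

```python
# Summary A Lorenz-dominates summary B if A's Lorenz curve lies on or
# above B's at every point. Curves are built over tuple-count vectors.
import numpy as np

def lorenz_curve(counts):
    shares = np.sort(np.asarray(counts, dtype=float))
    return np.cumsum(shares / shares.sum())  # cumulative share, smallest first

def dominates(a, b):
    return np.all(lorenz_curve(a) >= lorenz_curve(b))  # equal lengths assumed

uniform = [25, 25, 25, 25]
skewed = [70, 10, 10, 10]
print(dominates(uniform, skewed), dominates(skewed, uniform))  # True False
```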

Sequence Mining

Efficient Algorithms for Incremental Update of Frequent Sequences

Most of the works proposed so far on mining frequent sequences assume that the underlying database is static. However, in real life, the database is modified from time to time. This paper studies the problem of incremental update of frequent sequences when the database changes. We propose two efficient incremental algorithms, GSP+ and MFS+. Through experiments, we compare the performance of GSP+ and MFS+ with GSP and MFS — two efficient algorithms for mining frequent sequences. We show that GSP+ and MFS+ effectively reduce the CPU costs of their counterparts with only a small, or even negative, additional expense on I/O cost.

Minghua Zhang, Ben Kao, David Cheung, Chi-Lap Yip
DELISP: Efficient Discovery of Generalized Sequential Patterns by Delimited Pattern-Growth Technology

An active research area in data mining is the discovery of sequential patterns, which finds all frequent subsequences in a sequence database. Most of the studies specify no time constraints, such as maximum/minimum gaps between adjacent elements of a pattern, in the mining, so the resultant patterns may be uninteresting. In addition, a data sequence is rigidly defined to contain a pattern only when each element of the pattern is contained in a distinct element of the sequence. This limitation might miss useful patterns for some applications, because items of an element might be spread across adjoining elements within a specified time period or time window. Therefore, we propose a pattern-growth approach for mining generalized sequential patterns. Our approach reduces the size of sub-databases by bounded and windowed projection techniques: bounded projections keep only time-gap valid subsequences, and windowed projections save non-redundant subsequences satisfying the sliding time-window constraint. Furthermore, the delimited growth technique directly generates constraint-satisfying patterns and speeds up the growing process. The empirical evaluations show that the proposed approach has good linear scalability and outperforms the well-known GSP algorithm in the discovery of generalized sequential patterns.

Ming-Yen Lin, Suh-Yin Lee, Sheng-Shun Wang
Self-Similarity for Data Mining and Predictive Modeling A Case Study for Network Data

Recently there have been a handful of studies on observing self-similarity and fractals in natural structures and scientific databases such as network traffic data. However, there are few works on employing such information for predictive modeling, data mining, and knowledge discovery. In this paper we study, analyze, and report our experiments and observations of self-similar structure embedded in network data for prediction through a Self-Similar Layered Hidden Markov Model (SSLHMM). SSLHMM is a novel variant of Hidden Markov Models (HMM), which have proven useful in a variety of real world applications. SSLHMM leverages the power of HMMs, extends this capability to self-similar structures, and exploits the self-similarity property to reduce the complexity of the predictive modeling process. We show that the SSLHMM approach captures self-similar information and provides a more accurate and interpretable model than conventional techniques.

Jafar Adibi, Wei-Min Shen, Eaman Noorbakhsh
A New Mechanism of Mining Network Behavior

In this work, a new mechanism, consisting of a Preprocessing Phase, a Two-Layer Pattern Discovering (2LPD) Phase, and a Pattern Explanation Phase, is proposed to discover unknown patterns. Two heuristics are proposed to detect outlier users in the 2LPD Phase, which is also concerned with subsequences of users' behaviors. As previously unknown patterns are discovered in the 2LPD Phase, they are incrementally fed back to the knowledge base for further detection. Through this incremental learning mechanism, the set of known patterns can grow.

Shun-Chieh Lin, Shian-Shyong Tseng, Yao-Tsung Lin

Clustering

M-FastMap: A Modified FastMap Algorithm for Visual Cluster Validation in Data Mining

This paper presents M-FastMap, a modified FastMap algorithm for visual cluster validation in data mining. In visual cluster validation with FastMap, clusters are first generated with a clustering algorithm from a database. Then, the FastMap algorithm is used to project the clusters onto a 2-dimensional (2D) or 3-dimensional (3D) space, and the clusters are visualized with different colors and/or symbols on a 2D (or 3D) display. From the display a human can visually examine the separation of clusters. This method follows the principle that if a cluster is separate from others in the projected 2D (or 3D) space, it is also separate from others in the original high dimensional space (the opposite is not true). The modified FastMap algorithm improves the quality of visual cluster validation by optimizing the separation of clusters in the 2D (or 3D) space through the selection of pivot objects (or projection axes). A comparison study shows that the modified FastMap algorithm produces better visualization results than the original FastMap algorithm.

Michael Ng, Joshua Huang
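
The core FastMap projection step is a one-line formula once two pivots are chosen; a sketch (pivot selection here is the original far-pair heuristic, not M-FastMap's optimized choice):

```python
# FastMap projects each object onto the line through pivots a and b using
# only pairwise distances: x_i = (d(a,i)^2 + d(a,b)^2 - d(b,i)^2) / (2 d(a,b)).
import numpy as np

def fastmap_coords(D, a, b):
    """D: (n, n) distance matrix; a, b: pivot indices."""
    d_ab = D[a, b]
    return (D[a, :] ** 2 + d_ab ** 2 - D[b, :] ** 2) / (2 * d_ab)

rng = np.random.default_rng(3)
pts = rng.normal(size=(6, 4))
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
a = 0
b = int(np.argmax(D[a]))        # simple heuristic: farthest point from a
print(np.round(fastmap_coords(D, a, b), 3))
```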
An Incremental Hierarchical Data Clustering Algorithm Based on Gravity Theory

One of the main challenges in the design of modern clustering algorithms is that, in many applications, new data sets are continuously added into an already huge database. As a result, it is impractical to carry out data clustering from scratch whenever new data instances are added into the database. One way to tackle this challenge is to incorporate a clustering algorithm that operates incrementally. Another desirable feature of clustering algorithms is the generation of a clustering dendrogram. This feature is crucial for many applications in biological, social, and behavior studies, due to the need to construct taxonomies. This paper presents the GRIN algorithm, an incremental hierarchical clustering algorithm for numerical data sets based on gravity theory in physics. The GRIN algorithm delivers favorable clustering quality and generally features O(n) time complexity. One main factor that enables the GRIN algorithm to deliver favorable clustering quality is that its optimal parameter settings are not sensitive to the distribution of the data set. In contrast, many modern clustering algorithms suffer from unreliable or poor clustering quality when the data set contains highly skewed local distributions, so that no optimal values can be found for some global parameters. This paper also reports the experiments conducted to study the characteristics of the GRIN algorithm.

Chien-Yu Chen, Shien-Ching Hwang, Yen-Jen Oyang
Adding Personality to Information Clustering

This article presents a new information management method called user-configurable clustering that integrates the flexibility of clustering systems in handling novel data and the ease of use of categorization systems in providing structure. Based on a predictive self-organizing network that performs synchronized clustering of information and preference vectors, a user can influence the clustering of information vectors by encoding his/her preferences as preference vectors. We illustrate a sample session to show how a user may create and personalize an information portfolio according to his/her preferences and how the system discovers novel information groupings while organizing familiar information according to user-defined themes.

Ah-Hwee Tan, Hong Pan
Clustering Large Categorical Data

Clustering methods often come down to the optimization of a numeric criterion defined from a distance or from a dissimilarity measure. It is possible to show that this problem is often equivalent to estimating the parameters of a probabilistic model under the classification likelihood approach. For instance, we know that the inertia criterion optimized by the k-means algorithm corresponds to the hypothesis of a population arising from a Gaussian mixture. In this paper, we propose an adapted mixture model for categorical data. Using the classification likelihood approach, we develop the Classification EM algorithm (CEM) to estimate the parameters of the mixture model. With our probabilistic model, the data are not denatured and the estimated parameters readily indicate the characteristics of the clusters. This probabilistic approach gives an interpretation of the criterion optimized by the k-modes algorithm, an extension of k-means to categorical attributes, and allows us to study the behavior of this algorithm.

François-Xavier Jollois, Mohamed Nadif
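
For reference, the k-modes loop that the chapter reinterprets probabilistically is short; a bare-bones sketch with naive initialization:

```python
# k-modes: assign rows to the nearest mode under simple matching distance
# (number of mismatched attributes), then recompute per-attribute modes.
import numpy as np

def k_modes(X, k, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    modes = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        dist = (X[:, None, :] != modes[None, :, :]).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members):
                modes[c] = [np.bincount(col).argmax() for col in members.T]
    return labels, modes

X = np.array([[0, 1, 1], [0, 1, 0], [0, 1, 1],      # categorical codes
              [2, 0, 1], [2, 0, 0], [2, 0, 1]])
labels, modes = k_modes(X, 2)
print(labels, modes, sep="\n")
```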

Web Mining

WebFrame: In Pursuit of Computationally and Cognitively Efficient Web Mining

The goal of web mining is relatively simple: provide both computationally and cognitively efficient methods for improving the value of information to users of the WWW. The need for computational efficiency is well-recognized by the data mining community, which sprang from the database community's concern for efficient manipulation of large datasets. The motivation for cognitive efficiency is more elusive but at least as important. Inasmuch as cognitive efficiency can be informally construed as ease of understanding, what is important is any tool or technique that presents cognitively manageable abstractions of large datasets. We present our initial development of a framework for gathering, analyzing, and redeploying web data. Not dissimilar to conventional data mining, the general idea is that good use of web data first requires the careful selection of data (both usage and content data), the deployment of appropriate learning methods, and the evaluation of the results of applying what is learned in a web application. Our framework includes tools for building, using, and visualizing web abstractions. We present an example of the deployment of our framework to navigation improvement. The abstractions we develop are called Navigation Compression Models (NCMs), and we show a method for creating them, using them, and visualizing them to aid in their understanding.

Tong Zheng, Yonghe Niu, Randy Goebel
Naviz: Website Navigational Behavior Visualizer

Navigational behavior of website visitors can be extracted from web access log files with data mining techniques such as sequential pattern mining. Visualization of the discovered patterns is very helpful for understanding how visitors navigate over the various pages on the site. Several web log visualization tools have been developed, but they remain far from satisfactory: they provide neither an effective global view of visitor access nor individual traversal paths. Here we introduce Naviz, an interactive web log visualization system designed to overcome those drawbacks. It combines a two-dimensional graph of visitor traversals that takes account of appropriate web traversal properties (hierarchization by traversal traffic and grouping of related pages) with facilities for filtering traversal paths by specifying visited pages and path attributes such as number of hops, support, and confidence. The tool also supports modern dynamic web pages. We apply the tool to visualize the results of a data mining study on web log data of Mobile Townpage, a directory service of phone numbers in Japan for i-Mode mobile internet users. The results indicate that our system can easily handle thousands of discovered patterns and uncover interesting navigational behavior such as success paths, exit paths, and lost paths.

Bowo Prasetyo, Iko Pramudiono, Katsumi Takahashi, Masaru Kitsuregawa
Optimal Algorithms for Finding User Access Sessions from Very Large Web Logs

Although efficient identification of user access sessions from very large web logs is an unavoidable data preparation task for the success of higher level web log mining, little attention has been paid to algorithmic study of this problem. In this paper we consider two types of user access sessions, interval sessions and gap sessions. We design two efficient algorithms for finding respectively those two types of sessions with the help of new data structures. We present both theoretical and empirical analysis of the algorithms and prove that both algorithms have optimal time complexity.

Zhixiang Chen, Ada Wai-Chee Fu, Frank Chi-Hung Tong
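
The two session notions are easy to state concretely (the thresholds below are illustrative; the chapter's contribution is the data structures and optimality analysis, not these definitions):

```python
# Gap sessions split a sorted click stream wherever idle time exceeds
# `gap`; interval sessions instead cap the total span of each session.
def gap_sessions(times, gap=1800):
    sessions, current = [], [times[0]]
    for prev, t in zip(times, times[1:]):
        if t - prev > gap:
            sessions.append(current)
            current = []
        current.append(t)
    sessions.append(current)
    return sessions

def interval_sessions(times, span=3600):
    sessions, current, start = [], [], times[0]
    for t in times:
        if t - start > span:
            sessions.append(current)
            current, start = [], t
        current.append(t)
    sessions.append(current)
    return sessions

clicks = [0, 100, 500, 4000, 4200, 9000]    # seconds, already sorted
print(gap_sessions(clicks))        # [[0, 100, 500], [4000, 4200], [9000]]
print(interval_sessions(clicks))   # [[0, 100, 500], [4000, 4200], [9000]]
```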
Automatic Information Extraction for Multiple Singular Web Pages

The World Wide Web is now undeniably the richest and most dense source of information, yet its structure makes it difficult to make use of that information in a systematic way. This paper extends a pattern discovery approach called IEPAD to the rapid generation of information extractors that can extract structured data from semi-structured Web documents. IEPAD is proposed to automate wrapper generation from a multiple-record Web page without user-labeled examples. In this paper, we consider another case when multiple Web pages are available but each input Web page contains only one record (called singular Web pages). To solve this case, a hierarchical multiple string alignment is proposed to allow wrapper induction for multiple singular Web pages.

Chia-Hui Chang, Shih-Chien Kuo, Kuo-Yu Hwang, Tsung-Hsin Ho, Chih-Lung Lin

Association Rules (II)

An Improved Approach for the Discovery of Causal Models via MML

Discovering a precise causal structure that accurately reflects the given data is one of the most essential tasks in data mining and machine learning. One successful causal discovery approach is the information-theoretic approach using the Minimum Message Length principle [19]. This paper presents an improved MML discovery algorithm together with further experimental results. We introduce a new encoding scheme for measuring the cost of describing the causal structure. The Stirling approximation is also applied to further reduce the computational complexity, so the system works more efficiently. The experimental results of the current version of the discovery system show that: (1) it is capable of discovering what the previous system discovered; (2) it can discover more complicated causal models with a large number of variables; and (3) it works more efficiently than the previous version in terms of time complexity.

Honghua Dai, Gang Li
SETM*-MaxK: An Efficient SET-Based Approach to Find the Largest Itemset

In this paper, we propose the SETM*-MaxK algorithm to find the largest itemset based on a high-level set-based approach, where a large itemset is a set of items appearing in a sufficient number of transactions. The advantage of a set-based approach such as the SETM algorithm is that it is simple and stable over the range of parameter values. In the SETM*-MaxK algorithm, rather than proceeding step by step, we efficiently find $$ L_k $$ based on $$ L_w $$, where $$ L_k $$ denotes the set of large k-itemsets with minimum support, $$ L_k \ne \emptyset $$, $$ L_{k+1} = \emptyset $$, and $$ w = 2^{\left\lceil \log_2 k \right\rceil - 1} $$. Our simulations show that the proposed SETM*-MaxK algorithm requires less time to achieve its goal than the SETM algorithm.

Ye-In Chang, Yu-Ming Hsieh
Discovery of Ordinal Association Rules

Most rule-interest measures are suitable for binary attributes, and using a standard unsupervised algorithm for the discovery of association rules requires a transformation for other kinds of attributes. Given that the complexity of these algorithms increases exponentially with the number of attributes, this transformation can lead, on the one hand, to a combinatorial explosion, and on the other hand, to a prohibitive number of weakly significant rules with many redundancies. To fill the gap, we propose in this study a new objective rule-interest measure, called intensity of inclination, which evaluates the implication between two ordinal attributes (numeric or ordinal categorical attributes). This measure allows us to extract a new kind of knowledge: ordinal association rules. An evaluation on an application to some banking data concludes the study.

Sylvie Guillaume
Value Added Association Rules

Value added product is an industrial term referring to a minor addition to some major product. In this paper, we borrow the term to denote a minor semantic addition to the well-known association rules. We consider the addition of numerical values to the attribute values, such as sale price, profit, degree of fuzziness, level of security, and so on. Such additions lead to the notion of random variables (as added values to attributes) in the data model and hence to probabilistic considerations in data mining.

T. Y. Lin, Y. Y. Yao, E. Louie
Top Down FP-Growth for Association Rule Mining

In this paper, we propose an efficient algorithm, called TD-FP-Growth (shorthand for Top-Down FP-Growth), to mine frequent patterns. TD-FP-Growth searches the FP-tree in top-down order, as opposed to the bottom-up order of the previously proposed FP-Growth. The advantage of the top-down search is that it generates no conditional pattern bases and sub-FP-trees, thus saving a substantial amount of time and space. We extend TD-FP-Growth to mine association rules by applying two new pruning strategies: one pushes multiple minimum supports and the other pushes the minimum confidence. Experiments show that these algorithms and strategies are highly effective in reducing the search space.

Ke Wang, Liu Tang, Jiawei Han, Junqiang Liu
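
For readers unfamiliar with the underlying structure, a minimal FP-tree construction looks as follows (background only; the top-down traversal that distinguishes TD-FP-Growth is not shown):

```python
# Build an FP-tree: order each transaction's frequent items by descending
# global frequency and insert along a shared prefix path with counts.
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions, min_support=2):
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    keep = {i: n for i, n in freq.items() if n >= min_support}
    root = Node(None, None)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in keep),
                           key=lambda i: (-keep[i], i)):
            node = node.children.setdefault(item, Node(item, node))
            node.count += 1
    return root

def dump(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        dump(child, depth + 1)

dump(build_fp_tree([["a", "b"], ["b", "c"], ["a", "b", "c"], ["a", "d"]]))
```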

Semi-structure & Concept Mining

Discovery of Frequent Tag Tree Patterns in Semistructured Web Documents

Many Web documents, such as HTML files and XML files, have no rigid structure and are called semistructured data. In general, such semistructured Web documents are represented by rooted trees with ordered children. We propose a new method for discovering frequent tree structured patterns in semistructured Web documents by using a tag tree pattern as a hypothesis. A tag tree pattern is an edge-labeled tree with ordered children which has structured variables. An edge label is a tag or a keyword in such Web documents, and a variable can be substituted by an arbitrary tree, so a tag tree pattern is suited for representing tree structured patterns in such Web documents. First we show that it is hard to compute the optimum frequent tag tree pattern. We then present an algorithm for generating all maximally frequent tag tree patterns and prove its correctness. Finally, we report some experimental results on our algorithm. Although the algorithm is not efficient, the experiments show that it can extract characteristic tree structured patterns from such data.

Tetsuhiro Miyahara, Yusuke Suzuki, Takayoshi Shoudai, Tomoyuki Uchida, Kenichi Takahashi, Hiroaki Ueda
Extracting Characteristic Structures among Words in Semistructured Documents

Electronic documents such as SGML/HTML/XML files and LaTeX files have been increasing rapidly with the progress of network and storage technologies. Many electronic documents have no rigid structure and are called semistructured documents. Since a lot of semistructured documents contain large plain texts, we focus on the structural characteristics among words in semistructured documents. The aim of this paper is to present a text mining technique for semistructured documents. We consider the problem of finding all frequent structured patterns among words in semistructured documents. Let k ≥ 2 be an integer and let (W1, W2, ..., Wk) be a list of words sorted in lexicographical order. Firstly, we define a tree-association pattern on (W1, W2, ..., Wk) as a sequence 〈t1; t2; ...; tk-1〉 of labeled rooted trees such that, for i = 1, 2, ..., k-1, (1) ti consists of only one node having the pair of words Wi and Wi+1 as its label, or (2) ti is a labeled rooted tree which has just two leaves, labeled with Wi and Wi+1 respectively. Next, we present a text mining algorithm for finding all frequent tree-association patterns in semistructured documents. Finally, by reporting experimental results on our algorithm, we show that it is effective for extracting structural characteristics from semistructured documents.

Kazuyoshi Furukawa, Tomoyuki Uchida, Kazuya Yamada, Tetsuhiro Miyahara, Takayoshi Shoudai, Yasuaki Nakamura
An Efficient Algorithm for Incremental Update of Concept Spaces

The vocabulary problem in information retrieval arises because authors and indexers often use different terms for the same concept. A thesaurus defines mappings between different but related terms. It is widely used in modern information retrieval systems to solve the vocabulary problem. Chen et al. proposed the concept space approach to automatic thesaurus construction. A concept space contains the associations between every pair of terms. Previous research studies show that concept space is a useful tool for helping information searchers in revising their queries in order to get better results from information retrieval systems. The construction of a concept space, however, is very computationally intensive. In this paper, we propose and evaluate an efficient algorithm for the incremental update of concept spaces. In our model, only strong associations are maintained, since they are most useful in thesauri construction. Our algorithm uses a pruning technique to avoid computing weak associations to achieve efficiency.

Felix Cheung, Ben Kao, David Cheung, Chi-Yuen Ng

Data Warehouse and Data Cube

Efficient Constraint-Based Exploratory Mining on Large Data Cubes

Analysts often explore data cubes to identify anomalous regions that may represent problem areas or new opportunities. Discovery-driven exploration (proposed by S. Sarawagi et al. [5]) automatically detects and marks the exceptions for the user and reduces the reliance on manual discovery. However, when the data is large, it is hard to materialize the whole cube due to the limitations of both space and time, so exploratory mining on complete cube cells needs to construct the data cube dynamically, which takes a very long time. In this paper, we investigate optimization methods that push several constraints into the mining process. By enforcing several user-defined constraints, we first restrict the multidimensional space to a small constrained cube and then mine exceptions on it. Two efficient constrained-cube construction algorithms, the NAIVE algorithm and the AGOA algorithm, are proposed. Experimental results indicate that this kind of constraint-based exploratory mining method is efficient and scalable.

Cuiping Li, Shengen Li, Shan Wang, Xiaoyong Du
Efficient Utilization of Materialized Views in a Data Warehouse

View materialization is an effective method to increase query efficiency in a data warehouse. However, one encounters the problem of space insufficiency if all possible views are materialized in advance. Reducing query time by selecting a proper set of materialized views with a lower cost is crucial for efficient data warehousing. In addition, the costs of data warehouse creation, querying, and maintenance have to be taken into account when views are materialized. The purpose of this research is to select a proper set of materialized views under storage and cost constraints and to help speed up the entire data warehousing process. We propose a cost model for data warehouse querying and maintenance along with an efficient view selection algorithm that uses gain and loss indices. The main contribution of our paper is to speed up the selection process of materialized views; the second is to reduce the total cost of data warehouse querying and maintenance.

Don-Lin Yang, Man-Lin Huang, Ming-Chuan Hung

Bio-Data Mining

Mining Interesting Rules in Meningitis Data by Cooperatively Using GDT-RS and RSBR

This paper describes an application of two rough sets based systems, namely GDT-RS and RSBR, for mining if-then rules in a meningitis dataset. GDT-RS (Generalized Distribution Table and Rough Set) is a soft hybrid induction system, and RSBR (Rough Sets with Boolean Reasoning) is used for the discretization of real valued attributes as a preprocessing step carried out before GDT-RS starts. We argue that discretization of continuous valued attributes is an important preprocessing step in the rule discovery process, and we illustrate that the quality of the rules discovered by GDT-RS is strongly affected by the result of discretization.

Ning Zhong, Juzhen Dong
Evaluation of Techniques for Classifying Biological Sequences

In recent years we have witnessed an exponential increase in the amount of biological information, either DNA or protein sequences, that has become available in public databases. This has been followed by an increased interest in developing computational techniques to automatically classify these large volumes of sequence data into various categories corresponding to their role in the chromosomes, their structure, and/or their function. In this paper we evaluate some of the widely used sequence classification algorithms and develop a framework for modeling sequences in a fashion that allows traditional machine learning algorithms, such as support vector machines, to be applied easily. Our detailed experimental evaluation shows that the SVM-based approaches achieve higher classification accuracy than the more traditional sequence classification algorithms such as Markov model based techniques and k-nearest neighbor based approaches.

Mukund Deshpande, George Karypis
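
The modeling step — mapping variable-length sequences to fixed-length feature vectors so that a standard learner applies — can be sketched with simple k-mer counts (toy data; the chapter's feature schemes and datasets differ):

```python
# Represent each sequence by normalized k-mer counts, then train an SVM.
from itertools import product
import numpy as np
from sklearn.svm import SVC

K = 2
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}

def featurize(seq):
    v = np.zeros(len(KMERS))
    for i in range(len(seq) - K + 1):
        v[INDEX[seq[i:i + K]]] += 1
    return v / max(len(seq) - K + 1, 1)

seqs = ["ACGTACGTAC", "ACACACACAC", "GTGTGTGTGT", "ACGTACACGT",
        "GTGTGGGTGT", "ACACAACACA"]
labels = [0, 0, 1, 0, 1, 0]
clf = SVC(kernel="linear").fit(np.array([featurize(s) for s in seqs]), labels)
print(clf.predict([featurize("GTGTGTGGGT")]))   # expected: class 1
```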
Efficiently Mining Gene Expression Data via Integrated Clustering and Validation Techniques

In recent years, microarray techniques have received extensive attention due to their wide applications in the biomedical industry. The main advantage of the microarray technique is that it allows simultaneous study of the expression of thousands of genes in a single experiment. Analyzing microarray data is a challenge that calls for the application of various clustering methods from data mining. Although a number of clustering methods have been proposed, they cannot meet the requirements of automation, high quality, and high efficiency at the same time when analyzing gene expression data. In this paper, we propose an automatic and efficient clustering approach for mining gene expression data produced via microarray techniques. Through performance experiments on real data sets, the proposed method is shown to achieve higher efficiency, clustering quality, and automation than other clustering methods.

Vincent S. M. Tseng, Ching-Pin Kao

Classification (II)

Adaptive Generalized Estimation Equation with Bayes Classifier for the Job Assignment Problem

We propose combining advanced statistical approaches with data mining techniques to build classifiers that enhance decision-making models for the job assignment problem. We propose Adaptive Generalized Estimation Equation (AGEE) approaches with Gibbs sampling under a Bayesian framework, together with adaptive Bayes classifiers that apply a modified Naive Bayes algorithm to the estimates of the AGEE models. The proposed classifiers have several important features. First, they incorporate the correlation among the outputs and the indeterministic subjective noise into the parameter estimation. Second, they reduce the number of attributes used to predict the class. Moreover, they drop the independence assumption made by the Naive Bayes classifier. We apply our techniques to the problem of assigning jobs to Navy officers, with the goal of enhancing happiness for both the Navy and the officers. The classification results were compared with nearest neighbor, Multi-Layer Perceptron, and Support Vector Machine approaches.

Yulan Liang, King-Ip Lin, Arpad Kelemen
GEC: An Evolutionary Approach for Evolving Classifiers

Using an evolutionary approach to evolve classifiers can simplify the classification task. It requires no domain knowledge of the data to be classified, nor any decision about which attribute to select for partitioning. Our method, called the Genetic Evolved Classifier (GEC), uses a simple structured genetic algorithm to evolve classifiers. Besides evolving rules to classify data into multiple classes, it also provides a simple way to partition continuous data into discrete intervals, i.e., to transform all types of attribute values into enumerable types. Experimental results show that our approach produces promising results and is comparable to methods like C4.5 and Fuzzy-ID3 (F-ID3), and to probabilistic models such as modified Naïve-Bayesian classifiers.

William W. Hsu, Ching-Chi Hsu
An Efficient Single-Scan Algorithm for Mining Essential Jumping Emerging Patterns for Classification

Emerging patterns (EPs), namely itemsets whose supports change significantly from one class to another, were recently proposed to capture multi-attribute contrasts between data classes, or trends over time. Previous studies show that EP/JEP (jumping emerging pattern) based classifiers such as CAEP [2] and the JEP-classifier [6] have good overall predictive accuracy, but they suffer from the huge number of mined EPs/JEPs, which makes the classifiers complex. In this study, we propose a special type of EP, essential jumping emerging patterns (eJEPs), which are believed to be high quality patterns with the most differentiating power and thus sufficient for building accurate classifiers. Existing algorithms such as border-based algorithms and consEPMiner [7] cannot directly mine such eJEPs. We present a new single-scan algorithm to effectively mine eJEPs of both data classes (both directions). Experimental results show that the classifier based exclusively on eJEPs, which uses far fewer JEPs than the JEP-classifier, achieves the same or higher testing accuracy and is often also superior to other state-of-the-art classification systems such as C4.5 and CBA.

Hongjian Fan, Ramamohanarao Kotagiri
A Method to Boost Support Vector Machines

Combining boosting and Support Vector Machines (SVM) has proven beneficial, but it is too complex to be feasible. This paper introduces an efficient way to boost SVM. It embraces the idea of active learning to dynamically select “important” samples into the training sample set for constructing base classifiers. The method maintains a small training sample set of fixed size in order to control the complexity of each base classifier. Rather than constructing each base SVM classifier directly, it uses the training samples only for finding support vectors. Experimental results show that this way of combining boosting and SVM is both accurate and efficient.

Lili Diao, Keyun Hu, Yuchang Lu, Chunyi Shi

Temporal Mining

Distribution Discovery: Local Analysis of Temporal Rules

In recent years, there has been increased interest in using data mining techniques to extract temporal rules from temporal sequences. Local temporal rules, which only a subsequence exhibits, are actually very common in practice. Efficient discovery of the time durations in which temporal rules are valid could benefit KDD in many real applications. In this paper, we present a novel problem class: the discovery of the distribution of temporal rules. We simplify the mining problem and describe a model that represents this knowledge clearly, uniquely, and efficiently. Our methods include four online dividing strategies for different mining interests, an incremental algorithm for measuring rule sets, and an algorithm for mining this knowledge. We have analyzed the behavior of the problem and our algorithms on both synthetic and real data. The results correspond with the definition of our problem and reveal a kind of novel knowledge.

Xiaoming Jin, Yuchang Lu, Chunyi Shi
News Sensitive Stock Trend Prediction

Stock market prediction with data mining techniques is one of the most important issues to be investigated. In this paper, we present a system that predicts changes in stock trends by analyzing the influence of non-quantifiable information (news articles). In particular, we investigate the immediate impact of news articles on the time series, based on the Efficient Markets Hypothesis. Several data mining and text mining techniques are used in a novel way. A new statistics-based piecewise segmentation algorithm is proposed to identify trends in the time series. The segmented trends are clustered into two categories, Rise and Drop, according to the slope of the trend and the coefficient of determination. We propose an algorithm, called guided clustering, to filter news articles with the help of the clusters obtained from the trends. We also propose a new differentiated weighting scheme that assigns higher weights to features that occur in the Rise (Drop) news-article cluster but do not occur in the opposite Drop (Rise) cluster.

Gabriel Pui Cheong Fung, Jeffrey Xu Yu, Wai Lam
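
The trend-labeling step can be approximated in a few lines (a simplification: fixed-width windows and a least-squares slope with an R² filter, whereas the chapter's segmentation is adaptive):

```python
# Label windows of a price series Rise or Drop by the sign of the fitted
# slope, keeping only segments the line fits well (high R^2).
import numpy as np

def label_segments(prices, width=10, r2_min=0.6):
    out, t = [], np.arange(width)
    for s in range(0, len(prices) - width + 1, width):
        y = prices[s:s + width]
        slope, intercept = np.polyfit(t, y, 1)
        resid = y - (slope * t + intercept)
        r2 = 1 - resid.var() / y.var() if y.var() > 0 else 0.0
        if r2 >= r2_min:
            out.append((s, "Rise" if slope > 0 else "Drop", round(r2, 2)))
    return out

rng = np.random.default_rng(4)
prices = np.concatenate([np.linspace(100, 110, 20),
                         np.linspace(110, 95, 20)]) + rng.normal(0, 0.2, 40)
print(label_segments(prices))
```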
User Profiling for Intrusion Detection Using Dynamic and Static Behavioral Models

Intrusion detection has emerged as an important approach to network security. In this paper, we adopt an anomaly detection approach by detecting possible intrusions based on user profiles built from normal usage data. In particular, user profiles based on Unix shell commands are modeled using two different types of behavioral models. The dynamic modeling approach is based on hidden Markov models (HMM) and the principle of maximum likelihood, while the static modeling approach is based on event occurrence frequency distributions and the principle of minimum cross entropy. The novelty detection approach is adopted to estimate the model parameters using normal training data only. To determine whether a certain behavior is similar enough to the normal model and hence should be classified as normal, we use a scheme that can be justified from the perspective of hypothesis testing. Our experimental results show that static modeling outperforms dynamic modeling for this application. Moreover, the static modeling approach based on cross entropy is similar in performance to instance-based learning reported previously by others for the same dataset but with much higher computational and storage requirements than our method.

Dit-Yan Yeung, Yuxin Ding
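
The static model is easy to illustrate (our sketch; the vocabulary, smoothing, and threshold are illustrative stand-ins for the paper's estimation):

```python
# Score a session by the cross entropy between its command distribution
# and the user's normal profile; unusually high scores suggest intrusion.
import numpy as np
from collections import Counter

VOCAB = ["ls", "cd", "vi", "gcc", "rm", "wget"]

def distribution(commands, eps=1e-3):
    counts = Counter(commands)
    p = np.array([counts[c] for c in VOCAB], dtype=float) + eps
    return p / p.sum()

def cross_entropy(profile_cmds, session_cmds):
    return -np.sum(distribution(session_cmds) * np.log(distribution(profile_cmds)))

normal = ["ls", "cd", "vi", "gcc", "ls", "cd", "vi", "ls"] * 10
print(cross_entropy(normal, ["ls", "cd", "vi", "gcc", "ls"]))     # low
print(cross_entropy(normal, ["wget", "rm", "rm", "wget", "rm"]))  # high
```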

Classification (III)

Incremental Extraction of Keyterms for Classifying Multilingual Documents in the Web

With the rapid growth of the Web, there is a need for high-performance techniques for document collection and classification. The goal of our research is to develop a platform to discover English, traditional Chinese, and simplified Chinese documents from the Web in the Greater China area and classify them into a large number of subject classes. Three major challenges are encountered. First, the collection (i.e., the Web) is dynamic: new documents are added and the features of subject classes change constantly. Second, the documents must be classified in a large-scale taxonomy. Third, the collection contains documents written in different languages. A PAT-tree-based approach is developed to deal with document classification in dynamic collections. It uses a PAT tree as a working structure to extract keyterms from the documents in each subject class and then update the features of the class accordingly. The feedback contributes immediately to the classification of incoming documents. In addition, we make use of a manually constructed keyterm set to serve as the basis of document classification in a large-scale taxonomy. Two sets of experiments were done to evaluate the classification performance in a dynamic collection and in a large-scale taxonomy, respectively; both yielded encouraging results. We further suggest an approach extending the PAT-tree-based working structure to deal with the classification of multilingual documents.

Lee-Feng Chien, Chien-Kang Huang, Hsin-Chen Chiao, Shih-Jui Lin
k-nearest Neighbor Classification on Spatial Data Streams Using P-trees

Classification of spatial data streams is crucial, since the training dataset changes often. Building a new classifier each time can be very costly with most techniques. In this situation, k-nearest neighbor (KNN) classification is a very good choice, since no residual classifier needs to be built ahead of time. KNN is extremely simple to implement and lends itself to a wide variety of variations. We propose a new method of KNN classification for spatial data using a new, rich, data-mining-ready structure, the Peano-count-tree (P-tree). We merely perform some AND/OR operations on P-trees to find the nearest neighbors of a new sample and assign the class label. We have fast and efficient algorithms for the AND/OR operations, which reduce the classification time significantly. Instead of taking exactly the k nearest neighbors we form a closed-KNN set. Our experimental results show closed-KNN yields higher classification accuracy as well as significantly higher speed.

Maleq Khan, Qin Ding, William Perrizo
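
To make the closed-KNN set concrete, here is the same idea on plain arrays (our illustration; the chapter computes it with P-tree AND/OR operations instead of explicit distance sorting):

```python
# Closed-KNN: take every training point within the k-th smallest distance
# of the query (so ties at the boundary are kept), then majority-vote.
import numpy as np
from collections import Counter

def closed_knn(X, y, query, k=3):
    d = np.linalg.norm(X - query, axis=1)
    radius = np.partition(d, k - 1)[k - 1]   # k-th smallest distance
    members = y[d <= radius]                 # may contain more than k points
    return Counter(members).most_common(1)[0][0]

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6]])
y = np.array([0, 0, 0, 0, 1, 1])
print(closed_knn(X, y, np.array([0.4, 0.6]), k=3))   # expected: class 0
```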
Interactive Construction of Classification Rules

We introduce an interactive classifier construction system, CVizT, in which the entire process is visualized with a multidimensional visualization technique, Table Lens. The CVizT system is a fully interactive approach: appropriate visualization-based interaction capabilities are provided so that human perception can be included in the construction process. Our experiments with data sets from the UCI repository demonstrate that the CVizT system is straightforward and easily learned. The user's preferences and domain knowledge can also be integrated into the construction process.

Jianchao Han, Nick Cercone

Outliers, Missing Data, and Causation

Enhancing Effectiveness of Outlier Detections for Low Density Patterns

Outlier detection is concerned with discovering exceptional behaviors of objects in data sets. It is becoming an increasingly useful tool in applications such as credit card fraud detection, discovering criminal behaviors in e-commerce, identifying computer intrusions, detecting health problems, etc. In this paper, we introduce a connectivity-based outlier factor (COF) scheme that improves the effectiveness of the existing local outlier factor (LOF) scheme when a pattern itself has a neighbourhood density similar to that of an outlier. We give theoretical and empirical analysis to demonstrate the improvement in effectiveness and the capability of the COF scheme in comparison with the LOF scheme.

Jian Tang, Zhixiang Chen, Ada Wai-chee Fu, David W. Cheung
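
For orientation, the baseline LOF scheme that COF improves upon is available in scikit-learn (COF itself is the chapter's refinement and is not part of that library):

```python
# Baseline LOF: points whose local density is much lower than that of
# their neighbors get strongly negative outlier-factor scores.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(5)
cluster = rng.normal(0, 0.5, size=(100, 2))
outliers = np.array([[4.0, 4.0], [-4.0, 3.5]])
X = np.vstack([cluster, outliers])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)             # -1 marks predicted outliers
scores = lof.negative_outlier_factor_   # more negative = more anomalous
print(labels[-2:], np.round(scores[-2:], 2))
```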
Cluster-Based Algorithms for Dealing with Missing Values

We first survey existing methods to deal with missing values and report the results of an experimental comparative evaluation in terms of their processing cost and the quality of the imputed values. We then propose three cluster-based mean-and-mode algorithms to impute missing values. Experimental results show that these algorithms, with linear complexity, can achieve quality comparable to sophisticated algorithms and are therefore applicable to large datasets.

Yoshikazu Fujikawa, TuBao Ho
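
The cluster-based imputation idea reduces to a few lines for the numeric (mean) case; a sketch (k-means on the complete columns stands in for the chapter's clustering step, and the categorical mode variant is omitted):

```python
# Cluster rows on their fully observed columns, then fill each missing
# value with the mean of that column within its cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 3))
X[:100] += 3                              # two well-separated clusters
X_obs = X.copy()
holes = rng.random(200) < 0.15            # knock holes in column 2 only
X_obs[holes, 2] = np.nan

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_obs[:, :2])
X_imp = X_obs.copy()
for c in (0, 1):
    fill = np.nanmean(X_obs[labels == c, 2])
    X_imp[(labels == c) & holes, 2] = fill

print(np.isnan(X_imp).sum(), "missing values remain")
print("mean absolute error:", round(np.abs(X_imp[holes, 2] - X[holes, 2]).mean(), 3))
```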
Extracting Causation Knowledge from Natural Language Texts

SEKE2 is a semantic expectation-based knowledge extraction system for extracting causation relations from natural language texts. It capitalizes on the human behavior of analyzing information with semantic expectations. The framework of SEKE2 consists of different kinds of generic templates organized in a hierarchical fashion. All the templates are domain independent; they are robust and allow flexible changes for different domains and expected semantics. By associating a causation semantic template with a set of sentence templates, SEKE2 can extract causation knowledge from complex sentences without full-fledged syntactic parsing. To demonstrate the flexibility of SEKE2 across domains, we study the application of causation semantic templates to two domains of news stories: Hong Kong stock market movement and global warming.

Ki Chan, Boon-Toh Low, Wai Lam, Kai-Pui Lam
Mining Relationship Graphs for Effective Business Objectives

A modern organization has two types of customer profiles: active and passive. Active customers contribute to the business goals of the organization, while passive customers are potential candidates that can be converted into active ones. Existing KDD techniques focus mainly on past data generated by active customers. The insights discovered apply well to active customers but may scale poorly to passive ones, because no attempt is made to generate the know-how for converting passive customers into active ones. We propose an algorithm to discover relationship graphs using both types of profiles. Using relationship graphs, an organization can be more effective in realizing its goals.

Kok-Leong Ong, Wee-Keong Ng, Ee-Peng Lim
Backmatter
Metadata
Title
Advances in Knowledge Discovery and Data Mining
Edited by
Ming-Syan Chen
Philip S. Yu
Bing Liu
Copyright year
2002
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-47887-4
Print ISBN
978-3-540-43704-8
DOI
https://doi.org/10.1007/3-540-47887-6