2004 | Book

Advances in Knowledge Discovery and Data Mining

8th Pacific-Asia Conference, PAKDD 2004, Sydney, Australia, May 26-28, 2004. Proceedings

Edited by: Honghua Dai, Ramakrishnan Srikant, Chengqi Zhang

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science

About this book

The Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) has been held every year since 1997. This year, the eighth in the series (PAKDD 2004) was held at the Carlton Crest Hotel, Sydney, Australia, 26–28 May 2004. PAKDD is a leading international conference in the area of data mining. It provides an international forum for researchers and industry practitioners to share their new ideas, original research results and practical development experiences from all KDD-related areas including data mining, data warehousing, machine learning, databases, statistics, knowledge acquisition and automatic scientific discovery, data visualization, causal induction, and knowledge-based systems. The selection process this year was extremely competitive. We received 238 research papers from 23 countries, which is the highest in the history of PAKDD, and reflects the recognition of and interest in this conference. Each submitted research paper was reviewed by three members of the program committee. Following this independent review, there were discussions among the reviewers, and when necessary, additional reviews from other experts were requested. A total of 50 papers were selected as full papers (21%), and another 31 were selected as short papers (13%), yielding a combined acceptance rate of approximately 34%. The conference accommodated both research papers presenting original investigation results and industrial papers reporting real data mining applications and system development experience. The conference also included three tutorials on key technologies of knowledge discovery and data mining, and one workshop focusing on specific new challenges and emerging issues of knowledge discovery and data mining. The PAKDD 2004 program was further enhanced with keynote speeches by two outstanding researchers in the area of knowledge discovery and data mining: Philip Yu, Manager of Software Tools and Techniques, IBM T.J. Watson Research Center.

Table of Contents

Frontmatter

Invited Speeches

Mining of Evolving Data Streams with Privacy Preservation

The data stream domain has become increasingly important in recent years because of its applicability to a wide variety of applications. Problems such as data mining and privacy preservation, which have been studied for traditional data sets, cannot be easily solved for the data stream domain. This is because the large volume of data arriving in a stream renders most algorithms inefficient, as most mining and privacy preservation algorithms require multiple scans of the data, which is unrealistic for stream data. More importantly, the characteristics of the data stream can change over time, and the evolving pattern needs to be captured. In this talk, I’ll discuss these issues and focus on how to mine evolving data streams and preserve privacy.

Philip S. Yu
Data Mining Grand Challenges

The past two decades have seen a huge wave of computational systems for the “digitization” of business operations, from ERP to manufacturing to systems for customer interactions. These systems increased the throughput and efficiency of conducting “transactions” and resulted in an unprecedented build-up of data captured from these systems. The paradoxical reality that most organizations face today is that they have more data about every aspect of their operations and customers, yet they find themselves with an ever diminishing understanding of either. Data mining has received much attention as a technology that can possibly bridge the gap between data and knowledge.

Usama Fayyad

Session 1A: Classification (I)

Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms

Empirical research in learning algorithms for classification tasks generally requires the use of significance tests. The quality of a test is typically judged on Type I error (how often the test indicates a difference when it should not) and Type II error (how often it indicates no difference when it should). In this paper we argue that the replicability of a test is also of importance. We say that a test has low replicability if its outcome strongly depends on the particular random partitioning of the data that is used to perform it. We present empirical measures of replicability and use them to compare the performance of several popular tests in a realistic setting involving standard learning algorithms and benchmark datasets. Based on our results we give recommendations on which test to use.

Remco R. Bouckaert, Eibe Frank
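The abstract above does not give the paper's exact replicability measure; as a rough sketch of the idea (our illustration, not the authors' definition), one could quantify replicability as the fraction of run pairs whose accept/reject decisions agree when the test is repeated on different random partitionings of the data:

```python
def replicability(decisions):
    """Agreement rate over all pairs of runs, where each run repeats the
    significance test on a different random partitioning of the data.
    `decisions` is a list of booleans (test rejected the null or not)."""
    n = len(decisions)
    pairs = n * (n - 1) // 2
    agree = sum(decisions[i] == decisions[j]
                for i in range(n) for j in range(i + 1, n))
    return agree / pairs

# A test whose outcome flips between partitionings has low replicability:
print(replicability([True, True, True, True]))    # 1.0
print(replicability([True, False, True, False]))  # ~0.33
```

A test that is sensitive to the particular split will score near 0.5 under this sketch even when its Type I/II error rates look acceptable.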
Spectral Energy Minimization for Semi-supervised Learning

Data mining problems often involve a large amount of unlabeled data, with very limited known information about the dataset. In such a scenario, semi-supervised learning can often improve classification performance by utilizing unlabeled data for learning. In this paper, we propose a novel approach to semi-supervised learning as an optimization of both the classification energy and the cluster compactness energy in the unlabeled dataset. The resulting integer programming problem is relaxed by a semi-definite relaxation, from which an efficient solution can be obtained. Furthermore, spectral graph methods provide improved energy minimization via the incorporation of additional criteria. Experiments on UCI datasets show promising results.

Chun-hung Li, Zhi-li Wu
Discriminative Methods for Multi-labeled Classification

In this paper we present methods of enhancing existing discriminative classifiers for multi-labeled predictions. Discriminative methods like support vector machines perform very well for uni-labeled text classification tasks. Multi-labeled classification is a harder task that has received relatively little attention. In the multi-labeled setting, classes are often related to each other or are part of an is-a hierarchy. We present a new technique for combining text features and features indicating relationships between classes, which can be used with any discriminative algorithm. We also present two enhancements to the margin of SVMs for building better models in the presence of overlapping classes. We present results of experiments on real-world text benchmark datasets. Our new methods improve on the accuracy of existing methods, with statistically significant improvements.

Shantanu Godbole, Sunita Sarawagi

Session 1B: Clustering (I)

Subspace Clustering of High Dimensional Spatial Data with Noises

Clustering large amounts of high-dimensional spatial data with noise is a difficult challenge in data mining. In this paper, we present a new subspace clustering method, called SCI (Subspace Clustering based on Information), to solve this problem. SCI combines Shannon information with grid-based and density-based clustering techniques. Designing a clustering algorithm is equivalent to constructing an equivalence relationship among data points. Therefore, we propose an equivalence relationship, named density-connected, to identify the main bodies of clusters. For the purpose of noise detection and cluster boundary discovery, we also use the grid approach to devise a new cohesive mechanism that merges border points into clusters and filters out the noise. However, the curse of dimensionality is a well-known serious problem when using a grid approach on high-dimensional data sets, because the number of grid cells grows exponentially with the dimensionality. To strike a compromise between randomness and structure, we propose an automatic method for attribute selection based on Shannon information. With the merit of requiring only one data scan, SCI is very efficient, with a run time linear in the size of the input data set. As shown by our experimental results, SCI is very powerful in discovering arbitrary shapes of clusters.

Chih-Ming Hsu, Ming-Syan Chen
Constraint-Based Graph Clustering through Node Sequencing and Partitioning

This paper proposes a two-step graph partitioning method to discover constrained clusters with an objective function that follows the well-known min-max clustering principle. Compared with traditional approaches, the proposed method has several advantages. Firstly, the objective function not only follows the theoretical min-max principle but also reflects certain practical requirements. Secondly, a new constraint is introduced and solved to suit more application needs while unconstrained methods can only control the number of produced clusters. Thirdly, the proposed method is general and can be used to solve other practical constraints. The experimental studies on word grouping and result visualization show very encouraging results.

Yu Qian, Kang Zhang, Wei Lai
Mining Expressive Process Models by Clustering Workflow Traces

We propose a general framework for the process mining problem which goes beyond the assumption of a workflow schema with local constraints only, being applicable to more expressive specification languages, independently of the particular syntax adopted. In fact, we provide an effective technique for process mining based on the rather unexplored concept of clustering workflow executions, in which clusters of executions sharing the same structure and the same unexpected behavior (w.r.t. the local properties) are seen as a witness of the existence of global constraints. An interesting framework for assessing the similarity between the original model and the discovered one is proposed, along with some experimental results evidencing the validity of our approach.

Gianluigi Greco, Antonella Guzzo, Luigi Pontieri, Domenico Saccà

Session 1C: Association Rules (I)

CMTreeMiner: Mining Both Closed and Maximal Frequent Subtrees

Tree structures are used extensively in domains such as computational biology, pattern recognition, XML databases, computer networks, and so on. One important problem in mining databases of trees is to find frequently occurring subtrees. However, because of the combinatorial explosion, the number of frequent subtrees usually grows exponentially with the size of the subtrees. In this paper, we present CMTreeMiner, a computationally efficient algorithm that discovers all closed and maximal frequent subtrees in a database of rooted unordered trees. The algorithm mines both closed and maximal frequent subtrees by traversing an enumeration tree that systematically enumerates all subtrees, while using an enumeration DAG to prune the branches of the enumeration tree that do not correspond to closed or maximal frequent subtrees. The enumeration tree and the enumeration DAG are defined based on a canonical form for rooted unordered trees–the depth-first canonical form (DFCF). We compare the performance of our algorithm with that of PathJoin, a recently published algorithm that mines maximal frequent subtrees.

Yun Chi, Yirong Yang, Yi Xia, Richard R. Muntz
Secure Association Rule Sharing

The sharing of association rules is often beneficial in industry, but requires privacy safeguards. One may decide to disclose only part of the knowledge and conceal strategic patterns, which we call restrictive rules. These restrictive rules must be protected before sharing, since they are paramount for strategic decisions and need to remain private. To address this challenging problem, we propose a unified framework for protecting sensitive knowledge before sharing. This framework encompasses: (a) an algorithm that sanitizes restrictive rules while blocking some inference channels, which we validate against real and synthetic datasets; and (b) a set of metrics to evaluate attacks against sensitive knowledge and the impact of the sanitization. We also introduce a taxonomy of sanitizing algorithms and a taxonomy of attacks against sensitive knowledge.

Stanley R. M. Oliveira, Osmar R. Zaïane, Yücel Saygin
Self-Similar Mining of Time Association Rules

Although the task of mining association rules has received considerable attention in the literature, algorithms to find time association rules are often inadequate, either missing rules when the time interval is arbitrarily partitioned into equal intervals or clustering the data before the search for high-support itemsets is undertaken. We present an efficient solution to this problem that uses the fractal dimension as an indicator of when the interval needs to be partitioned. The partitions are done with respect to every itemset in consideration, and therefore the algorithm is in a better position to find frequent itemsets that would have been missed otherwise. We present experimental evidence of the efficiency of our algorithm, both in terms of rules that would have been missed by other techniques and in terms of its scalability with respect to the number of transactions and the number of items in the data set.

Daniel Barbará, Ping Chen, Zohreh Nazeri

Session 2A: Novel Algorithms (I)

ParaDualMiner: An Efficient Parallel Implementation of the DualMiner Algorithm

Constraint based mining finds all itemsets that satisfy a set of predicates. Many constraints can be categorised as being either monotone or antimonotone. Dualminer was the first algorithm that could utilise both classes of constraint simultaneously to prune the search space. In this paper, we present two parallel versions of DualMiner. The ParaDualMiner with Simultaneous Pruning efficiently distributes the task of expensive predicate checking among processors with minimum communication overhead. The ParaDualMiner with Random Polling makes further improvements by employing a dynamic subalgebra partitioning scheme and a better communication mechanism. Our experimental results indicate that both algorithms exhibit excellent scalability.

Roger M. H. Ting, James Bailey, Kotagiri Ramamohanarao
A Novel Distributed Collaborative Filtering Algorithm and Its Implementation on P2P Overlay Network

Collaborative filtering (CF) has proved to be one of the most effective information filtering techniques. However, because their computational complexity in both time and space grows quickly as the number of records in the user database increases, traditional centralized CF algorithms suffer from poor scalability. In this paper, we first propose a novel distributed CF algorithm called PipeCF, through which both user database management and the prediction task can be done in a decentralized way. We then propose two novel approaches, significance refinement (SR) and unanimous amplification (UA), to further improve the scalability and prediction accuracy of PipeCF. Finally we give the algorithm framework and system architecture of the implementation of PipeCF on a Peer-to-Peer (P2P) overlay network through the distributed hash table (DHT) method, one of the most popular and effective routing algorithms in P2P networks. The experimental data show that our distributed CF algorithm has much better scalability than traditional centralized ones, with comparable prediction efficiency and accuracy.

Peng Han, Bo Xie, Fan Yang, Jiajun Wang, Ruimin Shen
An Efficient Algorithm for Dense Regions Discovery from Large-Scale Data Streams

We introduce the notion of a dense region as a distinct and meaningful pattern in given data. Efficient and effective algorithms for identifying such regions are presented. Next, we discuss extensions of the algorithms for handling data streams. Finally, experiments on large-scale data streams such as clickstreams confirm the usefulness of our algorithms.

Andy M. Yip, Edmond H. Wu, Michael K. Ng, Tony F. Chan
Blind Data Linkage Using n-gram Similarity Comparisons

Integrating or linking data from different sources is an increasingly important task in the preprocessing stage of many data mining projects. The aim of such linkages is to merge all records relating to the same entity, such as a patient or a customer. If no common unique entity identifiers (keys) are available in all data sources, the linkage needs to be performed using the available identifying attributes, like names and addresses. Data confidentiality often limits or even prohibits successful data linkage, as either no consent can be gained (for example in biomedical studies) or the data holders are not willing to release their data for linkage by other parties. We present methods for confidential data linkage based on hash encoding, public key encryption and n-gram similarity comparison techniques, and show how blind data linkage can be performed.

Tim Churches, Peter Christen
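The protocol above combines hash encoding, public key encryption, and n-gram comparison; the following sketches only the n-gram similarity step, under the assumption that each party applies a keyed hash to its bigram sets with a shared secret so raw names are never exchanged (the key and names here are illustrative, and this omits the paper's encryption layers):

```python
import hashlib
import hmac

def hashed_bigrams(name: str, key: bytes) -> set:
    """Keyed hashes of the bigrams of a normalized name."""
    s = name.lower().strip()
    grams = {s[i:i + 2] for i in range(len(s) - 1)}
    return {hmac.new(key, g.encode(), hashlib.sha256).hexdigest() for g in grams}

def dice(a: set, b: set) -> float:
    """Dice coefficient on the (hashed) bigram sets."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 1.0

key = b"shared-secret"  # agreed upon by the data holders beforehand
sim = dice(hashed_bigrams("christen", key), hashed_bigrams("kristen", key))
print(round(sim, 3))  # 0.769
```

Because identical bigrams hash to identical digests under the same key, set overlap (and hence the similarity score) is preserved without revealing the underlying strings.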
Condensed Representation of Emerging Patterns

Emerging patterns (EPs) are associations of features whose frequencies increase significantly from one class to another. They have been proven useful for building powerful classifiers and for helping establish diagnoses. Because of the huge search space, mining and representing EPs is a hard task for large datasets. Thanks to recent results on condensed representations of frequent closed patterns, we propose here an exact condensed representation of EPs. We also give a method to provide the EPs with the highest growth rates, which we call strong emerging patterns (SEPs). Experiments conducted in collaboration with the Philips company show the interest of SEPs.

Arnaud Soulet, Bruno Crémilleux, François Rioult
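As a minimal illustration of the growth-rate notion underpinning EPs (not the paper's condensed representation itself), the growth rate of an itemset from one class to another can be computed as:

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item of the itemset."""
    items = set(itemset)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def growth_rate(itemset, background, target):
    """Support ratio from the background class to the target class;
    an itemset with a large growth rate is an emerging pattern."""
    s_bg = support(itemset, background)
    s_tg = support(itemset, target)
    if s_bg == 0:
        return float("inf") if s_tg > 0 else 0.0
    return s_tg / s_bg

# Toy data: {"a", "b"} is three times as frequent in pos as in neg.
neg = [["a"], ["a", "b"], ["c"], ["c", "d"]]
pos = [["a", "b"], ["a", "b", "c"], ["a", "b", "d"], ["b"]]
print(growth_rate(["a", "b"], neg, pos))  # 3.0
```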

Session 2B: Association (II)

Discovery of Maximally Frequent Tag Tree Patterns with Contractible Variables from Semistructured Documents

In order to extract meaningful and hidden knowledge from semistructured documents such as HTML or XML files, methods for discovering frequent patterns or common characteristics in semistructured documents have become more and more important. We propose new methods for discovering maximally frequent tree-structured patterns in semistructured Web documents by using tag tree patterns as hypotheses. A tag tree pattern is an edge-labeled tree which has ordered or unordered children and structured variables. An edge label is a tag or a keyword in such Web documents, and a variable can match an arbitrary subtree, which represents a field of a semistructured document. As a special case, a contractible variable can match an empty subtree, which represents a missing field in a semistructured document. Since semistructured documents have irregularities such as missing fields, a tag tree pattern with contractible variables is suited for representing tree-structured patterns in such semistructured documents. First, we present an algorithm for generating all maximally frequent ordered tag tree patterns with contractible variables. Second, we give an algorithm for generating all maximally frequent unordered tag tree patterns with contractible variables.

Tetsuhiro Miyahara, Yusuke Suzuki, Takayoshi Shoudai, Tomoyuki Uchida, Kenichi Takahashi, Hiroaki Ueda
Mining Term Association Rules for Heuristic Query Construction

As the Web has become an important channel of information, users have difficulty identifying what they really want among the huge amounts of low-quality information returned by Web search engines. This is because most users can give only inadequate (or incomplete) expressions of their requirements when querying the Web. In this paper, a heuristic model is proposed for tackling this inadequate query problem. Our approach is based on potentially useful relationships among terms, called term association rules, in a text corpus. To identify quality information, a constraint is designed to capture the goodness of queries. The heuristic information in our model assists users in expressing their desired queries.

Zhenxing Qin, Li Liu, Shichao Zhang
FP-Bonsai: The Art of Growing and Pruning Small FP-Trees

In the context of mining frequent itemsets, numerous strategies have been proposed to push several types of constraints within the most well known algorithms. In this paper, we integrate the recently proposed ExAnte data reduction technique within the FP-growth algorithm. Together, they result in a very efficient frequent itemset mining algorithm that effectively exploits monotone constraints.

Francesco Bonchi, Bart Goethals
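The abstract does not spell out the ExAnte reduction; as a rough sketch of its core idea under a monotone constraint such as "total price ≥ threshold" (the items, prices, and thresholds below are made up, and the details may differ from the paper), the reduction alternates two pruning steps until a fixpoint:

```python
def exante_reduce(transactions, price, min_sup, min_total):
    """Alternate mu-reduction (drop transactions violating the monotone
    constraint) and alpha-reduction (drop items below min_sup) to a fixpoint."""
    txs = [set(t) for t in transactions]
    while True:
        # mu-reduction: a transaction failing the monotone constraint
        # cannot contain any itemset that satisfies it.
        kept = [t for t in txs if sum(price[i] for i in t) >= min_total]
        counts = {}
        for t in kept:
            for i in t:
                counts[i] = counts.get(i, 0) + 1
        # alpha-reduction: infrequent items can never be in a frequent itemset.
        frequent = {i for i, c in counts.items() if c >= min_sup}
        reduced = [t & frequent for t in kept]
        reduced = [t for t in reduced if t]
        if reduced == txs:
            return txs
        txs = reduced

price = {"a": 5, "b": 3, "c": 1}
data = [["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
print(exante_reduce(data, price, min_sup=2, min_total=6))
```

Each step can enable further pruning by the other, which is why the loop runs to a fixpoint; FP-Bonsai interleaves this shrinking with FP-tree construction.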
Mining Negative Rules Using GRD

GRD is an algorithm for k-most interesting rule discovery. In contrast to association rule discovery, GRD does not require the use of a minimum support constraint. Rather, the user must specify a measure of interestingness and the number of rules sought (k). This paper reports efficient techniques to extend GRD to support mining of negative rules. We demonstrate that the new approach provides tractable discovery of both negative and positive rules.

Dhananjay R. Thiruvady, Geoff I. Webb
Applying Association Rules for Interesting Recommendations Using Rule Templates

In this paper, we propose a new method of applying association rules in recommendation systems. Association rule algorithms are used to discover associations among items in transaction datasets. However, applying association rule algorithms directly to make recommendations usually generates too many rules; thus, it is difficult to find interesting recommendations for users among so many rules. Rule templates define certain types of rules; they are therefore one of the interestingness measures that reduce the number of rules that do not interest users. By defining more appropriate rule templates, our method extracts interesting rules for users in a recommendation system. Experimental results show that our method increases the accuracy of recommendations.

Jiye Li, Bin Tang, Nick Cercone
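As an illustration of how a rule template can filter mined rules (a simplified sketch; the item-to-category representation is our assumption, not necessarily the paper's):

```python
def matches_template(rule, template, category):
    """rule: (antecedent, consequent) as item tuples;
    template: (antecedent_category, consequent_category);
    category: maps each item to its category."""
    ante, cons = rule
    a_cat, c_cat = template
    return (all(category[i] == a_cat for i in ante)
            and all(category[i] == c_cat for i in cons))

category = {"bread": "food", "milk": "food", "novel": "book", "atlas": "book"}
rules = [(("bread",), ("milk",)),
         (("bread",), ("novel",)),
         (("novel",), ("atlas",))]
template = ("food", "book")  # e.g. recommend books to food buyers
print([r for r in rules if matches_template(r, template, category)])
# [(('bread',), ('novel',))]
```

Only rules of the templated shape survive, which is how templates prune the flood of mined rules down to candidate recommendations.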

Session 2C: Classification (II)

Feature Extraction and Classification System for Nonlinear and Online Data

A novel incremental feature extraction and classification system is proposed. Kernel PCA is a well-known nonlinear feature extraction method. One problem with Kernel PCA is that the computation becomes prohibitive when the data set is large. Another problem is that, in order to update the eigenvectors with new data, the whole eigenspace must be recomputed. The proposed feature extraction method overcomes these problems by incremental eigenspace updates and by using the empirical kernel map as the kernel function. The proposed method requires less memory than Kernel PCA and can easily be improved by re-learning the data. For classification, the extracted features are used as input to a Least Squares SVM. Our experiments show that the proposed feature extraction method is comparable in performance to Kernel PCA, and that the proposed classification system achieves high classification performance on UCI benchmark data and the NIST handwritten data set.

Byung Joo Kim, Il Kon Kim, Kwang Baek Kim
A Metric Approach to Building Decision Trees Based on Goodman-Kruskal Association Index

We introduce a numerical measure on sets of partitions of finite sets that is linked to the Goodman-Kruskal association index commonly used in statistics. This measure allows us to define a metric on such partitions, which we use for constructing decision trees. Experimental results suggest that by replacing the usual splitting criterion used in C4.5 with a metric criterion based on the Goodman-Kruskal coefficient it is possible, in most cases, to obtain smaller decision trees without sacrificing accuracy.

Dan A. Simovici, Szymon Jaroszewicz
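One common Goodman-Kruskal index, the lambda coefficient (the paper may use a related variant), measures the proportional reduction in error when predicting the class labels from a candidate split, and can be computed from two label sequences:

```python
from collections import Counter, defaultdict

def goodman_kruskal_lambda(x, y):
    """Proportional reduction in the error of predicting y once x is known."""
    n = len(y)
    joint = defaultdict(Counter)
    for xi, yi in zip(x, y):
        joint[xi][yi] += 1
    best_marginal = max(Counter(y).values())  # best guess without knowing x
    best_conditional = sum(max(c.values()) for c in joint.values())
    denom = n - best_marginal
    return (best_conditional - best_marginal) / denom if denom else 0.0

# x perfectly predicts y -> lambda = 1; x is useless -> lambda = 0.
print(goodman_kruskal_lambda([0, 0, 1, 1], ["a", "a", "b", "b"]))  # 1.0
print(goodman_kruskal_lambda([0, 1, 0, 1], ["a", "a", "b", "b"]))  # 0.0
```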
DRC-BK: Mining Classification Rules with Help of SVM

The accuracy of SVM classifiers is very high, but the classification model of an SVM classifier is not understandable by human experts. In this paper, we use an SVM with a Boolean kernel to construct a hyperplane for classification, and mine classification rules from this hyperplane. In this way, we build DRC-BK, a decision rule classifier. Experimental results show that DRC-BK has higher accuracy than several state-of-the-art decision rule (decision tree) classifiers, such as C4.5, CBA, CMAR, and CAEP.

Yang Zhang, Zhanhuai Li, Yan Tang, Kebin Cui
A New Data Mining Method Using Organizational Coevolutionary Mechanism

The organizational coevolutionary algorithm for classification (OCEC) is designed with the intrinsic properties of data mining in mind. OCEC evolves groups of examples, and rules are then extracted from these groups at the end of evolution. OCEC is first compared with G-NET and JoinGA. All results show that OCEC achieves higher predictive accuracy. Then, the scalability of OCEC is studied. The results show that the classification time of OCEC increases linearly.

Jing Liu, Weicai Zhong, Fang Liu, Licheng Jiao
Noise Tolerant Classification by Chi Emerging Patterns

Classification is an important data mining problem. A desirable property of a classifier is noise tolerance. Emerging Patterns (EPs) are itemsets whose supports change significantly from one data class to another. In this paper, we first introduce Chi Emerging Patterns (Chi EPs), which are more resistant to noise than other kinds of EPs. We then use Chi EPs in a probabilistic approach for classification. The classifier, Bayesian Classification by Chi Emerging Patterns (BCCEP), can handle noise very well due to the inherent noise tolerance of the Bayesian approach and high quality patterns used in the probability approximation. The empirical study shows that our method is superior to other well-known classification methods such as NB, C4.5, SVM and JEP-C in terms of overall predictive accuracy, on “noisy” as well as “clean” benchmark datasets from the UCI Machine Learning Repository. Out of the 116 cases, BCCEP wins on 70 cases, NB wins on 30, C4.5 wins on 33, SVM wins on 32 and JEP-C wins on 21.

Hongjian Fan, Kotagiri Ramamohanarao
The Application of Emerging Patterns for Improving the Quality of Rare-Class Classification

The classification of rare cases is a challenging problem in many real life applications. The scarcity of the rare cases makes it difficult for traditional classifiers to classify them correctly. In this paper, we propose a new approach to use emerging patterns (EPs) [3] in rare-class classification (EPRC). Traditional EP-based classifiers [2] fail to achieve accepted results when dealing with rare cases. EPRC overcomes this problem by applying three improving stages: generating new undiscovered EPs for the rare class, pruning low-support EPs, and increasing the supports of the rare-class EPs. An experimental evaluation carried out on a number of rare-class databases shows that EPRC outperforms EP-based classifiers as well as other classification methods such as PNrule [1], Metacost [6], and C4.5 [7].

Hamad Alhammady, Kotagiri Ramamohanarao

Session 3A: Event Mining, Anomaly Detection, and Intrusion Detection

Finding Negative Event-Oriented Patterns in Long Temporal Sequences

Pattern discovery in a long temporal event sequence is of great importance in many application domains. Most of the previous work focuses on identifying positive associations among time-stamped event types. In this paper, we introduce the problem of defining and discovering negative associations that, like positive rules, may also serve as a source of knowledge discovery. In general, an event-oriented pattern is a pattern associated with a selected type of event, called a target event. As a counterpart of previous research, we identify patterns that have a negative relationship with the target events. A set of criteria is defined to evaluate the interestingness of patterns associated with such negative relationships. For counting the frequency of a pattern, we propose a new approach, called unique minimal occurrence, which guarantees that the Apriori property holds for all patterns in a long sequence. Based on the interestingness measures, algorithms are proposed to discover potentially interesting patterns for this negative rule problem. Finally, experiments are conducted on a real application.

Xingzhi Sun, Maria E. Orlowska, Xue Li
OBE: Outlier by Example

Outlier detection in large datasets is an important problem. There are several recent approaches that employ very reasonable definitions of an outlier. However, a fundamental issue is that the notion of which objects are outliers typically varies between users or, even, datasets. In this paper, we present a novel solution to this problem, by bringing users into the loop. Our OBE (Outlier By Example) system is, to the best of our knowledge, the first that allows users to give some examples of what they consider as outliers. Then, it can directly incorporate a small number of such examples to successfully discover the hidden concept and spot further objects that exhibit the same “outlier-ness” as the examples. We describe the key design decisions and algorithms in building such a system and demonstrate on both real and synthetic datasets that OBE can indeed discover outliers that match the users’ intentions.

Cui Zhu, Hiroyuki Kitagawa, Spiros Papadimitriou, Christos Faloutsos
Temporal Sequence Associations for Rare Events

In many real world applications, systematic analysis of rare events, such as credit card frauds and adverse drug reactions, is very important. Their low occurrence rate in large databases often makes it difficult to identify the risk factors from straightforward application of associations and sequential pattern discovery. In this paper we introduce a heuristic to guide the search for interesting patterns associated with rare events from large temporal event sequences. Our approach combines association and sequential pattern discovery with a measure of risk borrowed from epidemiology to assess the interestingness of the discovered patterns. In the experiments, we successfully identify a known drug and several new drug combinations with high risk of adverse reactions. The approach is also applicable to other applications where rare events are of primary interest.

Jie Chen, Hongxing He, Graham Williams, Huidong Jin
Summarization of Spacecraft Telemetry Data by Extracting Significant Temporal Patterns

This paper presents a method to summarize massive spacecraft telemetry data by extracting significant event and change patterns in the low-level time-series data. This method first transforms the numerical time-series into a symbol sequence by a clustering technique using DTW distance measure, then detects event patterns and change points in the sequence. We demonstrate that our method can successfully summarize the large telemetry data of an actual artificial satellite, and help human operators to understand the overall system behavior.

Takehisa Yairi, Shiro Ogasawara, Koichi Hori, Shinichi Nakasuka, Naoki Ishihama
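For reference, the DTW distance used in the symbolization step can be sketched with the standard dynamic program (this is textbook DTW, not the paper's specific clustering pipeline):

```python
def dtw(a, b):
    """Classic dynamic-time-warping distance between two numeric series."""
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # stretch a
                                 D[i][j - 1],      # stretch b
                                 D[i - 1][j - 1])  # match step
    return D[n][m]

# DTW tolerates local stretching that Euclidean distance would penalize:
print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```

This tolerance to local time shifts is what makes DTW a natural distance for clustering telemetry segments whose events drift in phase.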
An Extended Negative Selection Algorithm for Anomaly Detection

This paper proposes an extended negative selection algorithm for anomaly detection. Unlike previously proposed negative selection algorithms, which do not make use of non-self data, the extended negative selection algorithm first acquires prior knowledge about the characteristics of the problem space from historical sample data by using machine learning techniques. Such data consists of both self data and non-self data. The acquired prior knowledge is represented in the form of production rules and is thus viewed as common schemata that characterise the two subspaces, the self-subspace and the non-self-subspace, and provide important information for the generation of detection rules. One advantage of our approach is that it does not rely on a structured representation of the data and can be applied to general anomaly detection. To test its effectiveness, we conduct experiments with the public Iris data set and the KDD’99 data set.

Xiaoshu Hang, Honghua Dai
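The classical negative selection scheme that the paper extends can be sketched as follows, using an r-contiguous-bits matching rule on bit strings (the parameters and matching rule are our illustrative choices, not the paper's extended algorithm):

```python
import random

def rcb_match(a, b, r):
    """True if two equal-length bit strings agree on r contiguous positions."""
    return any(a[i:i + r] == b[i:i + r] for i in range(len(a) - r + 1))

def generate_detectors(self_set, n_detectors, length, r, rng):
    """Negative selection: keep random candidates matching no self string."""
    detectors = []
    while len(detectors) < n_detectors:
        cand = "".join(rng.choice("01") for _ in range(length))
        if not any(rcb_match(cand, s, r) for s in self_set):
            detectors.append(cand)
    return detectors

def is_anomalous(sample, detectors, r):
    """A sample matched by any surviving detector is flagged as non-self."""
    return any(rcb_match(sample, d, r) for d in detectors)

rng = random.Random(0)
self_set = {"0000", "0001"}
detectors = generate_detectors(self_set, 5, 4, 2, rng)
# By construction, no detector ever fires on the self set:
print(all(not is_anomalous(s, detectors, 2) for s in self_set))  # True
```

The paper's extension differs in that it also learns from non-self samples, mining production rules that guide detector generation rather than relying on purely random candidates.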
Adaptive Clustering for Network Intrusion Detection

A major challenge in network intrusion detection is how to perform anomaly detection. In practice, the characteristics of network traffic are typically non-stationary, and can vary over time. In this paper, we present a solution to this problem by developing a time-varying modification of a standard clustering technique, which means we can automatically accommodate non-stationary traffic distributions. In addition, we demonstrate how feature weighting can improve the classification accuracy of our anomaly detection system for certain types of attacks.

Joshua Oldmeadow, Siddarth Ravinutala, Christopher Leckie
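The abstract does not spell out the time-varying modification; one common way to let cluster centroids track non-stationary traffic is an exponentially weighted update. A minimal sketch (our illustration, with hypothetical names, not the authors' algorithm):

```python
def update_centroid(centroid, point, alpha=0.1):
    """Exponentially weighted centroid update: recent traffic dominates,
    so the centroid drifts along with a non-stationary distribution."""
    return [(1 - alpha) * c + alpha * x for c, x in zip(centroid, point)]

# A centroid repeatedly fed traffic near (10, 10) drifts toward it.
centroid = [0.0, 0.0]
for _ in range(100):
    centroid = update_centroid(centroid, [10.0, 10.0])
```

Here `alpha` controls how quickly old traffic is forgotten; larger values adapt faster but are noisier.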

Session 3B: Ensemble Learning

Ensembling MML Causal Discovery

This paper presents an ensemble MML approach for the discovery of causal models. The component learners are formed based on the MML causal induction methods. Six different ensemble causal induction algorithms are proposed. Our experimental results reveal that (1) the ensemble MML causal induction approach achieves an improved result compared with any single learner in terms of learning accuracy and correctness; (2) among all the ensemble causal induction algorithms examined, the weighted voting without seeding algorithm outperforms all the rest; (3) the ensemble causal induction algorithms appear to alleviate the local minimum problem. The only drawback of this method is that the time complexity is increased by a factor of δ, where δ is the ensemble size.

Honghua Dai, Gang Li, Zhi-Hua Zhou
Logistic Regression and Boosting for Labeled Bags of Instances

In this paper we upgrade linear logistic regression and boosting to multi-instance data, where each example consists of a labeled bag of instances. This is done by connecting predictions for individual instances to a bag-level probability estimate by simple averaging and maximizing the likelihood at the bag level—in other words, by assuming that all instances contribute equally and independently to a bag’s label. We present empirical results for artificial data generated according to the underlying generative model that we assume, and also show that the two algorithms produce competitive results on the Musk benchmark datasets.

Xin Xu, Eibe Frank
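The bag-level averaging described above can be sketched in a few lines; a minimal illustration (hypothetical names, not the authors' implementation) of turning instance-level logistic predictions into a bag-level probability and likelihood:

```python
import math

def instance_prob(x, w, b):
    """Instance-level logistic model P(y=1 | instance)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-score))

def bag_prob(bag, w, b):
    """Bag-level estimate: a simple average of instance probabilities,
    i.e. every instance contributes equally and independently."""
    return sum(instance_prob(x, w, b) for x in bag) / len(bag)

def bag_log_likelihood(bags, labels, w, b):
    """Likelihood maximized at the bag level, as in the abstract."""
    ll = 0.0
    for bag, y in zip(bags, labels):
        p = bag_prob(bag, w, b)
        ll += math.log(p) if y == 1 else math.log(1.0 - p)
    return ll
```

Fitting then amounts to maximizing `bag_log_likelihood` over `w` and `b` with any standard optimizer.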
Fast and Light Boosting for Adaptive Mining of Data Streams

Supporting continuous mining queries on data streams requires algorithms that (i) are fast, (ii) make light demands on memory resources, and (iii) adapt easily to concept drift. We propose a novel boosting ensemble method that achieves these objectives. The technique is based on a dynamic sample-weight assignment scheme that achieves the accuracy of traditional boosting without requiring multiple passes through the data. The technique assures faster learning and competitive accuracy using simpler base models. The scheme is then extended to handle concept drift via change detection. The change detection approach targets significant data changes that could cause serious deterioration of the ensemble performance, and replaces the obsolete ensemble with one built from scratch. Experimental results confirm the advantages of our adaptive boosting scheme over previous approaches.

Fang Chu, Carlo Zaniolo
Compact Dual Ensembles for Active Learning

Generic ensemble methods can achieve excellent learning performance, but are not good candidates for active learning because of their different design purposes. We investigate how to use diversity of the member classifiers of an ensemble for efficient active learning. We empirically show, using benchmark data sets, that (1) to achieve a good (stable) ensemble, the number of classifiers needed in the ensemble varies for different data sets; (2) feature selection can be applied for classifier selection from ensembles to construct compact ensembles with high performance. Benchmark data sets and a real-world application are used to demonstrate the effectiveness of the proposed approach.

Amit Mandvikar, Huan Liu, Hiroshi Motoda
On the Size of Training Set and the Benefit from Ensemble

In this paper, the impact of the size of the training set on the benefit from ensemble, i.e., the gains obtained by employing ensemble learning paradigms, is empirically studied. Experiments on Bagged/Boosted J4.8 decision trees with/without pruning show that enlarging the training set tends to improve the benefit from Boosting but does not significantly impact the benefit from Bagging. This phenomenon is then explained from the view of bias-variance reduction. Moreover, it is shown that even for Boosting, the benefit does not always increase consistently with the training set size, since single learners may sometimes learn relatively more from additional, randomly provided training data than ensembles do. Furthermore, it is observed that the benefit from an ensemble of unpruned decision trees is usually bigger than that from an ensemble of pruned decision trees. This phenomenon is then explained from the view of error-ambiguity balance.

Zhi-Hua Zhou, Dan Wei, Gang Li, Honghua Dai

Session 3C: Bayesian Network and Graph Mining

Identifying Markov Blankets Using Lasso Estimation

Determining the causal relations among attributes in a domain is a key task in data mining and knowledge discovery. The Minimum Message Length (MML) principle has demonstrated its ability in discovering linear causal models from training data. To explore ways to improve efficiency, this paper proposes a novel Markov Blanket identification algorithm based on the Lasso estimator. For each variable, this algorithm first generates a Lasso tree, which represents a pruned candidate set of possible feature sets. The Minimum Message Length principle is then employed to evaluate all those candidate feature sets, and the feature set with the minimum message length is chosen as the Markov Blanket. Our experimental results show the effectiveness of this algorithm. In addition, this algorithm can be used to prune the search space of causal discovery, and further reduce the computational cost of score-based causal discovery algorithms.

Gang Li, Honghua Dai, Yiqing Tu
Selective Augmented Bayesian Network Classifiers Based on Rough Set Theory

The naive Bayes classifier is widely used in interactive applications due to its computational efficiency, direct theoretical basis, and competitive accuracy. However, its attribute independence assumption can result in sub-optimal accuracy. A number of techniques have explored simple relaxations of the attribute independence assumption in order to increase accuracy. TAN is a state-of-the-art extension of naive Bayes that can express limited forms of inter-dependence among attributes. Rough set theory provides tools for expressing inexact or partial dependencies within a dataset. In this paper, we present a variant of TAN using rough set theory and compare their tree classifier structures; the variant can be thought of as a selective restricted tree Bayesian classifier. It delivers lower error than both pre-existing TAN-based classifiers, with substantially less computation than is required by the SuperParent approach.

Zhihai Wang, Geoffrey I. Webb, Fei Zheng
Using Self-Consistent Naive-Bayes to Detect Masquerades

To gain access to account privileges, an intruder masquerades as the proper account user. This paper proposes a new strategy for detecting masquerades in a multiuser system. To detect masquerading sessions, one profile of command usage is built from the sessions of the proper user, and a second profile is built from the sessions of the remaining known users. The sequence of the commands in the sessions is reduced to a histogram of commands, and the naive-Bayes classifier is used to decide the identity of new incoming sessions. The standard naive-Bayes classifier is extended to take advantage of information from new unidentified sessions. On the basis of the current profiles, a newly presented session is first assigned a probability of being a masquerading session, and then the profiles are updated to reflect the new session. As prescribed by the expectation-maximization algorithm, this procedure is iterated until both the probabilities and the profiles are self-consistent. Experiments on a standard artificial dataset demonstrate that this self-consistent naive-Bayes classifier beats the previous best-performing detector and reduces the missed-alarm rate by 40%.

Kwong H. Yung
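The core of the scheme above is a Bayes-rule posterior for each new session under two multinomial command profiles; EM then re-estimates the profiles with these posteriors as fractional counts until both are self-consistent. A minimal sketch of the posterior step (our illustration, with hypothetical profiles):

```python
import math

def session_loglik(hist, profile):
    """Log-likelihood of a command histogram under a multinomial profile."""
    return sum(n * math.log(profile[c]) for c, n in hist.items())

def masquerade_posterior(hist, self_prof, other_prof, prior=0.5):
    """P(masquerade | session) via Bayes' rule; the EM iteration would feed
    these posteriors back as fractional counts to update both profiles."""
    ls = session_loglik(hist, self_prof) + math.log(1.0 - prior)
    lo = session_loglik(hist, other_prof) + math.log(prior)
    m = max(ls, lo)  # subtract the max for numerical stability
    return math.exp(lo - m) / (math.exp(ls - m) + math.exp(lo - m))

# A burst of commands the proper user rarely issues looks like a masquerade.
p = masquerade_posterior({"rm": 10}, {"ls": 0.9, "rm": 0.1},
                         {"ls": 0.1, "rm": 0.9})
```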
DB-Subdue: Database Approach to Graph Mining

In contrast to mining over transactional data, graph mining is done over structured data represented in the form of a graph. Data having structural relationships lends itself to graph mining. Subdue is one of the early main-memory graph mining algorithms; it detects the best substructure that compresses a graph using the minimum description length principle. The database approach to graph mining presented in this paper overcomes the performance and scalability problems inherent in main-memory algorithms. The focus of this paper is the development of graph mining algorithms (specifically Subdue) using SQL and stored procedures in a relational database environment. We have not only shown how the Subdue class of algorithms can be translated to SQL-based algorithms, but also demonstrated that scalability can be achieved without sacrificing performance.

Sharma Chakravarthy, Ramji Beera, Ramanathan Balachandran

Session 3D: Text Mining (I)

Finding Frequent Structural Features among Words in Tree-Structured Documents

Many electronic documents such as SGML/HTML/XML files and LaTeX files have tree structures. Such documents are called tree-structured documents. Many tree-structured documents contain large plain texts. In order to extract structural features among words from tree-structured documents, we consider the problem of finding frequent structured patterns among words in tree-structured documents. Let k ≥ 2 be an integer and (W1, W2, ..., Wk) a list of words sorted in lexicographical order. A consecutive path pattern on (W1, W2, ..., Wk) is a sequence ⟨t1; t2; ...; tk−1⟩ of labeled rooted ordered trees such that, for i = 1, 2, ..., k−1, (1) ti consists of only one node having the pair (Wi, Wi+1) as its label, or (2) ti has just two nodes whose degrees are one and which are labeled with Wi and Wi+1, respectively. We present a data mining algorithm for finding all frequent consecutive path patterns in tree-structured documents. By reporting experimental results on our algorithm, we then show that it is efficient for extracting structural features from tree-structured documents.

Tomoyuki Uchida, Tomonori Mogawa, Yasuaki Nakamura
Exploring Potential of Leave-One-Out Estimator for Calibration of SVM in Text Mining

This paper investigates a number of techniques for calibrating the output of a Support Vector Machine in order to provide a posterior probability P(target class | instance). Five basic calibration techniques are combined with five ways of correcting the SVM scores on the training set. The calibration techniques used are the addition of a simple ramp function, allocation of a Gaussian density, fitting of a sigmoid to the output, and two binning techniques. The correction techniques include three methods based on recent theoretical advances in leave-one-out estimators and two that are variants of a hold-out validation set. This leads to thirty different settings (including calibration on uncorrected scores). All thirty methods are evaluated for two linear SVMs (one with linear and one with quadratic penalty) and for the ridge regression model (regularisation network) on three categories of the Reuters Newswires benchmark and the WebKB dataset. The performance of these methods is compared to the probabilities generated by a naive Bayes classifier as well as a calibrated centroid classifier. The main conclusions of this research are: (i) simple calibrators such as ramps and sigmoids perform remarkably well; (ii) score correctors using leave-one-out techniques can perform better than those using validation sets; however, cross-validation methods allow more reliable estimation of test error from the training data.

Adam Kowalczyk, Bhavani Raskutti, Herman Ferrá
Classifying Text Streams in the Presence of Concept Drifts

Concept drift is an interesting and persistent problem: for instance, a user who is interested in a set of topics X for one period may switch to a different set of topics Y in the next period. In this paper, we focus on two issues of concept drift, namely drift detection and model adaptation, in a text stream context. We use statistical control to detect concept drifts, and propose a new multi-classifier strategy for model adaptation. We conducted extensive experiments and report our findings in this paper.

Gabriel Pui Cheong Fung, Jeffrey Xu Yu, Hongjun Lu
Using Cluster-Based Sampling to Select Initial Training Set for Active Learning in Text Classification

We propose a method of selecting initial training examples for active learning so that it can reach high performance faster with fewer further queries. Our method divides the unlabeled examples into clusters of similar ones and then selects from each cluster the most representative example, the one closest to the cluster's centroid. These representative examples are labeled by the user and become the members of the initial training set. We also promote the inclusion of what we call model examples in the initial training set. Although the model examples, which are in fact the centroids of the clusters, are not real examples, their contribution to the enhancement of classification accuracy is significant because they represent a group of similar examples so well. Experiments with various text data sets have shown that an active learner starting from the initial training set selected by our method reaches higher accuracy faster than one starting from a randomly selected initial training set.

Jaeho Kang, Kwang Ryel Ryu, Hyuk-Chul Kwon
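The selection step described above, clustering the unlabeled pool and taking the example nearest each centroid, can be sketched with a plain K-means (our simplified illustration with deterministic initialization for reproducibility, not the authors' code):

```python
def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    return [sum(xs) / len(pts) for xs in zip(*pts)]

def kmeans(points, k, iters=20):
    """Plain K-means, initialized from the first k points for determinism
    (K-means++ would be a better choice in practice)."""
    centroids = points[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, centroids[i]))].append(p)
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids

def initial_training_set(unlabeled, k):
    """From each cluster, pick the real example closest to the centroid."""
    return [min(unlabeled, key=lambda p: dist2(p, c))
            for c in kmeans(unlabeled, k)]

# Two well-separated groups: one representative is chosen from each.
pts = [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5],
       [10.0, 10.0], [10.5, 10.0], [10.0, 10.5]]
reps = initial_training_set(pts, 2)
```

The returned representatives would then be handed to the user for labeling as the initial training set.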
Spectral Analysis of Text Collection for Similarity-Based Clustering

Clustering of natural text collections is generally difficult due to the high dimensionality, heterogeneity, and large size of text collections. These characteristics compound the problem of determining the appropriate similarity space for clustering algorithms. In this paper, we propose to use spectral analysis of the similarity space of a text collection to predict clustering behavior before actual clustering is performed. Spectral analysis is a technique that has been adopted across different domains to analyze the key encoding information of a system. Spectral analysis for prediction is useful in first determining the quality of the similarity space and discovering any possible problems the selected feature set may present. Our experiments showed that such insights can be obtained by analyzing the spectrum of the similarity matrix of a text collection. We also showed that spectral analysis can be used to estimate the number of clusters in advance.

Wenyuan Li, Wee-Keong Ng, Ee-Peng Lim

Session 4A: Clustering (II)

Clustering Multi-represented Objects with Noise

Traditional clustering algorithms are based on one representation space, usually a vector space. However, in a variety of modern applications, multiple representations exist for each object. Molecules for example are characterized by an amino acid sequence, a secondary structure and a 3D representation. In this paper, we present an efficient density-based approach to cluster such multi-represented data, taking all available representations into account. We propose two different techniques to combine the information of all available representations dependent on the application. The evaluation part shows that our approach is superior to existing techniques.

Karin Kailing, Hans-Peter Kriegel, Alexey Pryakhin, Matthias Schubert
Providing Diversity in K-Nearest Neighbor Query Results

Given a point query Q in multi-dimensional space, K-Nearest Neighbor (KNN) queries return the K closest answers in the database with respect to Q. In this scenario, it is possible that a majority of the answers may be very similar to one or more of the other answers, especially when the data has clusters. For a variety of applications, such homogeneous result sets may not add value to the user. In this paper, we consider the problem of providing diversity in the results of KNN queries, that is, to produce the closest result set such that each answer is sufficiently different from the rest. We first propose a user-tunable definition of diversity, and then present an algorithm, called MOTLEY, for producing a diverse result set as per this definition. Through a detailed experimental evaluation we show that MOTLEY can produce diverse result sets by reading only a small fraction of the tuples in the database. Further, it imposes no additional overhead on the evaluation of traditional KNN queries, thereby providing a seamless interface between diversity and distance.

Anoop Jain, Parag Sarda, Jayant R. Haritsa
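The paper's MOTLEY algorithm is more sophisticated, but the underlying idea, returning near answers that are pairwise sufficiently different, can be illustrated with a greedy filter over distance-sorted candidates (our simplification, with hypothetical names):

```python
def diverse_knn(query, data, k, min_gap, dist):
    """Greedy diverse-KNN: scan candidates in order of distance to the
    query, keeping one only if it lies at least `min_gap` away from
    every answer already kept."""
    result = []
    for p in sorted(data, key=lambda p: dist(p, query)):
        if all(dist(p, q) >= min_gap for q in result):
            result.append(p)
            if len(result) == k:
                break
    return result

# 1-D example: 1.1 and 1.2 are crowded out by 1.0, so 5.0 is returned.
answers = diverse_knn(0.0, [1.0, 1.1, 1.2, 5.0, 9.0], k=2, min_gap=2.0,
                      dist=lambda a, b: abs(a - b))
```

Here `min_gap` plays the role of the user-tunable diversity definition mentioned in the abstract.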
Cluster Structure of K-means Clustering via Principal Component Analysis

K-means clustering is a popular data clustering algorithm. Principal component analysis (PCA) is a widely used statistical technique for dimension reduction. Here we prove that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering, with a clear simplex cluster structure. Our results prove that PCA-based dimension reductions are particularly effective for K-means clustering. A new lower bound for the K-means objective function is derived: the total variance minus the sum of the leading eigenvalues of the data covariance matrix.

Chris Ding, Xiaofeng He
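For K = 2 in two dimensions the stated bound reads J2 ≥ J1 − λ1, where J1 is the total scatter and λ1 the largest eigenvalue of the scatter matrix. A small brute-force check of this inequality (our illustration, not the authors' derivation):

```python
from itertools import combinations

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
n = len(points)

def scatter(pts):
    """Sum of squared distances to the group mean."""
    m = [sum(c) / len(pts) for c in zip(*pts)]
    return sum(sum((x - mx) ** 2 for x, mx in zip(p, m)) for p in pts)

J1 = scatter(points)  # 1-means objective: total scatter

# Optimal 2-means objective by brute force over all 2-partitions.
J2 = min(scatter([points[i] for i in A]) +
         scatter([points[i] for i in range(n) if i not in A])
         for r in range(1, n) for A in combinations(range(n), r))

# Largest eigenvalue of the 2x2 scatter matrix S = sum (p - m)(p - m)^T.
m = [sum(c) / n for c in zip(*points)]
sxx = sum((p[0] - m[0]) ** 2 for p in points)
syy = sum((p[1] - m[1]) ** 2 for p in points)
sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
lam1 = (sxx + syy + ((sxx - syy) ** 2 + 4 * sxy ** 2) ** 0.5) / 2

lower_bound = J1 - lam1  # the PCA-based lower bound on J2
```

On this toy data the optimal 2-means objective (8/3) indeed stays above the bound (2.0).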
Combining Clustering with Moving Sequential Pattern Mining: A Novel and Efficient Technique

Sequential pattern mining is a well-studied problem. In the context of mobile computing, moving sequential patterns, which reflect the moving behavior of mobile users, have recently attracted researchers' interest. In this paper a novel and efficient technique is proposed to mine moving sequential patterns. First, the idea of clustering is introduced to process the original moving histories into moving sequences as a preprocessing step. Then an efficient algorithm called PrefixTree is presented to mine the moving sequences. A performance study shows that PrefixTree outperforms the LM algorithm, revised to mine moving sequences, on large moving sequence databases.

Shuai Ma, Shiwei Tang, Dongqing Yang, Tengjiao Wang, Jinqiang Han
An Alternative Methodology for Mining Seasonal Pattern Using Self-Organizing Map

In the retail industry, it is very important to understand seasonal sales patterns, because this knowledge can assist decision makers in managing inventory and formulating marketing strategies. The Self-Organizing Map (SOM) is suitable for extracting and illustrating essential structures because SOM has unsupervised learning and topology-preserving properties, as well as prominent visualization techniques. In this paper, we propose a method for seasonal pattern analysis using the Self-Organizing Map. A performance test with real-world data from stationery stores in Indonesia shows that the method is effective for seasonal pattern analysis. The results are used to formulate several marketing and inventory management strategies.

Denny, Vincent C. S. Lee

Session 4B: Association (III)

ISM: Item Selection for Marketing with Cross-Selling Considerations

Many different algorithms for association rules have been studied in the data mining literature. Some researchers are now focusing on applications of association rules. In this paper, we study one of these applications, called Item Selection for Marketing (ISM), which takes the cross-selling effect into consideration. The ISM problem is to find a subset of items as marketing items in order to boost the sales of the store. We prove that a simple version of this problem is NP-hard. We propose an algorithm to deal with this problem. Experiments are conducted to show that the algorithm is effective and efficient.

Raymond Chi-Wing Wong, Ada Wai-Chee Fu
Efficient Pattern-Growth Methods for Frequent Tree Pattern Mining

Mining frequent tree patterns is an important research problem with broad applications in bioinformatics, digital libraries, e-commerce, and so on. Previous studies strongly suggested that pattern-growth methods are efficient in frequent pattern mining. In this paper, we systematically develop pattern-growth methods for mining frequent tree patterns. Two algorithms, Chopper and XSpanner, are devised. An extensive performance study shows that the two newly developed algorithms outperform TreeMinerV [13], one of the fastest methods proposed before, in mining large databases. Furthermore, XSpanner is substantially faster than Chopper in many cases.

Chen Wang, Mingsheng Hong, Jian Pei, Haofeng Zhou, Wei Wang, Baile Shi
Mining Association Rules from Structural Deltas of Historical XML Documents

Previous work on XML association rule mining focuses on mining the data existing in XML documents at a certain time point. However, due to the dynamic nature of online information, an XML document typically evolves over time. Knowledge obtained from mining the evolution of an XML document would be useful in a wide range of applications, such as XML indexing and XML clustering. In this paper, we propose to mine a novel type of association rule from a sequence of changes to XML structure, which we call an XML Structural Delta Association Rule (XSD-AR). We formulate the problem of XSD-AR mining by considering both the frequency and the degree of changes to XML structure. An algorithm derived from FP-growth, together with its optimization strategy, is developed for the problem. Preliminary experimental results show that our algorithm is efficient and scalable at discovering a complete set of XSD-ARs.

Ling Chen, Sourav S. Bhowmick, Liang-Tien Chia
Data Mining Proxy: Serving Large Number of Users for Efficient Frequent Itemset Mining

Data mining has attracted a lot of research effort during the past decade. However, little work has been reported on supporting a large number of users with similar data mining tasks. In this paper, we present a data mining proxy approach that provides basic services so that the overall performance of the system can be maximized for frequent itemset mining. Our data mining proxy is designed to respond quickly to a user's request by constructing the required tree using both trees in the proxy that were previously built for other users' requests and trees stored on disk that are pre-computed from transactional databases. We define a set of basic operations on pattern trees, including tree projection and tree merge. Our performance study indicates that the data mining proxy significantly reduces the I/O cost and CPU cost of constructing trees, and that the cost of frequent pattern mining with the constructed trees can be minimized.

Zhiheng Li, Jeffrey Xu Yu, Hongjun Lu, Yabo Xu, Guimei Liu

Session 4C: Novel Algorithms (II)

Formal Approach and Automated Tool for Translating ER Schemata into OWL Ontologies

Ontologies play a key role in creating machine-processable Web content to enable the Semantic Web. Extracting domain knowledge from database schemata can profitably support ontology development and, in turn, semantic markup of the instance data with the ontologies. The Entity-Relationship (ER) model is an industrial standard for conceptually modeling databases. This paper presents a formal approach and an automated tool for translating ER schemata into Web ontologies in the OWL Web Ontology Language. The tool first reads in an XML-coded ER schema produced with ER CASE tools such as PowerDesigner. Following predefined knowledge-preserving mapping rules from ER schemata to OWL DL (a sublanguage of OWL) ontologies, it then automatically translates the schema into an ontology in both the abstract syntax and the RDF/XML syntax for OWL DL. Case studies show that the approach is feasible and the tool is efficient, even for large-scale ER schemata.

Zhuoming Xu, Xiao Cao, Yisheng Dong, Wenping Su
Separating Structure from Interestingness

Condensed representations of pattern collections have been recognized to be important building blocks of inductive databases, a promising theoretical framework for data mining, and recently they have been studied actively. However, there has not been much research on how condensed representations should actually be represented. In this paper we propose a general approach to building condensed representations of pattern collections. The approach is based on separating the structure of the pattern collection from the interestingness values of the patterns. We also study the concrete case of representing the frequent sets and their (approximate) frequencies following this approach: we discuss the trade-offs in representing the frequent sets by the maximal frequent sets, the minimal infrequent sets and their combinations, and investigate the problem of approximating the frequencies from samples, giving new upper bounds on sample complexity based on frequent closed sets and describing how convex optimization can be used to improve and score the obtained samples.

Taneli Mielikäinen
Exploiting Recurring Usage Patterns to Enhance Filesystem and Memory Subsystem Performance

In many cases, normal uses of a system form patterns that will repeat. The most common patterns can be collected into a prediction model which essentially predicts that usage patterns common in the past will occur again in the future. Systems can then use the prediction models to provide advance notice to their implementations about how they are likely to be used in the near future. This technique creates opportunities to enhance system performance, since implementations can be better prepared to handle upcoming usage. The key component of our system is the ability to intelligently learn about system trends by tracking file system and memory system activity patterns. The usage data that is tracked can be subsequently queried and visualized. More importantly, this data can also be mined for intelligent qualitative and quantitative system enhancements, including predictive file prefetching, selective file compression, and application-driven adaptive memory allocation. We conduct an in-depth performance evaluation to demonstrate the potential benefits of the proposed system.

Benjamin Rutt, Srinivasan Parthasarathy

Session 4D: Multimedia Mining

Automatic Text Extraction for Content-Based Image Indexing

This paper proposes an automatic text extraction method using neural networks. Automatic text extraction is a crucial stage for multimedia data mining. We present an artificial neural network (ANN)-based approach for text extraction in complex images, which combines ANNs with non-negative matrix factorization (NMF)-based filtering. An automatically constructed ANN-based classifier can increase recall rates for complex images with a small amount of user intervention. NMF-based filtering enhances the precision rate without affecting overall performance. As a result, the combination of the two learning mechanisms leads to text extraction that is not only robust but also efficient.

Keechul Jung, Eun Yi Kim
Peculiarity Oriented Analysis in Multi-people Tracking Images

In places where many people gather, we may find, from a security viewpoint, a suspicious person who behaves differently from others. In other words, a person who takes a peculiar action is suspicious. In this paper, we describe an application of our peculiarity oriented mining approach for analysing image sequences that track multiple walking people. A measure of peculiarity, called the peculiarity factor, is investigated theoretically. The usefulness of our approach is verified by experimental results.

Muneaki Ohshima, Ning Zhong, Y. Y. Yao, Shinichi Murata
AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases

For discovering hidden (latent) variables in real-world, non-Gaussian data streams or an n-dimensional cloud of data points, SVD suffers from its orthogonality constraint. Our proposed method, "AutoSplit", finds features that are mutually independent and is able to discover non-orthogonal features. Thus, it (a) finds more meaningful hidden variables and features, (b) can easily lead to clustering and segmentation, (c) surprisingly scales linearly with the database size, and (d) can also operate in an on-line, single-pass mode. We also propose "Clustering-AutoSplit", which extends the feature discovery to multiple feature/basis sets and leads to clean clustering. Experiments on multiple real-world data sets show that our method meets all the properties above, outperforming the state-of-the-art SVD.

Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos, Masafumi Hamamoto

Session 5A: Text Mining and Web Mining (II)

Semantic Sequence Kin: A Method of Document Copy Detection

The string matching and global word frequency models are two basic models of document copy detection, although both are unsatisfactory in some respects. The String Kernel (SK) and Word Sequence Kernel (WSK) can map string pairs directly into a new feature space in which the data is linearly separable. This idea inspires the Semantic Sequence Kin (SSK), which we apply to document copy detection. SK and WSK only take into account the gap between the first word/term and the last word/term, which is not good for plagiarism detection. SSK considers each common word's position information so as to detect plagiarism at a fine granularity. SSK is based on semantic density, which is in essence local word frequency information. We believe these measures greatly diminish the noise of rewording. We test SSK on a small corpus with several common copy types. The results show that SSK is excellent for detecting non-rewording plagiarism and remains valid even if documents are reworded to some extent.

Jun-Peng Bao, Jun-Yi Shen, Xiao-Dong Liu, Hai-Yan Liu, Xiao-Di Zhang
Extracting Citation Metadata from Online Publication Lists Using BLAST

Scientific research reports usually contain a list of citations to previous related work. An automatic citation tool is therefore an essential component of a digital library of scientific literature. Due to variations in formats, it is difficult to automatically transform semi-structured citation data into structured citations. Some digital library institutes, like ResearchIndex (CiteSeer) or OpCit, have attempted automatic citation parsing. In order to recognize the metadata, e.g., authors, title, journal, etc., of a citation string, we present a new methodology based on a protein sequence alignment tool. We also develop a template generating system to transform known semi-structured citation strings into protein sequences. These protein sequences are then saved as templates in a database. A new semi-structured citation string is likewise translated into a protein sequence. We then use BLAST (Basic Local Alignment Search Tool), a sequence alignment tool, to find the template most similar to the new protein sequence in the previously constructed template database, and parse the metadata of the citation string according to that template. In our experiment, 2,500 templates are generated by our template generating system. Parsing all of these 2,500 citations with our system, we obtain an 89% precision rate. Using the same template database to train ParaTools, a 79% precision rate is obtained. Note that the original ParaTools with its default template database, which contains about 400 templates, obtains only a 30% precision rate.

I-Ane Huang, Jan-Ming Ho, Hung-Yu Kao, Wen-Chang Lin
Mining of Web-Page Visiting Patterns with Continuous-Time Markov Models

This paper presents a new prediction model for predicting when an online customer leaves the current page and which Web page the customer will visit next. The model can also forecast the total number of visits to a given Web page by all incoming users at the same time. The prediction technique can be used as a component in many Web-based applications. The prediction model regards a Web browsing session as a continuous-time Markov process whose transition probability matrix can be computed from Web log data using Kolmogorov's backward equations. The model is tested against real Web-log data, and the scalability and accuracy of our method are analyzed.

Qiming Huang, Qiang Yang, Joshua Zhexue Huang, Michael K. Ng
Discovering Ordered Tree Patterns from XML Queries

Recently, several studies have investigated the discovery of frequent XML query patterns using frequent structure mining techniques. All of these works ignore the order properties of XML queries and are therefore limited in their effectiveness. In this paper, we consider the discovery of ordered query patterns and propose an algorithm for ordered query pattern mining. Experiments show that our method is efficient.

Yi Chen
Predicting Web Requests Efficiently Using a Probability Model

As the World Wide Web grows rapidly and users’ browsing experiences increasingly need to be personalized, the problem of predicting a user’s behavior on a web site has become important. We present a probability model that utilizes path profiles of users from web logs to predict the user’s future requests. Each of the user’s probable next requests is given a conditional probability value, calculated according to the function we present. Our model can give several predictions ranked by their probability values instead of just one, thus increasing recommending ability. The experiments show that our algorithm and model perform well. The result can potentially be applied to a wide range of applications on the web.
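The general shape of such a path-profile predictor, counting next-page frequencies conditioned on the recent path and returning a ranked list of candidates, can be sketched as follows. The first-order context and the ranking function are simplifying assumptions, not the paper's exact probability function:

```python
from collections import defaultdict

def build_model(sessions, order=1):
    """Count next-page frequencies conditioned on the last `order` pages of each session."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for i in range(order, len(session)):
            context = tuple(session[i - order:i])
            counts[context][session[i]] += 1
    return counts

def predict(counts, context, top_k=3):
    """Return up to top_k next-page candidates ranked by conditional probability."""
    successors = counts.get(tuple(context), {})
    total = sum(successors.values())
    ranked = sorted(successors.items(), key=lambda kv: kv[1], reverse=True)
    return [(page, c / total) for page, c in ranked[:top_k]]
```

Returning several ranked candidates rather than a single guess is what gives the model its recommending ability.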

Shanchan Wu, Wenyuan Wang

Session 5B: Statistical Methods, Sequential Data Mining, and Time Series Mining

CCMine: Efficient Mining of Confidence-Closed Correlated Patterns

Correlated pattern mining has recently become increasingly important as an alternative or a complement to association rule mining. Though correlated pattern mining discloses the correlation relationships among data objects and significantly reduces the number of patterns produced by association mining, it still generates quite a large number of patterns. In this paper, we propose closed correlated pattern mining to reduce the number of correlated patterns produced without information loss. We first propose the new notion of confidence-closed correlated patterns, and then present an efficient algorithm, called CCMine, for mining those patterns. Our performance study shows that confidence-closed pattern mining reduces the number of patterns by at least an order of magnitude. It also shows that CCMine outperforms a simple method making use of the traditional closed pattern miner. We conclude that confidence-closed pattern mining is a valuable approach to condensing correlated patterns.

Won-Young Kim, Young-Koo Lee, Jiawei Han
A Conditional Probability Distribution-Based Dissimilarity Measure for Categorical Data

Measuring the similarity between objects described by categorical attributes is a difficult task because no relations between categorical values can be mathematically specified or easily established. In the literature, most similarity (dissimilarity) measures for categorical data assess a value pair simply by whether or not the two values are identical; the similarity (dissimilarity) of a non-identical value pair is considered 0 (1). In this paper, we introduce a dissimilarity measure for categorical data that imposes association relations between non-identical value pairs of an attribute based on their relations with other attributes. The key idea is to measure the similarity between two values of a categorical attribute by the similarities of the conditional probability distributions of the other attributes conditioned on these two values. Experiments with a nearest neighbor algorithm demonstrate the merits of our proposal on real-life data sets.
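The key idea can be sketched directly: compare two values of an attribute by comparing the conditional distributions they induce over the other attributes. The use of total-variation distance to compare those distributions is an illustrative choice, not necessarily the measure used in the paper:

```python
from collections import Counter

def cond_dist(rows, attr, value, other):
    """Estimate P(other | attr = value) from the data rows."""
    matching = [r[other] for r in rows if r[attr] == value]
    counts = Counter(matching)
    total = len(matching)
    return {v: c / total for v, c in counts.items()}

def dissimilarity(rows, attr, v1, v2, others):
    """Average total-variation distance between the conditional distributions
    of the other attributes, conditioned on attr = v1 versus attr = v2."""
    d = 0.0
    for other in others:
        p = cond_dist(rows, attr, v1, other)
        q = cond_dist(rows, attr, v2, other)
        support = set(p) | set(q)
        d += 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)
    return d / len(others)
```

Two values that always co-occur with the same values of the other attributes get dissimilarity 0 even though they are not identical, which is exactly what the plain identical/non-identical measure cannot express.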

Le Si Quang, Ho Tu Bao
Learning Hidden Markov Model Topology Based on KL Divergence for Information Extraction

To locate information embedded in documents, information extraction systems based on rule-based pattern matching have long been used. To further improve extraction generalization, the hidden Markov model (HMM) has recently been adopted for modeling temporal variations of the target patterns, with promising results. In this paper, a state-merging method is adopted for learning the HMM topology using a localized Kullback-Leibler (KL) divergence. The proposed system has been applied to a set of domain-specific job advertisements, and preliminary experiments show promising results.

Kwok-Chung Au, Kwok-Wai Cheung
A Non-parametric Wavelet Feature Extractor for Time Series Classification

Many representation schemes for time series have been proposed, and most of them require predefined parameters. In the case of classification, accuracy is considerably influenced by these predefined parameters, and users usually have difficulty determining them. The aim of this paper is to develop a representation method for time series that can automatically select the parameters for the classification task. To this end, we exploit the multi-scale property of wavelet decomposition, which allows us to automatically extract features and achieve high classification accuracy. The two main contributions of this work are: (1) selecting features of a representation that is robust to time series shifts, and (2) choosing appropriate features, namely, features in an appropriate wavelet decomposition scale according to the concentration of wavelet coefficients within that scale.
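The multi-scale machinery involved can be sketched with a plain Haar decomposition plus a simple "pick the scale where coefficient energy concentrates" rule. This is a generic illustration of the idea, using an assumed per-coefficient energy criterion rather than the paper's exact selection rule:

```python
def haar_decompose(series):
    """Full Haar decomposition of a length-2^k series.

    Returns (final_approximation, details) where details[0] is the finest
    scale and details[-1] the coarsest."""
    approx = list(series)
    details = []
    while len(approx) > 1:
        nxt, det = [], []
        for a, b in zip(approx[0::2], approx[1::2]):
            nxt.append((a + b) / 2)  # local average (approximation)
            det.append((a - b) / 2)  # local difference (detail coefficient)
        details.append(det)
        approx = nxt
    return approx, details

def best_scale(details):
    """Pick the scale whose detail coefficients carry the most energy per coefficient."""
    energies = [sum(c * c for c in d) / len(d) for d in details]
    return max(range(len(energies)), key=lambda i: energies[i])
```

For a series that is flat except for one coarse step, all the energy lands in the coarsest detail coefficients, so that scale is chosen automatically, with no user-supplied parameter.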

Hui Zhang, Tu Bao Ho, Mao Song Lin
Rules Discovery from Cross-Sectional Short-Length Time Series

Cross-sectional time series data is a group of multivariate time series each of which has the same set of variables; usually its length is short. Such data occurs frequently in business, economics, science, and so on. We want to mine rules from it, such as “GDP rises if Investment rises in most provinces” in economic analysis. Rule mining is divided into two steps: event distilling and association rule mining. This paper concentrates on the former and applies Apriori to the latter. The paper defines event types based on relative differences. Considering the cross-sectional property, we introduce an ANOVA-based event-distilling method that can extract proper events from cross-sectional time series. Finally, experiments on synthetic and real-life data show the advantage of the ANOVA-based event-distilling method over the separate event-distilling method, as well as its influential factors.
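The event-distilling step, mapping relative differences between consecutive observations to symbolic event types that Apriori can then mine, can be sketched as follows. The threshold value and the three event labels are illustrative assumptions; the paper's ANOVA-based method pools evidence across the cross-section rather than thresholding each series separately:

```python
def distill_events(series, threshold=0.01):
    """Map consecutive relative differences to symbolic event types.

    A relative change above +threshold is a "rise", below -threshold a
    "fall", and anything in between a "flat" event."""
    events = []
    for prev, cur in zip(series, series[1:]):
        rel = (cur - prev) / abs(prev)
        if rel > threshold:
            events.append("rise")
        elif rel < -threshold:
            events.append("fall")
        else:
            events.append("flat")
    return events
```

Each time step then yields a transaction of events (one per variable per region), over which standard association rule mining runs.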

Kedong Luo, Jianmin Wang, Jiaguang Sun

Session 5C: Novel Algorithms (III)

Constraint-Based Mining of Formal Concepts in Transactional Data

We are designing new data mining techniques on boolean contexts to identify a priori interesting concepts, i.e., closed sets of objects (or transactions) and associated closed sets of attributes (or items). We propose a new algorithm, D-Miner, for mining concepts under constraints. We provide an experimental comparison with previous algorithms and an application to an original microarray dataset for which D-Miner is the only algorithm that can mine all the concepts.

Jérémy Besson, Céline Robardet, Jean-François Boulicaut
Towards Optimizing Conjunctive Inductive Queries

Inductive queries are queries to an inductive database that generate a set of patterns in a data mining context. Inductive querying poses new challenges to database and data mining technology. We study conjunctive inductive queries, which are queries that can be written as a conjunction of a monotonic and an anti-monotonic subquery. We introduce the conjunctive inductive query optimization problem, which is concerned with minimizing the cost of computing the answer set to a conjunctive query. In the optimization problem, it is assumed that there are costs c_a and c_m associated with evaluating a pattern w.r.t. a monotonic and an anti-monotonic subquery, respectively. The aim is then to minimize the total cost needed to compute all solutions to the query. Second, we present an algorithm that optimizes conjunctive inductive queries in the pattern domain of strings and evaluate it on a challenging data set in computational biology.

Johannes Fischer, Luc De Raedt
Febrl – A Parallel Open Source Data Linkage System

In many data mining projects, information from multiple data sources needs to be integrated, combined, or linked to allow more detailed analysis. The aim of such linkage is to merge all records relating to the same entity, such as a patient or a customer. Most of the time the linkage process is complicated by the lack of a common unique entity identifier, and thus becomes non-trivial. Linking today’s large data collections is increasingly difficult using traditional linkage techniques. In this paper we present an innovative data linkage system called Febrl, which includes a new probabilistic approach for improved data cleaning and standardisation, innovative indexing methods, a parallelisation approach implemented transparently to the user, and a data set generator that allows the random creation of records containing names and addresses. Implemented as open source software, Febrl is an ideal experimental platform for new linkage algorithms and techniques.

Peter Christen, Tim Churches, Markus Hegland
A General Coding Method for Error-Correcting Output Codes

The ECOC approach can be used to reduce a multiclass categorization problem to multiple binary problems and to improve the generalization of classifiers. Yet there is no single coding method that can generate ECOCs suitable for any number of classes. This paper provides a search-coding method that associates nonnegative integers with binary strings. Given any number of classes and an expected minimum Hamming distance, the method can find a satisfactory output code by searching an integer range. Experimental results show that, as a general coding method, the search-coding method can efficiently improve the generalization of both stable and unstable classifiers.
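The integer-search idea can be sketched as a greedy scan: walk the nonnegative integers in order, keeping each one whose binary string is at least the required Hamming distance from every codeword kept so far. The greedy strategy below is an assumed simplification of the paper's search-coding method:

```python
def search_code(n_classes, bits, min_dist):
    """Greedily scan integers 0 .. 2^bits - 1 for n_classes codewords whose
    pairwise Hamming distance is at least min_dist.

    Returns the codewords as bit strings, or None if no such code exists
    at this code length."""
    codewords = []
    for candidate in range(2 ** bits):
        # XOR exposes differing bit positions; popcount gives the Hamming distance.
        if all(bin(candidate ^ c).count("1") >= min_dist for c in codewords):
            codewords.append(candidate)
            if len(codewords) == n_classes:
                return [format(c, f"0{bits}b") for c in codewords]
    return None
```

Each bit column of the resulting code defines one binary classification problem; the minimum distance controls how many binary-classifier errors the code can absorb when decoding.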

Yan-huang Jiang, Qiang-li Zhao, Xue-jun Yang
Discovering Partial Periodic Patterns in Discrete Data Sequences

The problem of partial periodic pattern mining in a discrete data sequence is to find subsequences that appear periodically and frequently in the data sequence. Two essential subproblems are the efficient mining of frequent patterns and the automatic discovery of the periods that correspond to these patterns. Previous methods for this problem in event sequence databases assume that the periods are given in advance or require additional database scans to compute the periods that define candidate patterns. In this work, we propose a new structure, the abbreviated list table (ALT), and several efficient algorithms that compute the periods and the patterns in only a small number of passes over the data. A performance study is presented to demonstrate the effectiveness and efficiency of our method.

Huiping Cao, David W. Cheung, Nikos Mamoulis

Session 5D: Biomedical Mining

Conceptual Mining of Large Administrative Health Data

Health databases are characterised by large numbers of records and attributes, and mild density. This encourages data miners to use methodologies that are more sensitive to the specifics of the health industry. For conceptual mining, the classic pattern-growth methods are found to be limited due to their heavy resource consumption. As an alternative, we propose a pattern splitting technique which delivers knowledge about the data as complete and compact as the pattern-growth techniques, but is found to be more efficient.

Tatiana Semenova, Markus Hegland, Warwick Graco, Graham Williams
A Semi-automatic System for Tagging Specialized Corpora

In this paper, we treat the problem of the grammatical tagging of non-annotated specialized corpora. Existing taggers are trained on general-language corpora and give inconsistent results on specialized texts, such as technical and scientific ones. To learn rules adapted to a specialized field, the usual approach manually labels a large corpus of that field, which is extremely time-consuming. We propose here a semi-automatic approach for tagging specialized corpora. ETIQ, the new tagger we are building, makes it possible to correct the rule base obtained by Brill’s tagger and to adapt it to a specialized corpus. The user visualizes an initial, basic tagging and corrects it either by extending Brill’s lexicon or by inserting specialized lexical and contextual rules. The inserted rules are richer and more flexible than Brill’s. To help the expert in this task, we designed an inductive algorithm biased by the “correct” knowledge the expert acquired beforehand. By using machine learning techniques and enabling the expert to incorporate domain knowledge in an interactive and friendly way, we improve the tagging of specialized corpora. Our approach has been applied to a corpus of molecular biology.

Ahmed Amrani, Yves Kodratoff, Oriane Matte-Tailliez
A Tree-Based Approach to the Discovery of Diagnostic Biomarkers for Ovarian Cancer

Computational diagnosis of cancer is a classification problem, and it places two special requirements on a learning algorithm: perfect accuracy and a small number of features used in the classifier. This paper presents our results on an ovarian cancer data set. The data set is described by 15154 features and consists of 253 samples, each corresponding to a woman who either suffers from ovarian cancer or does not. The raw data is generated by mass spectrometry technology, which measures the intensities of the 15154 protein or peptide features in a blood sample from each woman. The purpose is to identify a small subset of the features that can be used as biomarkers to separate the two classes of samples with high accuracy; the identified features can then potentially be used in routine clinical diagnosis to replace labour-intensive and expensive conventional diagnostic methods. Our new tree-based method achieves perfect 100% accuracy in 10-fold cross validation on this data set while directly outputting a small set of biomarkers. We then explain why support vector machines, naive Bayes, and k-nearest neighbour cannot fulfill this purpose. This study also aims to facilitate communication between contemporary cancer research and data mining techniques.

Jinyan Li, Kotagiri Ramamohanarao
A Novel Parameter-Less Clustering Method for Mining Gene Expression Data

Clustering analysis has been applied in a wide variety of fields. In recent years, it has become a valuable and useful technique for in-silico analysis of microarray or gene expression data. Although a number of clustering methods have been proposed, they have difficulty meeting the requirements of automation, high quality, and high efficiency at the same time. In this paper, we explore the integration of clustering methods and validation techniques. We propose a novel, parameter-less, and efficient clustering algorithm, namely CST, which is suitable for analysis of gene expression data. Experimental evaluation shows that CST substantially outperforms other clustering methods in terms of clustering quality, efficiency, and automation on various types of datasets.

Vincent Shin-Mu Tseng, Ching-Pin Kao
Extracting and Explaining Biological Knowledge in Microarray Data

This paper describes a method of clustering lists of genes mined from a microarray dataset using functional information from the Gene Ontology. The method uses relationships between terms in the ontology both to build clusters and to extract meaningful cluster descriptions. The approach is general and may be applied to assist explanation of other datasets associated with ontologies.

Paul J. Kennedy, Simeon J. Simoff, David Skillicorn, Daniel Catchpoole
Further Applications of a Particle Visualization Framework

Previous work introduced a 3D particle visualization framework that views each data point as a particle affected by gravitational forces, and showed the use of this tool for visualizing cluster results and anomaly detection. This paper generalizes the particle visualization framework and demonstrates further applications, such as determining the number of clusters and identifying clustering algorithm biases. We do not claim that visualization by itself is sufficient to answer these questions; the methods here are best used in combination with other visual and analytic techniques. We have made our visualization software, which produces standard VRML, available to allow its use for these and other applications.

Ke Yin, Ian Davidson
Backmatter
Metadata
Title
Advances in Knowledge Discovery and Data Mining
edited by
Honghua Dai
Ramakrishnan Srikant
Chengqi Zhang
Copyright year
2004
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-24775-3
Print ISBN
978-3-540-22064-0
DOI
https://doi.org/10.1007/b97861