
2009 | Book

Advances in Knowledge Discovery and Data Mining

13th Pacific-Asia Conference, PAKDD 2009 Bangkok, Thailand, April 27-30, 2009 Proceedings

Edited by: Thanaruk Theeramunkong, Boonserm Kijsirikul, Nick Cercone, Tu-Bao Ho

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science

About this book

This book constitutes the refereed proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2009, held in Bangkok, Thailand, in April 2009. The 39 revised full papers and 73 revised short papers presented together with 3 keynote talks were carefully reviewed and selected from 338 submissions. The papers present new ideas, original research results, and practical development experiences from all KDD-related areas including data mining, data warehousing, machine learning, databases, statistics, knowledge acquisition, automatic scientific discovery, data visualization, causal induction, and knowledge-based systems.

Table of contents

Frontmatter

Keynote Speeches

KDD for BSN – Towards the Future of Pervasive Sensing

With increasing sophistication and miniaturisation of wireless sensor technologies, integrated microsensors no more than a few millimetres in size, combined with onboard processing and wireless data transfer, have become a reality. The provision of ubiquitous and pervasive monitoring of physical, physiological, and biochemical parameters in any environment, without activity restriction and behaviour modification, is the primary motivation of Body Sensor Network (BSN) research. The general scope of BSN is broad, ranging from monitoring of patients with chronic disease and care for the elderly, to general well-being monitoring and performance evaluation in sports. It also has important applications in gaming and human-computer interaction. One of the significant challenges of BSN is the provision of context-aware sensing with effective multi-sensor fusion, data inferencing, mining, and trend analysis. Other research issues currently being addressed include novel miniaturised bioelectrical, biochemical, biophysical, and mechanical sensors; low-power RF transceiver, energy scavenging, and battery technologies; biocompatibility, materials, system integration and miniaturisation; autonomic sensor networks; and light-weight communication protocols and standards. This talk will address some of the key research topics and current advances in BSN, particularly those related to the KDD community. It will also cover the use of bio-inspired design for providing distributed inferencing and ultra-low-power on-node processing, demonstrating how this alternative paradigm, based on the strategies used by biological systems, can be used to deal with the challenges of scale, complexity, heterogeneity, and uncertainty involved in pervasive sensing.

Guang-Zhong Yang
Finding Hidden Structures in Relational Databases

Relational database management systems have been widely used for decades. An important research issue is to find hidden structural information in large relational databases. By hidden structural information we mean information that cannot easily be found using the traditional query language SQL. In this talk, we discuss how to find hidden structural information in a relational database by viewing it as a large directed graph where nodes represent tuples and edges represent foreign-key references between tuples. We discuss how to find trees and communities in such a large graph for user-given keywords. We also discuss how to find frequent and additional keywords associated with the structures identified in a relational database using SQL.

Jeffrey Xu Yu
The Future of Search: An Online Content Perspective

Nonprofessional creation of public online content has outstripped professional content creation of all forms, both online and offline. And two orders of magnitude more content is created daily to flow through social networks, with as much as two more orders of magnitude still to come as user engagement increases. Content is diversifying in creation, consumption, and nature. Web search engines provide rapid targeted access to this page content, and increasingly to other information such as news articles, weather, movie showtimes, and product and restaurant listings. In this talk, I’ll discuss these trends from the standpoint of the search engine, I’ll cover some research results in this area, and I’ll close with some challenges for the future.

Andrew Tomkins

Regular Papers

DTU: A Decision Tree for Uncertain Data

The decision tree is a widely used data classification technique. This paper proposes a decision-tree-based classification method for uncertain data. Data uncertainty is common in emerging applications such as sensor networks, moving object databases, and medical and biological databases. It can be caused by various factors, including limited measurement precision, outdated sources, sensor errors, network latency, and transmission problems. In this paper, we enhance traditional decision tree algorithms by extending measures, including entropy and information gain, to account for the uncertainty interval and probability distribution function of the data. Our algorithm can handle both certain and uncertain datasets. The experiments demonstrate the utility and robustness of the proposed algorithm as well as its satisfactory prediction accuracy.
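
As a concrete illustration of the extended measures, the following minimal Python sketch (our rendering of the idea, not the authors' DTU code) computes information gain for a numeric split when each attribute value is an uncertainty interval with an assumed uniform probability distribution, so that a tuple contributes fractional probability mass to each branch:

from math import log2

def entropy(mass):
    total = sum(mass.values())
    return -sum((m / total) * log2(m / total) for m in mass.values() if m > 0)

def uncertain_info_gain(tuples, split):
    # tuples: list of ((lo, hi), label); a uniform pdf on [lo, hi] is assumed
    root, left, right = {}, {}, {}
    for (lo, hi), label in tuples:
        # probability mass of the interval falling left of the split point
        p_left = 1.0 if hi <= split else 0.0 if lo >= split else (split - lo) / (hi - lo)
        root[label] = root.get(label, 0.0) + 1.0
        left[label] = left.get(label, 0.0) + p_left
        right[label] = right.get(label, 0.0) + (1.0 - p_left)
    n, nl, nr = len(tuples), sum(left.values()), sum(right.values())
    gain = entropy(root)
    if nl: gain -= (nl / n) * entropy(left)
    if nr: gain -= (nr / n) * entropy(right)
    return gain

print(uncertain_info_gain([((0, 4), 'a'), ((3, 9), 'b'), ((6, 8), 'b')], split=5.0))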

Biao Qin, Yuni Xia, Fang Li
Efficient Privacy-Preserving Link Discovery

Link discovery is the process of identifying associations among different entities in a complex network structure. These associations may represent any interaction among entities, for example between people or even bank accounts. The need for link discovery arises in many applications, including law enforcement, counter-terrorism, social network analysis, intrusion detection, and fraud detection. Given the sensitive nature of the information that link discovery can reveal, privacy is a major concern from the perspective of both individuals and organizations. For example, in the context of financial fraud detection, linking transactions may reveal sensitive information about other individuals not involved in any fraud. It is known that link discovery can be done in a privacy-preserving manner by securely finding the transitive closure of a graph. We propose two very efficient techniques to find the transitive closure securely. The two protocols have varying levels of security and performance. We analyze the performance and usability of the proposed approach in terms of both analytical and experimental results.
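
For reference, the structure both protocols compute is the ordinary transitive closure; a plain, non-secure Warshall-style sketch in Python is shown below (the paper's contribution is computing this securely, without revealing the underlying edges to either party):

def transitive_closure(adj):
    # adj: dict mapping each node to the set of nodes it links to directly
    nodes = list(adj)
    reach = {u: set(adj[u]) for u in nodes}
    for k in nodes:                      # allow paths through intermediate node k
        for u in nodes:
            if k in reach[u]:
                reach[u] |= reach[k]
    return reach

links = {'acct1': {'acct2'}, 'acct2': {'acct3'}, 'acct3': set()}
print(transitive_closure(links)['acct1'])   # {'acct2', 'acct3'}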

Xiaoyun He, Jaideep Vaidya, Basit Shafiq, Nabil Adam, Evimaria Terzi, Tyrone Grandison
On Link Privacy in Randomizing Social Networks

Many applications of social networks require relationship anonymity due to the sensitive, stigmatizing, or confidential nature of the relationships. Recent work showed that the simple technique of anonymizing graphs by replacing the identifying information of the nodes with random ids does not guarantee privacy, since the identities of the nodes can be seriously jeopardized by applying subgraph queries. In this paper, we investigate how well an edge-based graph randomization approach can protect sensitive links. We show via theoretical studies and empirical evaluations that various similarity measures can be exploited by attackers to significantly improve the confidence and accuracy of their predictions of sensitive links between nodes with high similarity values.

Xiaowei Ying, Xintao Wu
Sentence-Level Novelty Detection in English and Malay

Novelty detection (ND) is a process for identifying novel information in an incoming stream of documents. Although there are many studies of ND on English-language documents, to the best of our knowledge none has been reported on Malay documents. This issue is important because many documents mix English and Malay. This paper examines multilingual sentence-level ND in English and Malay documents using TREC 2003 and TREC 2004 Novelty Track data. We describe the text processing for multilingual ND, which consists of language translation, stop-word removal, automatic stemming, and novel sentence detection. We compare the results of sentence-level ND on English and Malay documents and find that they are fairly similar. Therefore, after preprocessing is performed on Malay documents, our ND algorithm appears robust in detecting novel sentences, and can possibly be extended to other alphabet-based languages.

Agus T. Kwee, Flora S. Tsai, Wenyin Tang
Text Categorization Using Fuzzy Proximal SVM and Distributional Clustering of Words

Text Categorization (TC) remains a potential application area for linear support vector machines (SVMs). Among the numerous linear SVM formulations, we bring forward linear PSVM together with the recently proposed distributional clustering (DC) of words to realize its potential in the TC realm. DC has been presented as an efficient alternative to the feature selection conventionally used in TC. It has been shown that DC together with a linear SVM drastically brings down the dimensionality of text documents without any compromise in classification performance. In this paper we use linear PSVM and its extension Fuzzy PSVM (FPSVM) together with DC for TC. We present experimental results comparing PSVM/FPSVM with linear SVM-light and SVM-lin on the popular WebKB text corpus. Through numerous experiments on subsets of WebKB, we reveal the merits of PSVM and FPSVM over other linear SVMs.

Mani Arun Kumar, Madan Gopal
Cool Blog Classification from Positive and Unlabeled Examples

We address the problem of cool blog classification using only positive and unlabeled examples. We propose an algorithm, called PUB, that exploits the information in unlabeled data together with the positive examples to predict whether unseen blogs are cool or not. The algorithm uses a weighting technique to assign a weight to each unlabeled example that is assumed to be negative in the training set, and a bagging technique to obtain several weak classifiers, each of which is learned on a small training set generated by randomly sampling some positive examples and some unlabeled examples that are assumed to be negative. Each weak classifier must achieve an admissible performance measure evaluated on the whole set of labeled positive examples, or have the best performance measure within an iteration limit. A majority vote over all weak classifiers is employed to predict the class of a test instance. The experimental results show that PUB can correctly predict the classes of unseen blogs in situations that cannot be handled by traditional learning from positive and negative examples. The results also show that PUB outperforms other algorithms for learning from positive and unlabeled examples in the task of cool blog classification.
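
A simplified sketch of the weighting-plus-bagging scheme described above (parameter names are ours; this is not the published PUB implementation):

import random
from sklearn.tree import DecisionTreeClassifier

def train_pub(positives, unlabeled, n_learners=15, neg_sample=100, neg_weight=0.5):
    learners = []
    for _ in range(n_learners):
        negs = random.sample(unlabeled, min(neg_sample, len(unlabeled)))
        X = positives + negs
        y = [1] * len(positives) + [0] * len(negs)
        # unlabeled examples are only assumed negative, so they get lower weight
        w = [1.0] * len(positives) + [neg_weight] * len(negs)
        learners.append(DecisionTreeClassifier(max_depth=3).fit(X, y, sample_weight=w))
    return learners

def predict_pub(learners, x):
    votes = sum(int(clf.predict([x])[0]) for clf in learners)
    return 1 if votes > len(learners) / 2 else 0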

Kritsada Sriphaew, Hiroya Takamura, Manabu Okumura
Thai Word Segmentation with Hidden Markov Model and Decision Tree

The Thai written language is one of the languages that do not have word boundaries. In order to discover the meaning of a document, all text must be separated into syllables, words, sentences, and paragraphs. This paper develops a novel method to segment Thai text by combining a non-dictionary-based technique with a dictionary-based technique. The method first applies Thai grammar rules to the text to identify syllables. A hidden Markov model is then used to merge possible syllables into words. The identified words are verified against a lexical dictionary, and a decision tree is employed to discover the words not identified by the lexical dictionary. Documents used in the litigation process of Thai court proceedings have been used in experiments. The segmented words obtained by the proposed method outperform those obtained by other existing methods.
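
The syllable-merging step can be pictured as decoding a two-state HMM (B = syllable begins a word, I = syllable continues the current word) with the Viterbi algorithm; the sketch below is a generic illustration with made-up probabilities, not the authors' trained model:

from math import log

STATES = ('B', 'I')
START = {'B': 1.0, 'I': 1e-12}                      # words start at a B syllable
TRANS = {'B': {'B': 0.6, 'I': 0.4}, 'I': {'B': 0.7, 'I': 0.3}}

def viterbi(syllables, emit):
    # emit(syllable, state) -> emission probability (assumed supplied by training)
    V = [{s: log(START[s]) + log(emit(syllables[0], s)) for s in STATES}]
    back = []
    for syl in syllables[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: V[-1][p] + log(TRANS[p][s]))
            row[s] = V[-1][prev] + log(TRANS[prev][s]) + log(emit(syl, s))
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    path = [max(STATES, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))        # e.g. ['B', 'I', 'I'] = one three-syllable word

print(viterbi(['com', 'pu', 'ter'], emit=lambda syl, s: 0.5))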

Poramin Bheganan, Richi Nayak, Yue Xu
An Efficient Method for Generating, Storing and Matching Features for Text Mining

Log-linear models have been widely used in text mining tasks because they can incorporate a large number of possibly correlated features. In text mining, these possibly correlated features are generated by conjunctions of features and are usually used with log-linear models to estimate robust conditional distributions. To avoid manual construction of feature conjunctions, we propose a new algorithmic framework called F-tree for automatically generating and storing conjunctions of features in text mining tasks. This compact graph-based data structure allows fast one-vs-all matching of features in the feature space, which is crucial for many text mining tasks. Based on this hierarchical data structure, we propose a systematic method for removing redundant features to further reduce memory usage and improve performance. We run large-scale experiments on three publicly available datasets and show that this automatic method can match the state-of-the-art performance achieved by manual construction of features.

Shing-Kit Chan, Wai Lam
Robust Graph Hyperparameter Learning for Graph Based Semi-supervised Classification

Graph-based semi-supervised learning has attracted much attention in recent years. Many successful methods rely on graph structure to propagate labels from labeled data to unlabeled data. Although the graph structure affects the performance of the system, only a few works address its construction. In this work, the graph structure is constructed by adjusting the hyperparameter controlling the connection weights between nodes. The idea is to learn hyperparameters for graphs that yield low leave-one-out prediction error on labeled data while retaining high stability of the prediction on unlabeled data. The problem is solved efficiently using a gradient-based optimization technique. Experimental results indicate that the proposed technique improves generalization error as well as the stability of the labeling function.

Krikamol Muandet, Sanparith Marukatat, Cholwich Nattee
Regularized Local Reconstruction for Clustering

Motivated by the local reconstruction approach to discovering low-dimensional structure in high-dimensional data, we propose a novel clustering algorithm that effectively utilizes local reconstruction information. We obtain the local reconstruction weights by minimizing the reconstruction error between each data point and the reconstruction from its neighbors. An entropy regularization term is incorporated into the reconstruction objective function so that the smoothness of the reconstruction weights can be explicitly controlled. The reconstruction weights are then used to obtain the clustering result by employing spectral clustering techniques. Experimental results on a number of datasets demonstrate that our algorithm performs well relative to other approaches, validating the effectiveness of our approach for clustering.
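
A minimal numpy sketch of the local reconstruction step (shown without the paper's entropy regularizer, which would be added to this objective): each point is approximated as an affine combination of its k nearest neighbours by solving a small regularized least-squares system, and the resulting weight matrix can then be fed to a spectral clustering routine.

import numpy as np

def reconstruction_weights(X, k=5, reg=1e-3):
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]          # k nearest neighbours, excluding the point
        Z = X[nbrs] - X[i]                     # neighbours centered on x_i
        G = Z @ Z.T                            # local Gram matrix
        G += reg * np.trace(G) * np.eye(k)     # regularize for numerical stability
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()               # reconstruction weights sum to one
    return W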

Jun Sun, Zhiyong Shen, Bai Su, Yidong Shen
Clustering with Lower Bound on Similarity

We propose a new method, called SimClus, for clustering with a lower bound on similarity. Instead of accepting k, the number of clusters to find, the alternative similarity-based approach imposes a lower bound on the similarity between an object and its corresponding cluster representative (with one representative per cluster). SimClus achieves an O(log n) approximation bound on the number of clusters, whereas for the best previous algorithm the bound can be as poor as O(n). Experiments on real and synthetic datasets show that our algorithm produces more than 40% fewer representative objects, yet offers the same or better clustering quality. We also propose a dynamic variant of the algorithm, which can be effectively used in an on-line setting.

Mohammad Al Hasan, Saeed Salem, Benjarath Pupacdi, Mohammed J. Zaki
Approximate Spectral Clustering

While spectral clustering has recently shown great promise, computational cost makes it infeasible for use with large data sets. To address this computational challenge, this paper considers the problem of approximate spectral clustering, which enables both the feasibility (of approximately clustering in very large and unloadable data sets) and acceleration (of clustering in loadable data sets), while maintaining acceptable accuracy. We examine and propose several schemes for approximate spectral grouping, and make an empirical comparison of those schemes in combination with several sampling strategies. Experimental results on several synthetic and real-world data sets show that approximate spectral clustering can achieve both the goals of feasibility and acceleration.

Liang Wang, Christopher Leckie, Kotagiri Ramamohanarao, James Bezdek
An Integration of Fuzzy Association Rules and WordNet for Document Clustering

With the rapid growth of text documents, document clustering has become one of the main techniques for organizing large numbers of documents into a small number of meaningful clusters. However, several challenges remain for document clustering, such as high dimensionality, scalability, accuracy, meaningful cluster labels, and extracting semantics from text. In order to improve the quality of document clustering results, we propose an effective Fuzzy Frequent Itemset-based Document Clustering (F²IDC) approach that combines fuzzy association rule mining with the background knowledge embedded in WordNet. A term hierarchy generated from WordNet is applied to discover fuzzy frequent itemsets as candidate cluster labels for grouping documents. We have conducted experiments to evaluate our approach on the Reuters-21578 dataset. The experimental results show that our proposed method outperforms FIHC, HFTC, and UPGMA in clustering accuracy.

Chun-Ling Chen, Frank S. C. Tseng, Tyne Liang
Nonlinear Data Analysis Using a New Hybrid Data Clustering Algorithm

Existing clustering algorithms, such as single-link clustering, k-means, CURE, and CSM, are designed to find clusters based on pre-defined parameters specified by users. These algorithms may be unsuccessful if the choice of parameters is inappropriate with respect to the data set being clustered. Most of these algorithms work very well for compact and hyperspherical clusters. In this paper, a new hybrid clustering algorithm called Self-Partition and Self-Merging (SPSM) is proposed. The SPSM algorithm partitions the input data set into several subclusters in the first phase and then removes noisy data in the second phase. In the third phase, the normal subclusters are continuously merged to form larger clusters based on inter-cluster and intra-cluster distance criteria. The experimental results show that the SPSM algorithm is very efficient at handling noisy data sets and at clustering data sets of arbitrary shapes and varying densities.

Ureerat Wattanachon, Jakkarin Suksawatchon, Chidchanok Lursinsap
A Polynomial-Delay Polynomial-Space Algorithm for Extracting Frequent Diamond Episodes from Event Sequences

In this paper, we study the problem of mining frequent diamond episodes efficiently from an input event sequence with a sliding window. Here, a diamond episode is of the form a → E → b, which means that every event of E follows an event a and is followed by an event b. We design a polynomial-delay and polynomial-space algorithm, PolyFreqDmd, that finds all of the frequent diamond episodes without duplicates from an event sequence in O(|Σ|²n) time per episode and in O(|Σ| + n) space, where Σ is the alphabet and n is the length of the event sequence. Finally, we give experimental results on artificial event sequences with several varying mining parameters to evaluate the efficiency of the algorithm.

Takashi Katoh, Hiroki Arimura, Kouichi Hirata
A Statistical Approach for Binary Vectors Modeling and Clustering

This paper presents an approach for binary feature selection. Our selection technique is based on a principled statistical model using a finite mixture of distributions. In contrast with classic feature selection algorithms that have been proposed in supervised settings, where training data are available and completely labeled, our approach is fully unsupervised. Through several applications, we find that our feature selection model improves clustering results.

Nizar Bouguila, Khalid Daoudi
Multi-resolution Boosting for Classification and Regression Problems

Various forms of boosting techniques have been popularly used in many data mining and machine learning applications. In spite of their great success, boosting algorithms still suffer from a few open problems that require closer investigation. The efficiency of any such ensemble technique significantly relies on the choice of the weak learners and the form of the loss function. In this paper, we propose a novel multi-resolution approach for choosing the weak learners during additive modeling. Our method applies insights from multi-resolution analysis and chooses the optimal learners at multiple resolutions during different iterations of the boosting algorithm. We demonstrate the advantages of this novel framework on classification tasks and show results on different real-world datasets obtained from the UCI machine learning repository. Though demonstrated specifically in the context of boosting algorithms, our framework can be easily accommodated in general additive modeling techniques.

Chandan K. Reddy, Jin-Hyeong Park
Interval Data Classification under Partial Information: A Chance-Constraint Approach

This paper presents a Chance-Constrained Programming approach for constructing maximum-margin classifiers which are robust to interval-valued uncertainty in training examples. The methodology ensures that uncertain examples are classified correctly with high probability by employing chance constraints. The main contribution of the paper is to pose the resulting optimization problem as a Second Order Cone Program by using large-deviation inequalities due to Bernstein. Apart from the support and mean of the uncertain examples, these Bernstein-based relaxations make no further assumptions on the underlying uncertainty. Classifiers built using the proposed approach are less conservative, yield higher margins, and hence are expected to generalize better than existing methods. Experimental results on synthetic and real-world datasets show that the proposed classifiers are better equipped to handle interval-valued uncertainty than the state of the art.

Sahely Bhadra, J. Saketha Nath, Aharon Ben-Tal, Chiranjib Bhattacharyya
Negative Encoding Length as a Subjective Interestingness Measure for Groups of Rules

We propose an interestingness measure, based on the Minimum Description Length Principle, for groups of mutually related classification rules. Unlike conventional methods, our interestingness measure has a theoretical foundation, has no parameters, is applicable to a group of any number of rules, and can exploit an initial hypothesis. We have integrated the interestingness measure with practical heuristic search and built a rule-group discovery method, CLARDEM (Classification Rule Discovery method based on an Extended-Mdlp). Extensive experiments using both real and artificial data confirm that CLARDEM can discover the correct concept from a small noisy data set and an approximate initial concept with high “discovery accuracy”.

Einoshin Suzuki
The Studies of Mining Frequent Patterns Based on Frequent Pattern Tree

Mining frequent patterns is the task of discovering groups of items that appear together more often than a user-specified threshold. Many approaches have been proposed for mining frequent patterns. However, either the search space or the memory requirement is huge, so the performance of previous approaches degrades when the database is massive or the threshold for mining frequent patterns is low. In order to decrease memory usage and speed up the mining process, we study methods for mining frequent patterns based on the frequent pattern tree. The idea of our approach is to construct a single FP-tree and traverse a subtree of the FP-tree to generate all the frequent patterns for an item without constructing any other subtrees. After traversing the subtree for an item, our approach merges and removes the subtree, shrinking the FP-tree step by step. We propose four methods based on this idea and compare them with the well-known FP-Growth algorithm, which also constructs an FP-tree but recursively mines frequent patterns by building conditional FP-trees.

Show-Jane Yen, Yue-Shi Lee, Chiu-Kuang Wang, Jung-Wei Wu, Liang-Yu Ouyang
Discovering Periodic-Frequent Patterns in Transactional Databases

Since mining frequent patterns from transactional databases involves an exponential mining space and generates a huge number of patterns, efficient discovery of a user-interest-based frequent pattern set becomes the first priority for a mining algorithm. In many real-world scenarios it is often sufficient to mine a small, interesting, representative subset of frequent patterns. Temporal periodicity of pattern appearance can be regarded as an important criterion for measuring the interestingness of frequent patterns in several applications. A frequent pattern is called periodic-frequent if it appears at regular, user-given intervals in the database. In this paper, we introduce the novel concept of mining periodic-frequent patterns from transactional databases. We use an efficient tree-based data structure, called the Periodic-frequent pattern tree (PF-tree in short), that captures the database contents in a highly compact manner and enables a pattern-growth mining technique to generate the complete set of periodic-frequent patterns in a database for user-given periodicity and support thresholds. The performance study shows that mining periodic-frequent patterns with the PF-tree is time- and memory-efficient and highly scalable as well.
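
A small sketch of the periodic-frequent test implied above (not the PF-tree structure itself): a pattern qualifies if its support meets the support threshold and no gap between consecutive occurrences, including the stretches before the first and after the last occurrence, exceeds the periodicity threshold.

def is_periodic_frequent(db, pattern, min_sup, max_per):
    pattern = set(pattern)
    tids = [t for t, items in enumerate(db, start=1) if pattern <= set(items)]
    if len(tids) < min_sup:
        return False
    # gaps include the run-in before the first and the tail after the last occurrence
    gaps = [a - b for a, b in zip(tids + [len(db)], [0] + tids)]
    return max(gaps) <= max_per

db = [['a', 'b'], ['b', 'c'], ['a', 'b', 'c'], ['b'], ['a', 'b']]
print(is_periodic_frequent(db, ['a', 'b'], min_sup=3, max_per=2))   # True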

Syed Khairuzzaman Tanbeer, Chowdhury Farhan Ahmed, Byeong-Soo Jeong, Young-Koo Lee
Quantifying Asymmetric Semantic Relations from Query Logs by Resource Allocation

In this paper we present a bipartite-network-based resource allocation (BNRA) method to extract and quantify semantic relations from large-scale query logs of a search engine. First, we construct a query-URL bipartite network from the query logs. Using BNRA, we extract asymmetric semantic relations between queries from the bipartite network. Asymmetric relations indicate that two related queries can be assigned different semantic relevance strengths with respect to each other, which conforms better to reality. We verify the validity of the method with query logs from the Chinese search engine Sogou, demonstrating that BNRA can effectively quantify semantic relations from query logs. We further construct query semantic networks and introduce several measures to analyze them. BNRA is not only ‘language oblivious’ and ‘content oblivious’, but can also be easily implemented in a parallel manner, which provides commercial search engines a feasible solution for handling large-scale query logs.
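
A generic rendering of two-step resource allocation on the query-URL bipartite graph (our sketch of the idea, not the paper's exact formulation): each query spreads unit resource evenly over its URLs, each URL spreads what it received evenly back over its queries, and the resource arriving at another query gives a naturally asymmetric relevance strength.

def bnra(query_urls):
    # query_urls: dict query -> set of associated URLs
    url_queries = {}
    for q, urls in query_urls.items():
        for u in urls:
            url_queries.setdefault(u, set()).add(q)
    w = {}
    for qi, urls in query_urls.items():
        for qj in query_urls:
            if qj == qi:
                continue
            s = sum(1.0 / (len(urls) * len(url_queries[u]))
                    for u in urls if qj in url_queries[u])
            if s:
                w[(qi, qj)] = s       # strength of qj as seen from qi
    return w

logs = {'q1': {'u1', 'u2'}, 'q2': {'u2'}, 'q3': {'u2', 'u3'}}
rel = bnra(logs)
print(rel[('q1', 'q2')], rel[('q2', 'q1')])   # 0.166..., 0.333...: asymmetric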

Zhiyuan Liu, Yabin Zheng, Maosong Sun
Acquiring Semantic Relations Using the Web for Constructing Lightweight Ontologies

Common techniques for acquiring semantic relations rely on static domain and linguistic resources, predefined patterns, and the presence of syntactic cues. We propose a hybrid approach which brings together established and novel techniques in lexical simplification, word disambiguation, and association inference for acquiring coarse-grained relations between potentially ambiguous and composite terms using only dynamic Web resources. Our experiments using terms from two different domains demonstrate promising preliminary results.

Wilson Wong, Wei Liu, Mohammed Bennamoun
Detecting Abnormal Events via Hierarchical Dirichlet Processes

Detecting abnormal events in video sequences is an important problem in computer vision and pattern recognition, and a large number of algorithms have been devised to tackle it. Previous state-based approaches all suffer from the problem of deciding the appropriate number of states, which is often difficult to do except by trial and error, an approach that may be infeasible in real-world applications. In this paper, we propose a more accurate and flexible algorithm for abnormal event detection from video sequences. Our three-phase approach first builds a set of weak classifiers using the Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM), and then applies an ensemble learning algorithm to filter out abnormal events. In the final phase, we derive abnormal activity models from the normal activity model to reduce the FP (false positive) rate in an unsupervised manner. The main advantage of our algorithm over previous ones is that it naturally captures the underlying features in abnormal event detection via the HDP-HMM. Experimental results on a real-world video sequence dataset show the effectiveness of our algorithm.

Xian-Xing Zhang, Hua Liu, Yang Gao, Derek Hao Hu
Active Learning for Causal Bayesian Network Structure with Non-symmetrical Entropy

Causal knowledge is crucial for facilitating comprehension, diagnosis, prediction, and control in automated reasoning. Active learning in causal Bayesian networks involves interventions by manipulating specific variables, and observing the patterns of change over other variables to derive causal knowledge. In this paper, we propose a new active learning approach that supports interventions with node selection. Our method admits a node selection criterion based on non-symmetrical entropy from the current data and a stop criterion based on structure entropy of the resulting networks. We examine the technical challenges and practical issues involved. Experimental results on a set of benchmark Bayesian networks are promising. The proposed method is potentially useful in many real-life applications where multiple instances are collected as a data set in each active learning step.

Guoliang Li, Tze-Yun Leong
A Comparative Study of Bandwidth Choice in Kernel Density Estimation for Naive Bayesian Classification

Kernel density estimation (KDE) is an important method in nonparametric learning. While KDE has been studied extensively in the context of accuracy of distribution estimation, it has not been studied extensively in the context of classification. This paper studies nine bandwidth selection schemes for kernel density estimation in Naive Bayesian classification context, using 52 machine learning benchmark datasets. The contributions of this paper are threefold. First, it shows that some commonly used and very sophisticated bandwidth selection schemes do not give good performance in Naive Bayes. Surprisingly, some very simple bandwidth selection schemes give statistically significantly better performance. Second, it shows that kernel density estimation can achieve statistically significantly better classification performance than a commonly used discretization method in Naive Bayes, but only when appropriate bandwidth selection schemes are applied. Third, this study gives bandwidth distribution patterns for the investigated bandwidth selection schemes.
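
A compact sketch of a kernel-density Naive Bayes classifier, shown with Silverman's rule-of-thumb bandwidth as just one of the many selection schemes the paper compares (an illustration, not the paper's experimental code):

import numpy as np

def silverman(x):
    # rule-of-thumb bandwidth; the small constant avoids a zero bandwidth
    return 1.06 * x.std() * len(x) ** (-0.2) + 1e-9

def kde_logpdf(x, sample, h):
    z = (x - sample[:, None]) / h
    return np.log(np.exp(-0.5 * z ** 2).mean(axis=0) / (h * np.sqrt(2 * np.pi)))

def kde_nb_predict(Xtr, ytr, Xte):
    classes = np.unique(ytr)
    scores = []
    for c in classes:
        Xc = Xtr[ytr == c]
        logp = np.log(len(Xc) / len(Xtr))      # class prior
        for j in range(Xtr.shape[1]):          # Naive Bayes independence assumption
            logp = logp + kde_logpdf(Xte[:, j], Xc[:, j], silverman(Xc[:, j]))
        scores.append(logp)
    return classes[np.argmax(np.vstack(scores), axis=0)]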

Bin Liu, Ying Yang, Geoffrey I. Webb, Janice Boughton
Analysis of Variational Bayesian Matrix Factorization

Recently, the variational Bayesian approximation was applied to probabilistic matrix factorization and shown to perform very well in experiments. However, its good performance was not completely understood beyond its experimental success. The purpose of this paper is to theoretically elucidate properties of a variational Bayesian matrix factorization method. In particular, its mechanism of avoiding overfitting is analyzed. Our analysis relies on the key fact that the matrix factorization model induces non-identifiability, i.e., the mapping between factorized matrices and the original matrix is not one-to-one. The positive-part James-Stein shrinkage operator and the Marcenko-Pastur law—the limiting distribution of eigenvalues of the central Wishart distribution—play important roles in our analysis.

Shinichi Nakajima, Masashi Sugiyama
Variational Bayesian Approach for Long-Term Relevance Feedback

This paper presents a Bayesian approach to address two important issues of image recommendation: (1) change in the long-term needs of users and (2) evolution of image collections. Users are offered a new interaction modality which allows them to provide either positive or negative relevance feedback (RF) data to express their recent needs. Then, an efficient variational online learning algorithm updates both user and product collection models, favoring recent RF data. The proposed method is general and can be applied in collaborative filtering. Experimental results demonstrate the importance of maintaining up-to-date user models for rating prediction accuracy.

Sabri Boutemedjet, Djemel Ziou
Detecting Link Hijacking by Web Spammers

Since current search engines employ link-based ranking algorithms as an important tool for ranking sites, Web spammers are making a significant effort to manipulate the link structure of the Web, so-called link spamming. Link hijacking is an indispensable technique for link spamming that brings ranking scores from normal sites to target spam sites. In this paper, we propose a link analysis technique for finding hijacked sites using modified PageRank algorithms. We performed experiments on a large-scale Japanese Web archive and evaluated the accuracy of our method. The detection precision of our approach improved by about 25% over a naive approach.

Young-joo Chung, Masashi Toyoda, Masaru Kitsuregawa
A Data Driven Ensemble Classifier for Credit Scoring Analysis

This study focuses on predicting whether a credit applicant can be categorized as good, bad, or borderline from the information initially supplied. Given the importance of this problem, many researchers have recently worked on ensembles of classifiers. However, to the best of our knowledge, unrepresentative samples drastically reduce the accuracy of the deployed classifier, and few have attempted to preprocess the input samples into more homogeneous cluster groups and then fit the ensemble classifier accordingly. For this reason, we introduce the concept of class-wise classification as a preprocessing step in order to obtain an efficient ensemble classifier. This strategy works better than a direct ensemble of classifiers without the preprocessing step. The proposed ensemble classifier is constructed by incorporating several data mining techniques, mainly optimal associate binning to discretize continuous values; a neural network, a support vector machine, and a Bayesian network are used to augment the ensemble classifier. In particular, the Markov blanket concept of the Bayesian network allows for a natural form of feature selection, which provides a basis for mining association rules.

Nan-Chen Hsieh, Lun-Ping Hung, Chia-Ling Ho
A Multi-partition Multi-chunk Ensemble Technique to Classify Concept-Drifting Data Streams

We propose a multi-partition, multi-chunk ensemble classifier based data mining technique to classify concept-drifting data streams. Existing ensemble techniques for classifying concept-drifting data streams follow a single-partition, single-chunk approach, in which a single data chunk is used to train one classifier. In our approach, we train a collection of v classifiers from r consecutive data chunks using v-fold partitioning of the data, and build an ensemble of such classifiers. By introducing this multi-partition, multi-chunk ensemble technique, we significantly reduce classification error compared to the single-partition, single-chunk ensemble approaches. We have theoretically justified the usefulness of our algorithm, and empirically proved its effectiveness over other state-of-the-art stream classification techniques on synthetic data and real botnet traffic.
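
A sketch of the multi-partition, multi-chunk construction as we read it (a generic scikit-learn rendering, not the authors' code): pool the r most recent chunks, v-fold partition the pooled data, train one classifier per fold-complement, and let the ensemble vote.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def build_ensemble(chunks, labels, r=2, v=3):
    X = np.vstack(chunks[-r:])                 # r consecutive recent chunks
    y = np.concatenate(labels[-r:])
    ensemble = []
    for train_idx, _ in KFold(n_splits=v, shuffle=True, random_state=0).split(X):
        ensemble.append(DecisionTreeClassifier().fit(X[train_idx], y[train_idx]))
    return ensemble

def vote(ensemble, X):
    preds = np.stack([clf.predict(X) for clf in ensemble])   # integer class labels assumed
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)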

Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Bhavani Thuraisingham
Parameter Estimation in Semi-Random Decision Tree Ensembling on Streaming Data

The induction error in random tree ensembling results mainly from the strength of the decision trees and the dependency between base classifiers. In order to reduce the errors due to both factors, a Semi-Random Decision Tree Ensembling (SRDTE) for mining streaming data is proposed, based on our previous work on SRMTDS. The model contains semi-random decision trees that are independent in the generation process and have no interaction with each other in the individual classification decisions. The main idea is to minimize the correlation among the classifiers. We claim that the strength of the decision trees is closely related to the estimated values of the parameters, including the height of a tree, the number of trees, and the parameter n_min in the Hoeffding bound. We analyze these parameters of the model and design strategies for better adaptation to streaming data. The main strategies include incremental generation of sub-trees after seeing real training instances, a data structure for quick search, and a voting mechanism for classification. Our evaluation under the 0-1 loss function shows that SRDTE has improved performance in terms of predictive accuracy and robustness. We have applied SRDTE to e-business data streams and demonstrated its feasibility and effectiveness.

Peipei Li, Qianhui Liang, Xindong Wu, Xuegang Hu
Exploiting the Block Structure of Link Graph for Efficient Similarity Computation

In many real-world domains, a link graph is one of the most effective ways to model the relationships between objects. Measuring the similarity of objects in a link graph has been studied by many researchers, but an effective and efficient method is still wanting. Based on our observation of link graphs from real domains, we find that a block structure naturally exists. We propose an algorithm called BlockSimRank, which partitions the link graph into blocks and obtains the similarity of each node pair in the graph efficiently. Our method is based on random walks on a two-layer model, with time complexity as low as O(n^{4/3}) and low memory need. Experiments show that the accuracy of BlockSimRank is acceptable while its time cost is the lowest.
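
For reference, the quantity being accelerated is SimRank; a naive iteration over all node pairs looks as follows (BlockSimRank's contribution is avoiding this cost by exploiting the block structure):

def simrank(in_nbrs, c=0.8, iters=5):
    # in_nbrs: dict node -> list of in-neighbours
    nodes = list(in_nbrs)
    sim = {u: {v: 1.0 if u == v else 0.0 for v in nodes} for u in nodes}
    for _ in range(iters):
        new = {}
        for u in nodes:
            new[u] = {}
            for v in nodes:
                if u == v:
                    new[u][v] = 1.0
                elif in_nbrs[u] and in_nbrs[v]:
                    total = sum(sim[a][b] for a in in_nbrs[u] for b in in_nbrs[v])
                    new[u][v] = c * total / (len(in_nbrs[u]) * len(in_nbrs[v]))
                else:
                    new[u][v] = 0.0
        sim = new
    return sim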

Pei Li, Yuanzhe Cai, Hongyan Liu, Jun He, Xiaoyong Du
Online Feature Selection Algorithm with Bayesian ℓ1 Regularization

We propose a novel online-learning-based feature selection algorithm for supervised learning in the presence of a huge number of irrelevant features. The key idea of the algorithm is to decompose a nonlinear problem into a set of locally linear ones through local learning, and then estimate the relevance of features globally in a large-margin framework with ℓ1 regularization. Unlike batch learning, the regularization parameter in online learning has to be tuned on the fly as the training data grow. We address this issue within the Bayesian learning paradigm, and provide an analytic solution for automatic estimation of the regularization parameter via variational methods. Numerical experiments on a variety of benchmark data sets demonstrate the effectiveness of the newly proposed feature selection algorithm.

Yunpeng Cai, Yijun Sun, Jian Li, Steve Goodison
Feature Selection for Local Learning Based Clustering

For most clustering algorithms, performance strongly depends on the data representation. In this paper, we attempt to obtain better data representations through feature selection, particularly for Local Learning based Clustering (LLC) [1]. We assign a weight to each feature and incorporate it into the built-in regularization of the LLC algorithm to take into account the relevance of each feature for the clustering. Accordingly, the weights are estimated iteratively with the clustering. We show that the resulting weighted regularization with an additional constraint on the weights is equivalent to a known sparsity-promoting penalty; thus the weights for irrelevant features can be driven towards zero. Experiments on several benchmark datasets demonstrate the effectiveness of the proposed method.

Hong Zeng, Yiu-ming Cheung
RV-SVM: An Efficient Method for Learning Ranking SVM

Learning ranking (or preference) functions has become an important data mining task in recent years, as various applications have been found in information retrieval. Among rank learning methods, the ranking SVM has been favorably applied to various applications, e.g., optimizing search engines and improving data retrieval quality. In this paper, we first develop a 1-norm ranking SVM that is faster in testing than the standard ranking SVM, and then propose the Ranking Vector SVM (RV-SVM), which revises the 1-norm ranking SVM for faster training. The number of variables in the RV-SVM is significantly smaller; thus the RV-SVM trains much faster than the other ranking SVMs. We experimentally compared the RV-SVM with the state-of-the-art rank learning method provided in SVM-light. The RV-SVM uses far fewer support vectors and trains much faster for nonlinear kernels than SVM-light. The accuracies of RV-SVM and SVM-light are comparable on relatively large data sets. Our implementation of RV-SVM is posted at http://iis.postech.ac.kr/rv-svm.

Hwanjo Yu, Youngdae Kim, Seungwon Hwang
A Kernel Framework for Protein Residue Annotation

Over the last decade several prediction methods have been developed for determining structural and functional properties of individual protein residues using sequence and sequence-derived information. Most of these methods are based on support vector machines, as they provide accurate and generalizable prediction models. We developed a general-purpose protein residue annotation toolkit (ProSAT) to allow biologists to formulate residue-wise prediction problems. ProSAT formulates the annotation problem as a classification or regression problem using support vector machines. For every residue, ProSAT captures local information (any sequence-derived information) around the residue to create fixed-length feature vectors. ProSAT implements accurate and fast kernel functions, and also introduces a flexible window-based encoding scheme that allows better capture of signals for certain prediction problems. In this work we evaluate the performance of ProSAT on the disorder prediction and contact order estimation problems, studying the effect of the different kernels introduced here. ProSAT shows better or at least comparable performance to state-of-the-art prediction systems. In particular, ProSAT has proven to be the best-performing transmembrane-helix predictor on an independent blind benchmark.

Availability: http://bio.dtc.umn.edu/prosat

Huzefa Rangwala, Christopher Kauffman, George Karypis
Dynamic Exponential Family Matrix Factorization

We propose a new approach to modeling time-varying relational data, such as e-mail transactions, based on a dynamic extension of matrix factorization. To effectively estimate the true relationships behind a sequence of noise-corrupted relational matrices, their dynamic evolution is modeled in a space of low-rank matrices. The observed matrices are assumed to be sampled from an exponential family distribution that has the low-rank matrix as its natural parameters. We apply the sequential Bayesian framework to track the variations of the true parameters. In experiments using both artificial and real-world datasets, we demonstrate that our method can estimate time-varying true relations from noisy observations more effectively than existing methods.

Kohei Hayashi, Jun-ichiro Hirayama, Shin Ishii
A Nonparametric Bayesian Learning Model: Application to Text and Image Categorization

In this paper, a nonparametric Bayesian infinite mixture model is introduced. The adoption of this model is motivated by its flexibility: it does not require the number of mixture components to be specified in advance, and estimates it in a principled manner. Our approach relies on the estimation of the posterior distribution of clusterings using a Gibbs sampler. Through applications involving text and image categorization, we show that categorization via infinite mixture models offers more powerful and robust performance than classic finite mixtures.

Nizar Bouguila, Djemel Ziou

Short Papers

Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem

The class imbalance problem occurs in various disciplines when one of the target classes has a tiny number of instances compared to the other classes. A typical classifier normally ignores or neglects to detect a minority class due to its small number of instances. SMOTE is one of the over-sampling techniques that remedies this situation: it generates minority instances within the overlapping regions. However, SMOTE randomly synthesizes the minority instances along a line joining a minority instance and its selected nearest neighbours, ignoring nearby majority instances. Our technique, called Safe-Level-SMOTE, carefully samples minority instances along the same line with a different weight, called the safe level. The safe level is computed from nearest-neighbour minority instances. By synthesizing minority instances closer to regions with a larger safe level, we achieve better accuracy than SMOTE and Borderline-SMOTE.
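
A condensed sketch of the safe-level computation and interpolation (the published pseudocode distinguishes a few more cases, and the parameter names here are ours):

import numpy as np

def safe_level(x, X, y, minority, k=5):
    # number of minority instances among the k nearest neighbours of x
    nbrs = np.argsort(np.linalg.norm(X - x, axis=1))[1:k + 1]
    return int(np.sum(y[nbrs] == minority))

def safe_level_smote_point(p, n, X, y, minority, k=5, rng=np.random):
    # p: minority instance, n: one of its minority nearest neighbours
    sl_p = safe_level(p, X, y, minority, k)
    sl_n = safe_level(n, X, y, minority, k)
    if sl_p == 0 and sl_n == 0:
        return None                            # noise region: generate nothing
    if sl_n == 0:
        gap = 0.0                              # neighbour unsafe: stay at p
    elif sl_p >= sl_n:
        gap = rng.uniform(0, sl_n / sl_p)      # p is safer: stay close to p
    else:
        gap = rng.uniform(1 - sl_p / sl_n, 1)  # n is safer: stay close to n
    return p + gap * (n - p)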

Chumphol Bunkhumpornpat, Krung Sinapiromsaran, Chidchanok Lursinsap
Using Highly Expressive Contrast Patterns for Classification - Is It Worthwhile?

Classification is an important task in data mining. Contrast patterns, such as emerging patterns, have been shown to be powerful for building classifiers, but they rarely exist in sparse data. Recently proposed disjunctive emerging patterns are highly expressive and can potentially overcome this limitation. Simple contrast patterns only allow simple conjunctions, whereas disjunctive patterns additionally allow expressions of disjunctions. This paper investigates whether expressive contrasts are beneficial for classification. We adopt a statistical methodology for eliminating noisy patterns. Our experiments identify circumstances where expressive patterns can improve over previous contrast-pattern-based classifiers. We also present guidelines on i) when to use expressive patterns, based on the nature of the given data, and ii) how to choose between the different types of contrast patterns for building a classifier.

Elsa Loekito, James Bailey
Arif Index for Predicting the Classification Accuracy of Features and Its Application in Heart Beat Classification Problem

In this paper, the Arif index is proposed, which can be used to assess the discrimination power of features in pattern classification problems. The discrimination power of features plays an important role in the classification accuracy of a particular classifier applied to a pattern classification problem. Optimizing the performance of a classifier requires prior knowledge of the maximum achievable accuracy in pattern classification using a particular set of features. Moreover, it is also desirable to know whether this set of features is separable by a decision boundary of arbitrary complexity. The proposed index varies linearly with the overlap of features of different classes in the feature space and hence can be used to predict the classification accuracy achievable by some optimal classifier. Using synthetic data, it is shown that the predicted accuracy and the Arif index are very strongly correlated with each other (R² = 0.99). Implementation of the index is simple and time-efficient. The index was tested on an arrhythmia beat classification problem, and the predicted accuracy was found to be consistent with the reported results.

Muhammad Arif, Fayyaz A. Afsar, Muhammad Usman Akram, Adnan Fida
UCI++: Improved Support for Algorithm Selection Using Datasetoids

As companies employ a larger number of models, the problem of algorithm (and parameter) selection is becoming increasingly important. Two approaches to obtaining empirical knowledge that is useful for that purpose are empirical studies and metalearning. However, most empirical (meta)knowledge is obtained from a relatively small set of datasets. In this paper, we propose a method to obtain a large number of datasets, based on a simple transformation of existing datasets, referred to as datasetoids. We test our approach on the problem of using metalearning to predict when to prune decision trees. The results show significant improvement when using datasetoids. Additionally, we identify a number of potential anomalies in the generated datasetoids and propose methods to solve them.
Carlos Soares
Accurate Synthetic Generation of Realistic Personal Information

A large portion of data collected by many organisations today is about people, and often contains personal identifying information, such as names and addresses. Privacy and confidentiality are of great concern when such data is being shared between organisations or made publicly available. Research in (privacy-preserving) data mining and data linkage is suffering from a lack of publicly available real-world data sets that contain personal information, and therefore experimental evaluations can be difficult to conduct. In order to overcome this problem, we have developed a data generator that allows flexible creation of synthetic data containing personal information with realistic characteristics, such as frequency distributions, attribute dependencies, and error probabilities. Our generator significantly improves earlier approaches, and allows the generation of data for individuals, families and households.

Peter Christen, Agus Pudjijono
An Efficient Approximate Protocol for Privacy-Preserving Association Rule Mining

The secure scalar product (or dot product) is probably the most commonly used sub-protocol in privacy-preserving data mining. As such, a lot of attention has been focused on coming up with secure protocols for computing it. However, an inherent problem with these protocols is the extremely high computation cost, especially when the dot product needs to be carried out over large vectors. This is quite common for vertically partitioned data, and is a real problem. In this paper, we present ways to efficiently compute the approximate dot product. We implement the dot product protocol and demonstrate the quality of the approximation. Our dot product protocol can be used to securely and efficiently compute association rules from data vertically partitioned between two parties.

Murat Kantarcioglu, Robert Nix, Jaideep Vaidya
Information Extraction from Thai Text with Unknown Phrase Boundaries

Using sliding-window rule application and extraction filtering techniques, we propose a framework for extracting semantic frames from Thai textual phrases with unknown boundaries based on patterns of triggering terms. A supervised rule learning algorithm is used for constructing multi-slot extraction rules from hand-tagged training phrases. A filtering module is introduced for predicting rule application across phrase boundaries based on instantiation features of rule internal wildcards. The framework is applied to text documents in three domains with different target-phrase density and average lengths. The experimental results show that the filtering module improves precision and preserves high recall satisfactorily, yielding extraction performance comparable to frame extraction with manually identified phrase boundaries.

Peerasak Intarapaiboon, Ekawit Nantajeewarawat, Thanaruk Theeramunkong
A Corpus-Based Approach for Automatic Thai Unknown Word Recognition using Ensemble Learning Techniques

This paper presents a corpus-based approach for automatic unknown word recognition in Thai. The approach applies an ensemble learning technique to generate a model for classifying unknown word candidates using features obtained from a corpus. We propose a technique called “group-based evaluation by ranking”. It clusters the unknown word candidates into groups based on their occurring locations. The candidate with the highest accuracy is then identified as an unknown word. In this task, the number of positive instances is dominantly smaller than that of negative instances, forming an unbalanced data set. To improve the prediction accuracy, we apply a boosting technique with “voting under group-based evaluation by ranking”. We have conducted experiments on real-world data to evaluate the performance of the proposed approach, comparing the accuracy of our technique with an ordinary naïve Bayes technique. Our technique achieves an accuracy of 90.93±0.50% when the first rank is selected, and 97.90±0.26% when candidates up to the tenth rank are considered, a 6.79% to 8.45% improvement.

Jakkrit TeCho, Cholwich Nattee, Thanaruk Theeramunkong
A Hybrid Approach to Improve Bilingual Multiword Expression Extraction

We propose a hybrid approach for bilingual multiword expression extraction with two phases. In the first phase, many candidates are extracted from the corpus by statistical methods; the multiple sequence alignment algorithm is sensitive to flexible multiword expressions. In the second phase, error-driven rules and patterns are extracted from the corpus, and these trained rules are used to filter the candidates. Since the system has many parameters, a series of experiments is designed to achieve the best performance. Experimental results show that our approach achieves good performance.

Jianyong Duan, Mei Zhang, Lijing Tong, Feng Guo
Addressing the Variability of Natural Language Expression in Sentence Similarity with Semantic Structure of the Sentences

In this paper, we present a new approach that incorporates the semantic structure of sentences, in the form of verb-argument structure, to measure semantic similarity between sentences. The variability of natural language expression makes it difficult for existing text similarity measures to accurately identify semantically similar sentences, since sentences conveying the same fact or concept may be composed lexically and syntactically differently. Conversely, sentences which are lexically similar may not necessarily convey the same meaning. This has a significant impact on the performance of many text mining applications where sentence-level judgment is involved. The evaluation shows that, by processing sentences at the semantic level, the performance of similarity measures is significantly improved.

Palakorn Achananuparp, Xiaohua Hu, Christopher C. Yang
Scalable Web Mining with Newistic

Newistic is a web mining platform that collects and analyses documents crawled from the Internet. Although it currently processes news articles, it can be easily adapted to any other form of text. Data mining functions performed by the system are categorization, clustering and named entity extraction. The main design concern of the system is scalability, which is achieved by a modular architecture that allows multiple instances of the same component to be run in parallel.

This paper presents a novel algorithm for analysing web pages which tries to determine the title and text of a news item directly from the HTML code, discarding noise such as menus, ads, or copyright notices. Another contribution of this paper is the application of the Quality Threshold clustering algorithm for document clustering. Additionally, the algorithm has been optimized to increase its speed.

Ovidiu Dan, Horatiu Mocian
Building a Text Classifier by a Keyword and Unlabeled Documents

Traditional approaches for building text classifiers usually require many labeled documents, which are expensive to obtain. In this paper, we study the problem of building a text classifier from a keyword and unlabeled documents, so as to avoid labeling documents manually. First, we expand the keyword into a set of query terms and retrieve a set of documents from the unlabeled collection. Then, from the documents retrieved, we mine a set of positive documents. Third, with the help of these positive documents, more positive documents are extracted from the unlabeled documents. Finally, we train a PU text classifier with these positive and unlabeled documents. Our experimental results on the 20Newsgroup dataset show that the proposed approach helps build excellent text classifiers.

Qiang Qiu, Yang Zhang, Junping Zhu
A Discriminative Approach to Topic-Based Citation Recommendation

In this paper, we present a study of a novel problem, topic-based citation recommendation, which involves recommending papers to be referred to. Traditionally, this problem has been treated as an engineering issue and dealt with using heuristics. This paper gives a formalization of topic-based citation recommendation and proposes a discriminative approach to it. Specifically, it proposes a two-layer Restricted Boltzmann Machine model, called RBM-CS, which can discover topic distributions of paper content and citation relationships simultaneously. Experimental results demonstrate that RBM-CS can significantly outperform baseline methods for citation recommendation.

Jie Tang, Jing Zhang
Romanization of Thai Proper Names Based on Popularity of Usages

The lack of standards for Romanization of Thai proper names makes searching a challenging task. This is particularly important when searching for people-related documents based on the orthographic representation of names using either solely Thai or English alphabets. Romanization based directly on the names’ pronunciations often fails to deliver the exact English spellings due to the non-one-to-one mapping from Thai to English spelling and to personal preferences. This paper proposes a Romanization approach in which popularity of usage is taken into consideration. Thai names are parsed into sequences of grams, units of syllable size or larger governed by pronunciation and spelling constraints in both the Thai and English writing systems. A gram lexicon is constructed from a corpus of more than 130,000 names, and statistical models are trained based on it. The proposed method significantly outperforms the current Romanization approach: approximately 46% to 75% of the correct English spellings are covered as the number of proposed hypotheses increases from 1 to 15.

Akegapon Tangverapong, Atiwong Suchato, Proadpran Punyabukkana
Budget Semi-supervised Learning

In this paper we propose to study budget semi-supervised learning, i.e., semi-supervised learning with a resource budget, such as a limited memory insufficient to accommodate and/or process all available unlabeled data. This setting is of practical importance because in most real scenarios, although there may exist abundant unlabeled data, the computational resources that can be used are not unlimited. Effective budget semi-supervised learning algorithms should be able to adjust their behavior according to the given resource budget: roughly, the more resources available, the more the unlabeled data can be exploited. As an example, in this paper we show that this is achievable by a simple yet effective method.

Zhi-Hua Zhou, Michael Ng, Qiao-Qiao She, Yuan Jiang
When does Co-training Work in Real Data?

Co-training, a paradigm of semi-supervised learning, may effectively alleviate the data scarcity problem (i.e., the lack of labeled examples) in supervised learning. The standard two-view co-training requires that the dataset be described by two views of attributes, and previous theoretical studies proved that if the two views satisfy the sufficiency and independence assumptions, co-training is guaranteed to work well. However, little work has been done on how these assumptions can be empirically verified on given datasets. In this paper, we first propose novel approaches to verify empirically the two assumptions of co-training. We then propose a simple heuristic to split a single view of attributes into two views, and discover regularities in the sufficiency and independence thresholds for the standard two-view co-training to work well. Our empirical results not only coincide well with the previous theoretical findings, but also provide a practical guideline to decide when co-training should work well on a given dataset.
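For reference, the standard two-view co-training loop that the verified assumptions are meant to license looks roughly as follows; the logistic regression base learners and the number of examples added per round are illustrative assumptions.

# Minimal sketch of standard two-view co-training.
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X1, X2, y, labeled, unlabeled, rounds=10, per_round=5):
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        c1 = LogisticRegression(max_iter=1000).fit(X1[labeled], y[labeled])
        c2 = LogisticRegression(max_iter=1000).fit(X2[labeled], y[labeled])
        # Each view labels the unlabeled examples it is most confident about.
        for clf, X in ((c1, X1), (c2, X2)):
            if not unlabeled:
                break
            conf = clf.predict_proba(X[unlabeled]).max(axis=1)
            for i in sorted(np.argsort(conf)[-per_round:], reverse=True):
                idx = unlabeled.pop(i)
                y[idx] = clf.predict(X[idx:idx + 1])[0]
                labeled.append(idx)
    return c1, c2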

Charles X. Ling, Jun Du, Zhi-Hua Zhou
Classification of Audio Signals Using a Bhattacharyya Kernel-Based Centroid Neural Network

A novel approach for the classification of audio signals using a Bhattacharyya Kernel-based Centroid Neural Network (BK-CNN) is proposed in this paper. The proposed classifier is based on the Centroid Neural Network (CNN) and also exploits the advantages of the kernel method for mapping input data into a higher dimensional feature space. Furthermore, since the feature vectors of audio signals are modelled by Gaussian Probability Density Functions (GPDF), the classification procedure is performed using the Bhattacharyya distance as the distance measure of the proposed classifier. Experiments on various audio data sets demonstrate that the proposed classification scheme based on BK-CNN outperforms conventional algorithms including the Self-Organizing Map (SOM) and CNN.
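The Bhattacharyya distance between two Gaussians, which the classifier uses as its distance measure, has a closed form that can be transcribed directly; this is the textbook formula, independent of the BK-CNN specifics.

# Bhattacharyya distance between Gaussians (mu1, cov1) and (mu2, cov2).
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    cov = (cov1 + cov2) / 2.0
    diff = mu1 - mu2
    # Mahalanobis-like term under the averaged covariance.
    term1 = diff @ np.linalg.solve(cov, diff) / 8.0
    # Term penalizing differing covariance shapes.
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2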

Dong-Chul Park, Yunsik Lee, Dong-Min Woo
Sparse Kernel Learning and the Relevance Units Machine

The relevance vector machine (RVM) is a state-of-the-art technique for constructing sparse kernel regression models [1,2,3,4]. It not only generates a much sparser model but also provides better generalization performance than the standard support vector machine (SVM). In RVM and SVM, relevance vectors (RVs) and support vectors (SVs) are both selected from the input vector set, which may limit model flexibility. In this paper we propose a new sparse kernel model called the Relevance Units Machine (RUM). RUM follows the idea of RVM under the Bayesian framework but relaxes the constraint that RVs have to be selected from the input vectors: RUM treats relevance units as part of the parameters of the model. As a result, a RUM maintains all the advantages of RVM and offers superior sparsity. The new algorithm is demonstrated to possess considerable computational advantages over well-known state-of-the-art algorithms.

Junbin Gao, Jun Zhang
Pairwise Constrained Clustering for Sparse and High Dimensional Feature Spaces

Clustering high dimensional data with sparse features is challenging because pairwise distances between data items are not informative in high dimensional space. To address this challenge, we propose two novel semi-supervised clustering methods that incorporate prior knowledge in the form of pairwise cluster membership constraints. In particular, we project the high-dimensional data onto a subspace of much lower dimension, where the rough clustering structure defined by the prior knowledge is strengthened. Metric learning is then performed on the subspace to construct more informative pairwise distances. We also propose to propagate constraints locally to improve the informativeness of pairwise distances. When the new methods are evaluated on two real benchmark data sets, they show substantial improvement using only limited prior knowledge.

Su Yan, Hai Wang, Dongwon Lee, C. Lee Giles
Clustering Documents Using a Wikipedia-Based Concept Representation

This paper shows how Wikipedia and the semantic knowledge it contains can be exploited for document clustering. We first create a concept-based document representation by mapping the terms and phrases within documents to their corresponding articles (or concepts) in Wikipedia. We also develop a similarity measure that evaluates the semantic relatedness between the concept sets of two documents. We test the concept-based representation and the similarity measure on two standard text document datasets. Empirical results show that, although further optimizations could be performed, our approach already improves upon related techniques.

Anna Huang, David Milne, Eibe Frank, Ian H. Witten
An Instantiation of Hierarchical Distance-Based Conceptual Clustering for Propositional Learning

In this work we analyse the relationship between distance and generalisation operators for real numbers, nominal data and tuples in the context of hierarchical distance-based conceptual clustering (HDCC). HDCC is a general approach to conceptual clustering that extends the traditional algorithm for hierarchical clustering by producing conceptual generalisations of the discovered clusters. This makes it possible to combine the flexibility of changing distances for several clustering problems with the advantage of having concepts, which are crucial for tasks such as summarisation and descriptive data mining in general. We propose a set of generalisation operators and distances for the data types mentioned above, and we analyse the properties they satisfy on the basis of three different levels of agreement between the clustering hierarchy obtained from the linkage distance and the hierarchy obtained by using generalisation operators.

Ana Funes, Cesar Ferri, Jose Hernández-Orallo, Maria José Ramírez-Quintana
Computing Substitution Matrices for Genomic Comparative Analysis

Substitution matrices describe the rates of mutating one character in a biological sequence to another character, and are important for many knowledge discovery tasks such as phylogenetic analysis and sequence alignment. Computing substitution matrices for very long genomic sequences of divergent or even unrelated species requires sensitive algorithms that can take into account differences in the composition of the sequences. We present a novel algorithm that addresses this by computing a nucleotide substitution matrix specifically for the two genomes being aligned. The method is grounded in information theory and the expectation maximisation framework: the algorithm iteratively uses compression to align the sequences, estimates the matrix from the alignment, and then applies the matrix to find a better alignment, until convergence. Our method reconstructs, with high accuracy, the substitution matrix for synthesised data generated from a known matrix with introduced noise. The model is then successfully applied to real data for various malaria parasite genomes, which have differing phylogenetic distances and a composition that lessens the effectiveness of standard statistical analysis techniques.

Minh Duc Cao, Trevor I. Dix, Lloyd Allison
Mining Both Positive and Negative Impact-Oriented Sequential Rules from Transactional Data

Traditional sequential pattern mining deals with positive correlation between sequential patterns only, without considering negative relationships between them. In this paper, we present the notion of impact-oriented negative sequential rules, in which the left side is a positive sequential pattern or its negation, and the right side is a predefined outcome or its negation. Impact-oriented negative sequential rules are formally defined to show the impact of sequential patterns on the outcome, and an efficient algorithm is designed to discover both positive and negative impact-oriented sequential rules. Experimental results on both synthetic data and real-life data show the efficiency and effectiveness of the proposed technique.

Yanchang Zhao, Huaifeng Zhang, Longbing Cao, Chengqi Zhang, Hans Bohlscheid
Aggregated Subset Mining

The usual data mining setting uses the full amount of data to derive patterns for different purposes. Taking cues from machine learning techniques, we explore ways to divide the data into subsets, mine patterns on them and use post-processing techniques for acquiring the result set. Using the patterns as features for a classification task to evaluate their quality, we compare the different subset compositions, and selection techniques. The two main results – that small independent sets are better suited than large amounts of data, and that uninformed selection techniques perform well – can to a certain degree be explained by quantitative characteristics of the derived pattern sets.

Albrecht Zimmermann, Björn Bringmann
Hot Item Detection in Uncertain Data

An object o of a database $\mathcal{D}$ is called a hot item if there is a sufficiently large population of other objects in $\mathcal{D}$ that are similar to o. In other words, hot items are objects within a dense region of other objects and provide a basis for many density-based data mining techniques. Intuitively, objects that share their attribute values with a lot of other objects could be potentially interesting as they show a typical occurrence of objects in the database. Also, there are many application domains, e.g. sensor databases, traffic management or recognition systems, where objects have vague and uncertain attributes. We propose an approach for the detection of potentially interesting objects (hot items) of an uncertain database in a probabilistic way. An efficient algorithm is presented which detects hot items, assigning to each object o a confidence value that reflects the likelihood that o is a hot item. In an experimental evaluation we show that our method can compute the results very efficiently compared to its competitors.
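A naive Monte-Carlo baseline makes the probabilistic notion concrete: sample possible worlds from the uncertain attributes and count how often each object has at least m similar neighbours within distance eps. The paper's contribution is an algorithm that avoids this brute-force computation; the sketch below only illustrates the definition, and the sampler interface is an assumption.

# Monte-Carlo sketch of hot-item confidence for uncertain objects.
# samplers: one callable per object, each drawing an attribute vector.
import numpy as np

def hot_item_confidence(samplers, eps, m, trials=200):
    n = len(samplers)
    hot_counts = np.zeros(n)
    for _ in range(trials):
        world = np.array([draw() for draw in samplers])  # one possible world
        for i in range(n):
            d = np.linalg.norm(world - world[i], axis=1)
            if (d <= eps).sum() - 1 >= m:    # m similar objects besides i
                hot_counts[i] += 1
    return hot_counts / trials               # likelihood each object is hot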

Thomas Bernecker, Hans-Peter Kriegel, Matthias Renz, Andreas Zuefle
Spanning Tree Based Attribute Clustering

Attribute clustering has been previously employed to detect statistical dependence between subsets of variables. We propose a novel attribute clustering algorithm motivated by research on complex networks, called the Star Discovery algorithm. The algorithm partitions a maximum spanning tree and indirectly discards inconsistent edges by starting from appropriate initial modes, thereby generating stable clusters. It discovers sound clusters through simple graph operations and achieves significant computational savings. We compare the Star Discovery algorithm against earlier attribute clustering algorithms and evaluate its performance in several domains.

Yifeng Zeng, Jorge Cordero Hernandez, Shuyuan Lin
The Effect of Varying Parameters and Focusing on Bus Travel Time Prediction

Travel time prediction is an important tool for the planning tasks of mass transit and logistics companies. In this paper we investigate the use of regression methods for the problem of predicting the travel time of buses at a Portuguese public transportation company. More specifically, we empirically evaluate the impact of varying parameters on the performance of different regression algorithms, such as support vector machines (SVM), random forests (RF) and projection pursuit regression (PPR). We also evaluate the impact of focusing tasks (example selection, domain value definition and feature selection) on the accuracy of those algorithms. Concerning the algorithms, we observe that 1) RF is quite robust to the choice of parameters and focusing methods; 2) the choice of parameters for SVM can be made independently of focusing methods; while 3) for PPR they should be selected simultaneously. For the focusing methods, we observe that the strongest effect is obtained with example selection, particularly in combination with SVM.

João M. Moreira, Carlos Soares, Alípio M. Jorge, Jorge Freire de Sousa
Transfer Learning Action Models by Measuring the Similarity of Different Domains

AI planning requires action models to be given in advance. However, it is both time-consuming and tedious for a human to encode action models by hand using a formal language such as PDDL; as a result, learning action models is important for AI planning. On the other hand, the data available for learning action models are often limited in planning domains, which makes the learning task very difficult. In this paper, we present a new algorithm to learn action models from plan traces by transferring useful information from other domains whose action models are already known. We present a method of building a metric to measure the shared information and transfer this information according to this metric: the larger the metric, the more information is transferred. Experimental results show that our proposed algorithm is effective.

Hankui Zhuo, Qiang Yang, Lei Li
On Optimal Rule Mining: A Framework and a Necessary and Sufficient Condition of Antimonotonicity

Many studies have shown the limits of the support/confidence framework used in Apriori-like algorithms to mine association rules. There are many efficient implementations based on the antimonotonicity property of the support, but candidate set generation is still costly. In addition, many rules are uninteresting or redundant, and one can miss interesting rules like nuggets. One solution is to get rid of frequent itemset mining and to focus as soon as possible on interesting rules. For that purpose, algorithmic properties were first studied, especially for the confidence: they allow all high-confidence rules to be found without preliminary support pruning. More recently, in the case of class association rules, the concept of optimal rules gave a pruning strategy compatible with more measures. However, all these properties have been demonstrated for a limited number of interestingness measures. We present a new formal framework which allows us to link the analytic and algorithmic properties of the measures. We apply this framework to optimal rules, and we demonstrate a necessary and sufficient condition of existence for this pruning strategy, which can be applied to any measure.

Yannick Le Bras, Philippe Lenca, Stéphane Lallich
Discovering Action Rules That Are Highly Achievable from Massive Data

In this paper, we propose a novel algorithm which discovers a set of action rules for converting negative examples into positive examples. Unlike conventional action rule discovery methods, our method AARUDIA (Achievable Action RUle DIscovery Algorithm) considers the effects of actions and the achievability of the class change for disk-resident data. In AARUDIA, the effects of actions are specified using domain rules and the achievability is inferred with Naive Bayes classifiers. AARUDIA adopts a new breadth-first search method which manages actionable literals and stable literals, and exploits the achievability to reduce the number of discovered rules. Experimental results with inflated real-world data sets are promising and demonstrate the practicality of AARUDIA.

Einoshin Suzuki
Extracting Fuzzy Rules for Detecting Ventricular Arrhythmias Based on NEWFM

In heart disease monitoring, an important problem in ECG arrhythmia analysis is to discriminate ventricular arrhythmias from normal cardiac rhythm. This paper presents a novel method based on the neural network with weighted fuzzy membership functions (NEWFM) for discriminating ventricular tachycardia (VT) and ventricular fibrillation (VF) from normal sinus rhythm (NSR). Two preprocessing steps, the Haar wavelet transform and a feature extraction method, are carried out in order; they generate six features, which are the input data of NEWFM. NEWFM classifies NSR and VT/VF beats by the trained bounded sum of weighted fuzzy membership functions (BSWFMs) using the six input features from the Creighton University Ventricular Tachyarrhythmia Data Base (CUDB). The results, with accuracy and specificity rates of 90.4% and 93.3% respectively, are better than those of Amann’s phase space reconstruction (PSR) algorithm.

Dong-Kun Shin, Sang-Hong Lee, Joon S. Lim
Trace Mining from Distributed Assembly Databases for Causal Analysis

Hierarchical structures of components often appear in industry, such as the components of cars. We focus on association mining from the hierarchically assembled data items that are characterized with identity labels such as lot numbers. Massive and physically distributed product databases make it difficult to directly find the associations of deep-level items. We propose a top-down algorithm using virtual lot numbers to mine association rules from the hierarchical databases. Virtual lot numbers delegate the identity information of the subcomponents to upper-level lot numbers without modifications to the databases. Our pruning method reduces the number of enumerated items and avoids redundant access to the databases. Experiments show that the algorithm works an order of magnitude faster than a naive approach.

Shohei Hido, Hirofumi Matsuzawa, Fumihiko Kitayama, Masayuki Numao
Let’s Tango – Finding the Right Couple for Feature-Opinion Association in Sentiment Analysis

One approach in opinion mining is to perform sentiment classification at the sentence level. A user’s view on a discovered product feature is predicted by the opinion words, e.g. adjectives, that appear in the same sentence. A number of previous works have been proposed, and these approaches typically treat all feature-word relations identically; blindly using the sentiments of all opinion words to perform classification can lead to false results. In this paper, we investigate the relationship between features and opinion words using a corpus-based approach. We propose a Feature-Opinion Association (FOA) algorithm to match the two in sentences and improve sentiment analysis results, and we construct a feature-based sentiment lexicon using the proposed algorithm in the sentiment identification process. Extensive experiments based on a commercial product review site show that our method is quite effective in obtaining more accurate results.

Kam Tong Chan, Irwin King
An Efficient Candidate Pruning Technique for High Utility Pattern Mining

High utility pattern mining extracts more useful and realistic knowledge from transaction databases than traditional frequent pattern mining by considering the non-binary frequency values of items in transactions and different profit values for each item. However, the existing high utility pattern mining algorithms suffer from the level-wise candidate generation-and-test problem and need several database scans to mine the actual high utility patterns. In this paper, we propose a novel tree-based candidate pruning technique, HUC-Prune (high utility candidates prune), to efficiently mine high utility patterns without level-wise candidate generation-and-test. It exploits a pattern growth mining approach and needs at most three database scans, in contrast to the several database scans of the existing algorithms. Extensive experimental results show that our technique is very efficient for high utility pattern mining and that it outperforms the existing algorithms.
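A standard building block in this line of work is the transaction-weighted utility (TWU), an antimonotone overestimate of utility that permits safe candidate pruning; the sketch below shows TWU only, under assumed data structures, and does not reproduce HUC-Prune's tree.

# Transaction-weighted utility of an itemset.
# transactions: list of {item: quantity}; profit: {item: unit profit}.
def twu(itemset, transactions, profit):
    total = 0
    for t in transactions:
        if set(itemset) <= set(t):
            # Add the utility of the whole transaction, not just the itemset:
            # TWU is antimonotone, so low-TWU supersets can be pruned safely.
            total += sum(q * profit[i] for i, q in t.items())
    return total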

Chowdhury Farhan Ahmed, Syed Khairuzzaman Tanbeer, Byeong-Soo Jeong, Young-Koo Lee
Grouped ECOC Conditional Random Fields for Prediction of Web User Behavior

Web page prefetching has been shown to reduce Web access latency, but it is highly dependent on the accuracy of the Web page prediction method. Conditional Random Fields (CRFs) with Error Correcting Output Coding (ECOC) have been shown to provide highly accurate and efficient Web page prediction on large websites. However, the limited class information provided to the binary-label sub-CRFs in ECOC-CRFs also leads to inferior accuracy compared to single multi-label CRFs. Although increasing the minimum Hamming distance of the ECOC matrix can help to improve the accuracy of ECOC-CRFs, it is still not an ideal method. In this paper, we introduce grouped ECOC-CRFs that allow us to obtain a prediction accuracy closer to that of single multi-label CRFs by grouping the binary ECOC vectors. We show in our experiments that the grouping method maintains the efficiency of ECOC-CRFs while providing a significant increase in Web page prediction accuracy over ECOC-CRFs.

Yong Zhen Guo, Kotagiri Ramamohanarao, Laurence A. F. Park
CLHQS: Hierarchical Query Suggestion by Mining Clickthrough Log

Most commercial search engines provide query suggestion in a ranked list for more effective search. However, a ranked list may not be an ideal way to satisfy users’ various information demands. In this paper, we propose a novel query suggestion method named CLHQS (Clickthrough-Log based Hierarchical Query Suggestion). It organizes the suggested queries into a well-structured hierarchy. Users can easily generalize, extend or specialize their queries within the hierarchy. The query hierarchy is mined from the clickthrough log data in the following way. First, we generate a candidate set through query-url graph analysis. Second, the pair-wise relationships are inspected for each pair of candidate queries. Finally, we construct the suggested query hierarchy using these relationships. Experiments on a real-world clickthrough log validate the effectiveness of our proposed CLHQS approach.

Depin Chen, Ning Liu, Zhijun Yin, Yang Tong, Jun Yan, Zheng Chen
X-Tracking the Changes of Web Navigation Patterns

In this paper, we first investigate the possible changes in web usage behaviour and then propose an x-tracking method to detect these changes. The changes in web navigation patterns are depicted at microscopic and macroscopic levels: the former covers the “internal” and “external” changes, which show variations in semantics and in external physical features respectively, while the latter models the changes of popularity on the “local” and “global” time lines. The x-tracking method we propose detects newly emerged patterns (EP) based on the “internal” feature, which is the premise for computing the changes on the other features by tracking their internal unchanged patterns (IUP). Experiments show that the x-tracked changes are condensed, efficient and informative.

Long Wang, Christoph Meinel
Tree-Based Method for Classifying Websites Using Extended Hidden Markov Models

An important problem proposed recently in the field of web mining is the website classification problem. Its complexity, together with the need for accurate and fast algorithms, has led to many attempts in this field, but efficient solutions are still a long way off. The importance of the problem encouraged us to work on a new approach as a solution. We use the content of web pages together with the link structure between them to improve the accuracy of the results. In this work we use Naïve Bayes models for each predefined webpage class, and an extended version of the Hidden Markov Model is used to model website classes. A few sample websites are adopted as seeds to calculate the models’ parameters. For classifying websites we represent them with tree structures, and we modify the Viterbi algorithm to evaluate the probability of each website model generating these tree structures. Because of the large number of pages in a website, we use a sampling technique that not only reduces the running time of the algorithm but also improves the accuracy of the classification process. At the end of this paper, we provide experimental results which show the performance of our algorithm compared to previous ones.

Majid Yazdani, Milad Eftekhar, Hassan Abolhassani
Emotion Recognition of Pop Music Based on Maximum Entropy with Priors

Efficient and intelligent music retrieval has become a very important topic nowadays, and analysis of lyrics is a natural complement to acoustic methods for music retrieval. One basic aspect of music retrieval is music emotion recognition learned from lyrics. This problem differs from traditional text classification in that more linguistic or semantic information is required for better emotion analysis. We therefore focus on how to extract meaningful features and how to model them for music emotion recognition. First, we investigate the lyrics corpus with respect to Zipf’s Law using the word as a unit, and the results roughly obey Zipf’s Law. Then, we study three kinds of preprocessing methods and a series of language grams under the well-known n-gram language model framework to extract more semantic features. Finally, we employ three supervised learning methods, Naïve Bayes, maximum entropy (ME) classification, and support vector machines, to examine classification performance, and we further improve ME with Gaussian and Laplace priors to model features for music emotion recognition. Experimental results show that the feature extraction methods improve music emotion recognition accuracy, and ME with priors obtains the best results.

Hui He, Bo Chen, Jun Guo
Simultaneously Finding Fundamental Articles and New Topics Using a Community Tracking Method

In this paper, we study the relationship between fundamental articles and new topics and present a new method to detect recently formed topics and their typical articles simultaneously. Based on community partitioning, the proposed method first identifies the emergence of a new theme by tracking the change of the community where the top-cited nodes lie. Next, papers with high citation numbers belonging to this new topic are recognized as fundamental articles. Experimental results on a real dataset show that our method can detect new topics with only a subset of the data in a timely manner, and the identified papers for these topics are found to have a long lifespan and to keep receiving citations.

Tieyun Qian, Jaideep Srivastava, Zhiyong Peng, Phillip C. Y. Sheu
Towards a Novel Association Measure via Web Search Results Mining

Web-based association measure aims to evaluate the semantic similarity between two queries (i.e. words or entities) by leveraging the search results returned by search engines. Existing web-relevance similarity measure usually considers all search results for a query as a coarse-grained single topic and measures the similarity between the term vectors constructed by concatenating all search results into a single document for each query. This paper proposes a novel association measure named WSRCM based on web search results clustering and matching to evaluate the semantic similarity between two queries at a fine-grained level. WSRCM first discovers the subtopics in the search results for each query and then measures the consistency between the sets of subtopics for two queries. Each subtopic for a query is expected to describe a unique facet of the query, and two queries sharing more subtopics are deemed more semantically related. Experimental results demonstrate the encouraging performance of the proposed measure.

Xiaojun Wan, Jianguo Xiao
A New Local Distance-Based Outlier Detection Approach for Scattered Real-World Data

Detecting outliers which are grossly different from or inconsistent with the remaining dataset is a major challenge in real-world KDD applications. Existing outlier detection methods are ineffective on scattered real-world datasets due to implicit data patterns and parameter setting issues. We define a novel Local Distance-based Outlier Factor (LDOF) to measure the outlier-ness of objects in scattered datasets which addresses these issues. LDOF uses the relative location of an object to its neighbours to determine the degree to which the object deviates from its neighbourhood. We present theoretical bounds on LDOF’s false-detection probability. Experimentally, LDOF compares favorably to classical KNN and LOF based outlier detection. In particular it is less sensitive to parameter values.
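The factor itself is easy to state in code: for each object, divide its average distance to its k nearest neighbours by the average pairwise distance among those neighbours, with values well above 1 suggesting outliers. The brute-force sketch below follows the definition directly and ignores the paper's efficiency considerations.

# Local Distance-based Outlier Factor, computed naively for a matrix X.
import numpy as np

def ldof(X, k):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.empty(n)
    for i in range(n):
        nn = np.argsort(D[i])[1:k + 1]           # k nearest neighbours of i
        d_knn = D[i, nn].mean()                  # average kNN distance
        inner = D[np.ix_(nn, nn)]
        d_inner = inner.sum() / (k * (k - 1))    # average inner distance
        scores[i] = d_knn / d_inner
    return scores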

Ke Zhang, Marcus Hutter, Huidong Jin
Mining Outliers with Faster Cutoff Update and Space Utilization

It is desirable to find unusual data objects under Ramaswamy et al.’s distance-based outlier definition because only a metric distance function between two objects is required; it does not need any neighborhood distance threshold, as required by many existing algorithms based on the definition of Knorr and Ng. Bay and Schwabacher proposed an efficient algorithm, ORCA, which gives near linear time performance for this task. To further reduce the running time, we propose in this paper two algorithms, RC and RS, using the following two techniques respectively: (i) faster cutoff update, and (ii) space utilization after pruning. We tested RC, RS and RCS (a hybrid approach combining both RC and RS) on several large and high-dimensional real data sets with millions of objects. The experiments show that RCS is 1.4 to 2.3 times faster than ORCA, and that the improvement of RCS is relatively insensitive to increases in the data size.
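The cutoff-pruning idea underlying ORCA, which the proposed techniques refine, can be sketched as a nested loop: an object's running k-NN distance only shrinks as more points are scanned, so once it falls below the score of the current n-th best outlier the object can be abandoned. The following is a minimal illustration of that baseline, not the RC/RS variants themselves.

# Top-n distance-based outliers with cutoff pruning (ORCA-style baseline).
import heapq
import numpy as np

def top_outliers(X, k, n_out):
    top = []                    # min-heap of (score, index) for best outliers
    cutoff = 0.0
    for i, x in enumerate(X):
        knn = []                # max-heap (negated) of k smallest distances
        pruned = False
        for j, y in enumerate(X):
            if i == j:
                continue
            d = np.linalg.norm(x - y)
            if len(knn) < k:
                heapq.heappush(knn, -d)
            elif d < -knn[0]:
                heapq.heapreplace(knn, -d)
            if len(knn) == k and -knn[0] <= cutoff:
                pruned = True   # running kth-NN distance fell below cutoff
                break
        if not pruned:
            score = -knn[0]     # distance to the kth nearest neighbour
            heapq.heappush(top, (score, i))
            if len(top) > n_out:
                heapq.heappop(top)
            if len(top) == n_out:
                cutoff = top[0][0]
    return sorted(top, reverse=True)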

Chi-Cheong Szeto, Edward Hung
Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data

We propose an original outlier detection schema that detects outliers in varying subspaces of a high dimensional feature space. In particular, for each object in the data set, we explore the axis-parallel subspace spanned by its neighbors and determine how much the object deviates from the neighbors in this subspace. In our experiments, we show that our novel subspace outlier detection is superior to existing full-dimensional approaches and scales well to high dimensional databases.

Hans-Peter Kriegel, Peer Kröger, Erich Schubert, Arthur Zimek
K-Dominant Skyline Computation by Using Sort-Filtering Method

Skyline queries are useful in many applications such as multi-criteria decision making, data mining, and user preference queries. A skyline query returns a set of interesting data objects that are not dominated in all dimensions by any other objects. For a high-dimensional database, it sometimes returns too many data objects to analyze intensively. To reduce the number of returned objects and to find more important and meaningful objects, we consider the problem of k-dominant skyline queries. Given an n-dimensional database, an object p is said to k-dominate another object q if there are $k (\leq n)$ dimensions in which p is better than or equal to q. A k-dominant skyline object is an object that is not k-dominated by any other object. In contrast, conventional skyline objects are n-dominant objects. We propose an efficient method for computing k-dominant skyline queries. An intensive performance study using real and synthetic datasets demonstrates that our method is efficient and scalable.
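The definition translates directly into a brute-force test (with larger attribute values assumed better); the paper's sort-filtering method exists precisely to avoid this quadratic scan.

# k-dominance test and brute-force k-dominant skyline.
def k_dominates(p, q, k):
    # p k-dominates q: better-or-equal in at least k dimensions and
    # strictly better somewhere.
    better_or_equal = sum(a >= b for a, b in zip(p, q))
    strictly_better = any(a > b for a, b in zip(p, q))
    return better_or_equal >= k and strictly_better

def k_dominant_skyline(objects, k):
    return [p for p in objects
            if not any(k_dominates(q, p, k) for q in objects if q is not p)]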

Md. Anisuzzaman Siddique, Yasuhiko Morimoto
Effective Boosting of Naïve Bayesian Classifiers by Local Accuracy Estimation

This paper investigates an effective boosting method for naïve Bayesian classifiers. Existing work has shown that the boosted naïve Bayesian classifier is not as effective in error rate reduction as the boosted decision tree (or boosted decision stump). This phenomenon may be caused by a combination of factors. To solve the problem, the local accuracies of a naïve Bayesian base classifier should be used to replace the global accuracy (or global error rate) in traditional boosting methods. Based on this analysis, we propose an effective boosted naïve Bayesian method which uses a C4.5 decision tree as the local-accuracy evaluator for each base classifier. At each round, two classifiers are constructed: one is the naïve Bayesian base classifier and the other is the C4.5 evaluator. The estimated local accuracy plays an important role, not only in updating the weights of training examples but also in determining the vote weights of base classifiers. Experimental comparison shows that our method achieves a much lower average error rate over a set of domains than AdaBoost.M1 with naïve Bayesian classifiers.

Zhipeng Xie
COMUS: Ontological and Rule-Based Reasoning for Music Recommendation System

In this paper, we propose the Context-based Music Recommendation (COMUS) ontology for modeling a user’s musical preferences and context, and for supporting reasoning about the user’s desired emotion and preferences. COMUS provides an upper music ontology that captures concepts about the general properties of music, such as title, artist and genre, and provides extensibility for adding domain-specific ontologies, such as music features, moods and situations, in a hierarchical manner. COMUS is a music-dedicated ontology in OWL, constructed by incorporating domain-specific classes for music recommendation into the Music Ontology. Using this context ontology, logical reasoning rules can check the consistency of context information and infer high-level, implicit context from low-level, explicit information. As a novelty, our ontology can express detailed and complicated relations among music, moods and situations, enabling users to find appropriate music for an application. We present some of the experiments we performed as a case study for music recommendation.

Seungmin Rho, Seheon Song, Eenjun Hwang, Minkoo Kim
Spatial Weighting for Bag-of-Visual-Words and Its Application in Content-Based Image Retrieval

It is a challenging and important task to retrieve images from a large and highly varied image data set based on their visual contents. Problems such as how to fill the semantic gap between image features and the user have attracted a lot of attention from the research community. Recently, the ‘bag of visual words’ approach has exhibited very good performance in content-based image retrieval (CBIR). However, since this approach represents an image as an unordered collection of local descriptors which only use intensity information, the resulting model provides little insight about the spatial constitution and color information of the image. In this paper, we develop a novel image representation method which uses a Gaussian mixture model (GMM) to provide spatial weighting for visual words, and we apply this method to content-based image retrieval. Our approach is simple and more efficient compared with the order-less ‘bag of visual words’ approach. In our method, we first extract visual tokens from the image data set and cluster them into a lexicon of visual words. Then, we represent the spatial constitution of an image as a mixture of n Gaussians in the feature space and decompose the image into n regions. The spatial weighting scheme is achieved by weighting visual words according to the probability of each visual word belonging to each of the n regions in the image. The cosine similarity between spatially weighted visual word vectors is used as the distance measure between regions, while the image-level distance is obtained by averaging the pairwise distances between regions. We compare the performance of our method with the traditional ‘bag of visual words’ and ‘blobworld’ approaches under the same image retrieval scenario. Experimental results demonstrate that our method is able to tell images apart at the semantic level and improves the performance of CBIR.

Xin Chen, Xiaohua Hu, Xiajiong Shen
Item Preference Parameters from Grouped Ranking Observations

Given a set of rating data for a set of items, determining the values of the items is a matter of importance. Various probability models have been proposed for this task. The Plackett-Luce model is one such model, which parametrizes the value of each item by a real-valued preference parameter. In this paper, the Plackett-Luce model is generalized to cope with grouped ranking observations such as movie or restaurant ratings. Since the maximization of the likelihood of the proposed model is computationally intractable, a lower bound of the likelihood which is easy to evaluate is derived, and the em algorithm is adopted to maximize the lower bound.
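For orientation, the log-likelihood of a single ranking under the basic Plackett-Luce model, on which the grouped extension builds, is straightforward to write down.

# Plackett-Luce log-likelihood of one ranking given preference parameters.
import numpy as np

def plackett_luce_loglik(ranking, theta):
    # ranking: item indices ordered from most to least preferred;
    # theta: positive preference parameter per item.
    ll = 0.0
    for pos, item in enumerate(ranking):
        rest = ranking[pos:]                     # items not yet placed
        ll += np.log(theta[item]) - np.log(sum(theta[r] for r in rest))
    return ll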

Hideitsu Hino, Yu Fujimoto, Noboru Murata
Cross-Channel Query Recommendation on Commercial Mobile Search Engine: Why, How and Empirical Evaluation

Mobile search not only inherits some features of traditional search on PCs, but also has many special characteristics of its own. In this paper, we first share some unique features of mobile search and discuss why vertical search is preferred. Providing multiple vertical searches proves convenient to users but causes some minor problems as well, which motivates us to propose cross-channel query recommendation. Secondly, we briefly introduce how to realize cross-channel recommendation effectively and efficiently online. Finally, we analyze the performance of the proposed method with three different but related metrics: expected effect, off-line evaluation and on-line evaluation. All three studies together indicate that the proposed cross-channel recommendation is quite useful. As the first study of query recommendation for mobile search, it is believed that the findings, proposed solution and collected feedback presented here will be beneficial to both researchers and industry while considering how to provide better mobile search services.

Shunkai Fu, Bingfeng Pi, Ying Zhou, Michel C. Desmarais, Weilei Wang, Song Han, Xunrong Rao
Data Mining for Intrusion Detection: From Outliers to True Intrusions

Data mining for intrusion detection can be divided into several sub-topics, among which unsupervised clustering has controversial properties. Unsupervised clustering for intrusion detection aims to i) group behaviours together depending on their similarity and ii) detect groups containing only one (or very few) behaviours. Such isolated behaviours are then considered as deviating from a model of normality and are therefore considered malicious. Obviously, not all atypical behaviours are attacks or intrusion attempts; herein lie the limits of unsupervised clustering for intrusion detection. In this paper, we consider adding a new feature to such isolated behaviours before they are considered malicious. This feature is based on their possible repetition from one information system to another.

Goverdhan Singh, Florent Masseglia, Céline Fiot, Alice Marascu, Pascal Poncelet
A Multi-resolution Approach for Atypical Behaviour Mining

Atypical behaviours are the basis of valuable knowledge in domains related to security (e.g. fraud detection for credit cards [1], cyber security [4] or safety of critical systems [6]). Atypicity generally depends on the isolation level of a (set of) records, compared to the dataset. One possible method for finding atypical records performs two steps. The first step is clustering (grouping the records by similarity), and the second step is the identification of clusters that do not correspond to a satisfying number of records. The main problem is to adjust the method and find the right level of atypicity. This issue is even more important in the domain of data streams, where a decision has to be taken in a very short time and the end-user does not want to try several settings. In this paper, we propose Mrab, a self-adjusting approach intended to automatically discover atypical behaviours (in the results of a clustering algorithm) without any parameter. We provide the formal framework of our method, and our proposal is tested through a set of experiments.

Alice Marascu, Florent Masseglia
Change Analysis in Spatial Data by Combining Contouring Algorithms with Supervised Density Functions

Detecting changes in spatial datasets is important for many fields. In this paper, we introduce a methodology for change analysis in spatial datasets that combines contouring algorithms with supervised density estimation techniques. The methodology allows users to define their own criteria for features of interest and to identify changes in those features between two datasets. Change analysis is performed by comparing interesting regions that have been derived using contour clustering. A novel clustering algorithm called DCONTOUR is introduced for this purpose that computes contour polygons that describe the boundary of a supervised density function at a given density threshold. Relationships between old and new data are analyzed relying on polygon operations. We evaluate our methodology in a case study that analyzes changes in earthquake patterns.

Chun Sheng Chen, Vadeerat Rinsurongkawong, Christoph F. Eick, Michael D. Twa
Centroid Neural Network with Spatial Constraints

A Centroid Neural Network with spatial constraints (CNN-S) is proposed in this paper. The spatial constraints are applied to the Centroid Neural Network (CNN) algorithm to reduce noise in Magnetic Resonance (MR) images. MR image segmentation is performed to illustrate an application of the proposed algorithm. The proposed algorithm incorporates a novel approach of using the weights of attributes to evaluate their roles. Experiments on MR images from the Internet Brain Segmentation Repository (IBSR) show that the proposed algorithm provides superior performance over other algorithms such as Fuzzy C-Means (FCM) and Fuzzy C-Means with spatial constraints (FCM-S).

Dong-Chul Park
Diversity in Combinations of Heterogeneous Classifiers

In this paper, we introduce the use of combinations of heterogeneous classifiers to achieve better diversity. Conducting theoretical and empirical analyses of the diversity of combinations of heterogeneous classifiers, we study the relationship between heterogeneity and diversity. On the one hand, the theoretical analysis serves as a foundation for employing heterogeneous classifiers in Multi-Classifier Systems or ensembles. On the other hand, experimental results provide empirical evidence. We consider synthetic as well as real data sets, utilize classification algorithms that are essentially different, and employ various popular diversity measures for evaluation. Two interesting observations will contribute to the future design of Multi-Classifier Systems and ensemble techniques. First, the diversity among heterogeneous classifiers is higher than that among homogeneous ones, and hence using heterogeneous classifiers to construct classifier combinations would increase the diversity. Second, the heterogeneity primarily results from different classification algorithms rather than the same algorithm with different parameters.

Kuo-Wei Hsu, Jaideep Srivastava
Growth Analysis of Neighbor Network for Evaluation of Damage Progress

We constructed two types of neighbor networks, i.e., TOP-k and k-MNN (Mutually Nearest Neighbor) networks, on observed AE (Acoustic Emission) data produced by damage in SOFC (Solid Oxide Fuel Cell). We then analyzed growth properties of the neighbor networks to evaluate damage progress. The results show that the power index of degree changes dynamically as the damage progress phase changes. We also found that the decrease of the clustering coefficient and the shrinking effective diameter in the k-MNN network reflect the occurrence of various combinations of damage.

Ken-ichi Fukui, Kazuhisa Sato, Junichiro Mizusaki, Kazumi Saito, Masahiro Kimura, Masayuki Numao
A Parallel Algorithm for Finding Related Pages in the Web by Using Segmented Link Structures

In this paper, a simple but powerful block co-citation algorithm is proposed to automatically find related pages for a given web page by using HTML segmentation technologies and parallel hyperlink structure analysis. First, all hyperlinks in a web page are segmented into several blocks according to the HTML structure and text style information. Second, for each page, the similarity between every two hyperlinks in the same block of the page is computed according to several kinds of information, and the total similarity from one page to another is obtained after all web pages are processed. For a given page u, the pages which have the highest total similarity to u are selected as the related pages of u. Finally, the block co-citation algorithm is implemented in parallel to analyze a corpus of 37,482,913 pages sampled from a commercial search engine, demonstrating its feasibility and efficiency.

Xiaoyan Shen, Junliang Chen, Xiangwu Meng, Yujie Zhang, Chuanchang Liu
Boosting Biomedical Information Retrieval Performance through Citation Graph: An Empirical Study

This paper presents an empirical study of the combination of content-based information retrieval results with linkage-based document importance scores to improve retrieval performance on TREC biomedical literature datasets. In our study, content-based information comes from the state-of-the-art probability-model-based Okapi information retrieval system, while linkage-based information comes from a citation graph generated from the REFERENCES sections of a biomedical literature dataset. Three well-known linkage-based ranking algorithms (PageRank, HITS and InDegree) are applied to the citation graph to calculate document importance scores. We use the TREC 2007 Genomics dataset, which contains 162,259 biomedical articles, for evaluation. Our approach achieves the best document-based MAP among all results that have been reported so far. Our major findings can be summarized as follows. First, even without hyperlinks, linkage information extracted from REFERENCES sections can be used to improve the effectiveness of biomedical information retrieval. Second, the performance of the integrated system is sensitive to the linkage-based ranking algorithm, and the simplest algorithm, InDegree, is the most suitable for biomedical literature retrieval.
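The InDegree score and its combination with retrieval scores are simple enough to sketch; the linear fusion and the weight alpha below are illustrative assumptions, not the paper's tuned configuration.

# InDegree importance on a citation graph plus linear score fusion.
from collections import Counter

def indegree_scores(citations):
    # citations: list of (citing_doc, cited_doc) pairs
    counts = Counter(cited for _, cited in citations)
    top = max(counts.values()) if counts else 1
    return {doc: c / top for doc, c in counts.items()}   # normalised scores

def fuse(okapi_scores, link_scores, alpha=0.8):
    # Weighted sum of retrieval score and link-based importance.
    return {d: alpha * s + (1 - alpha) * link_scores.get(d, 0.0)
            for d, s in okapi_scores.items()}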

Xiaoshi Yin, Xiangji Huang, Qinmin Hu, Zhoujun Li
Similarity-Based Feature Selection for Learning from Examples with Continuous Values

In many real-world problems, such as machine learning and data mining, feature selection is often used to choose a small subset of features which is sufficient to predict the target labels well. In this paper, we propose a feature selection algorithm based on similarity and the extension matrix. The extension matrix is an important theory in learning from examples and was originally designed to deal with discrete feature values; here it is extended to cope with continuous values and used as the search strategy. The evaluation criterion for feature selection is based on the similarity between classes, which is obtained from the similarity between examples of different classes using the min-max learning rule. The algorithm is analysed theoretically and shown to achieve higher performance than two other classic general algorithms over several real-world benchmark data sets and facial image sets with different poses and expressions for gender classification.

Yun Li, Su-Jun Hu, Wen-Jie Yang, Guo-Zi Sun, Fang-Wu Yao, Geng Yang
Application-Independent Feature Construction from Noisy Samples

When training classifiers, the presence of noise can severely harm their performance. In this paper, we focus on “non-class” attribute noise and we consider how a frequent fault-tolerant (FFT) pattern mining task can be used to support noise-tolerant classification. Our method is based on an application-independent strategy for feature construction using the so-called δ-free patterns. Our experiments on noisy training data show accuracy improvement when using the computed features instead of the original ones.

Dominique Gay, Nazha Selmaoui, Jean-François Boulicaut
Estimating Optimal Feature Subsets Using Mutual Information Feature Selector and Rough Sets

Mutual Information (MI) is a good measure of the relevance between input and output features, and has been used for ranking features in several feature selection methods. These methods cannot estimate optimal feature subsets by themselves, but depend on user-defined performance criteria. In this paper, we propose to estimate optimal feature subsets by using rough sets to evaluate the candidate feature subsets received from the MI feature selector. The experiments show that our method can handle nonlinear problems, as well as situations where combinations of two or more features are dominant, while maintaining or improving classification accuracy.

Sombut Foitong, Pornthep Rojanavasu, Boonwat Attachoo, Ouen Pinngern
Speeding Up Similarity Search on a Large Time Series Dataset under Time Warping Distance

Time series data are a ubiquitous data type appearing in many domains such as statistics, finance, and multimedia. Similarity search and measurement on time series data differ from those on other data types since time series data have associations among adjacent dimensions; accordingly, the classic Euclidean distance metric is not an accurate similarity measure for time series. Dynamic Time Warping (DTW) has therefore become a better choice for similarity measurement on time series in various applications, regardless of its high computational cost. To speed up the calculation, many research works attempt to accelerate DTW using indexing methods, which always entail a tradeoff between indexing efficiency and I/O cost. In this paper, we propose a novel method to balance this tradeoff under indexed sequential access using the Sequentially Indexed Structure (SIS), an approach to time series indexing with low computational cost and small I/O overheads. Finally, we conduct experiments to demonstrate our superiority in speed over the best existing method.
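For completeness, the warping distance that SIS accelerates is the standard dynamic-programming DTW, sketched here with an optional Sakoe-Chiba band.

# Standard DTW between sequences a and b with an optional warping window.
import numpy as np

def dtw(a, b, window=None):
    n, m = len(a), len(b)
    w = max(window or max(n, m), abs(n - m))   # effective band width
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three admissible warping steps.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])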

Pongsakorn Ruengronghirunya, Vit Niennattrakul, Chotirat Ann Ratanamahatana
A Novel Fractal Representation for Dimensionality Reduction of Large Time Series Data

Recent research has attempted to speed up time series data mining tasks, focusing on dimensionality reduction, indexing, and lower bounding functions, among many others. For large time series data, current dimensionality reduction techniques cannot reduce the total dimensionality by a large margin without losing global characteristics. In this paper, we introduce a novel Fractal Representation which uses merely three real values to represent a whole time series data sequence. Moreover, our proposed representation can be efficiently used under Euclidean distance. We demonstrate the effectiveness and utility of our novel Fractal Representation on classification problems; our proposed method outperforms existing methods in terms of speed and accuracy. Our results reconfirm that this representation can effectively represent global characteristics of the data, especially for larger time series data.

Poat Sajjipanon, Chotirat Ann Ratanamahatana
Clustering Data Streams in Optimization and Geography Domains

In this paper, we formulate a dual clustering problem in spatial data streams. A spatial data stream consists of data points with attributes in the optimization and geography domains. We aim at partitioning these objects into disjoint clusters such that at each time window (1) objects in the same cluster satisfy the transitively r-connected relation in the optimization and geography domains, and (2) the number of clusters is as minimal as possible. We propose a Hierarchical-Based Clustering algorithm (HBC). Specifically, objects are represented as a graph structure, called RGraph, where each node represents an object and edges indicate their similarity relationships. In light of RGraph, algorithm HBC iteratively merges clusters. Experimental results show the performance of the algorithm.

Ling-Yin Wei, Wen-Chih Peng
CBDT: A Concept Based Approach to Data Stream Mining

Data stream mining presents unique challenges compared to traditional mining on a random sample drawn from a stationary statistical distribution. Data from real-world data streams are subject to concept drift due to changes that take place continuously in the underlying data generation mechanism. Concept drift complicates the process of mining data, as models that are learnt need to be updated continuously to reflect recent changes in the data while retaining relevant information learnt from the past. In this paper, we describe a Concept Based Decision Tree (CBDT) learner and compare it with the CVFDT algorithm, which uses a sliding time window. Our experimental results show that CBDT outperforms CVFDT in terms of both classification accuracy and memory consumption.

Stefan Hoeglinger, Russel Pears, Yun Sing Koh
Meaningful Subsequence Matching under Time Warping Distance for Data Stream

In the era of data explosion, research on mining data streams has become more and more active, particularly on improving the time and space complexity of similarity subsequence matching for data streams. Recently, the SPRING algorithm and its variants have been proposed to solve the subsequence matching problem under time warping distance. Unfortunately, these algorithms produce meaningless results since no normalization is taken into account before distance calculation. In this work, we propose a novel subsequence matching algorithm called MSM (Meaningful Subsequence Matching) which fully supports global constraints, uniform scaling, and normalization. Our MSM algorithm is both faster and more accurate than the existing algorithms by a very large margin.
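The normalization whose omission the authors criticize is ordinary z-normalization of each candidate subsequence before the warping distance is computed, e.g.:

# Z-normalize a subsequence before distance calculation.
import numpy as np

def z_normalize(seq):
    seq = np.asarray(seq, dtype=float)
    std = seq.std()
    # Guard against constant subsequences with zero variance.
    return (seq - seq.mean()) / std if std > 0 else seq - seq.mean()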

Vit Niennattrakul, Chotirat Ann Ratanamahatana
An Aggregate Ensemble for Mining Concept Drifting Data Streams with Noise

Recent years have witnessed a large body of research work on mining concept drifting data streams, where a primary assumption is that the up-to-date data chunk and the yet-to-come data chunk share identical distributions, so classifiers with good performance on the up-to-date chunk would also have a good prediction accuracy on the yet-to-come data chunk. This “stationary assumption”, however, does not capture the concept drifting reality in data streams. More recently, a “learnable assumption” has been proposed and allows the distribution of each data chunk to evolve randomly. Although this assumption is capable of describing the concept drifting in data streams, it is still inadequate to represent real-world data streams which usually suffer from noisy data as well as the drifting concepts. In this paper, we propose a Realistic Assumption which asserts that the difficulties of mining data streams are mainly caused by both concept drifting and noisy data chunks. Consequently, we present a new Aggregate Ensemble (AE) framework, which trains base classifiers using different learning algorithms on different data chunks. All the base classifiers are then combined to form a classifier ensemble through model averaging. Experimental results on synthetic and real-world data show that AE is superior to other ensemble methods under our new realistic assumption for noisy data streams.
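A minimal rendering of the aggregate-ensemble idea is to train several different learning algorithms on several recent chunks and average their class-probability outputs; the base learners chosen below are assumptions for illustration, and all chunks are assumed to contain every class so the probability columns align.

# Aggregate-ensemble sketch: multiple algorithms x multiple chunks,
# combined by model averaging.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

def aggregate_ensemble(chunks):
    # chunks: list of (X, y) data chunks from the stream
    models = []
    for X, y in chunks:
        for make in (DecisionTreeClassifier, GaussianNB,
                     lambda: LogisticRegression(max_iter=1000)):
            models.append(make().fit(X, y))
    def predict(X_new):
        # Average class probabilities over all base classifiers.
        probs = np.mean([m.predict_proba(X_new) for m in models], axis=0)
        return probs.argmax(axis=1)
    return predict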

Peng Zhang, Xingquan Zhu, Yong Shi, Xindong Wu
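The combination step is plain model averaging over a grid of (learning algorithm × data chunk) base models. A minimal sketch, assuming scikit-learn base learners and that every chunk contains the same class set so the probability columns align; the learner choices are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

def train_aggregate_ensemble(chunks):
    """Train several base learners on each data chunk.
    chunks: list of (X, y) pairs from the stream. Returns all base models."""
    factories = [lambda: DecisionTreeClassifier(max_depth=5),
                 lambda: GaussianNB(),
                 lambda: LogisticRegression(max_iter=1000)]
    models = []
    for X, y in chunks:
        for make in factories:
            models.append(make().fit(X, y))
    return models

def predict_by_averaging(models, X):
    """Model averaging: mean of the class-probability outputs.
    Assumes every model saw the same class set during training."""
    probs = np.mean([m.predict_proba(X) for m in models], axis=0)
    return probs.argmax(axis=1)
```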
On Pairwise Kernels: An Efficient Alternative and Generalization Analysis

Pairwise classification has many applications, including network prediction, entity resolution, and collaborative filtering. The pairwise kernel has been proposed for these purposes by several research groups independently and has become successful in various fields. In this paper, we propose an efficient alternative, which we call the Cartesian kernel. While the existing pairwise kernel (which we refer to as the Kronecker kernel) can be interpreted as the weighted adjacency matrix of the Kronecker product graph of two graphs, the Cartesian kernel can be interpreted as that of the Cartesian product graph, which is much sparser than the Kronecker product graph. Experimental results show that the Cartesian kernel is much faster than the existing pairwise kernel and, at the same time, competitive with it in predictive performance. We also discuss the generalization bounds of the two pairwise kernels using eigenvalue analysis of the kernel matrices.

Hisashi Kashima, Satoshi Oyama, Yoshihiro Yamanishi, Koji Tsuda
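The graph-product interpretation maps directly onto matrix operations: the Kronecker kernel matrix over all node pairs is K1 ⊗ K2, while the Cartesian kernel matrix is K1 ⊗ I + I ⊗ K2, which is where the sparsity advantage comes from. A small sketch using these standard product-graph identities (not the authors' code):

```python
import numpy as np

def kronecker_kernel(K1, K2):
    """Pairwise (Kronecker) kernel over node pairs:
    K((i, j), (k, l)) = K1[i, k] * K2[j, l]."""
    return np.kron(K1, K2)

def cartesian_kernel(K1, K2):
    """Cartesian kernel: K((i, j), (k, l)) = K1[i, k]*[j == l] + [i == k]*K2[j, l].
    Nonzero only when the two pairs share a node on one side, hence sparser."""
    n, m = K1.shape[0], K2.shape[0]
    return np.kron(K1, np.eye(m)) + np.kron(np.eye(n), K2)

K1 = np.array([[1.0, 0.5], [0.5, 1.0]])  # kernel between nodes of graph 1
K2 = np.array([[1.0, 0.2], [0.2, 1.0]])  # kernel between nodes of graph 2
print(np.count_nonzero(kronecker_kernel(K1, K2)))  # 16: dense
print(np.count_nonzero(cartesian_kernel(K1, K2)))  # 12: sparser
```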
A Family-Based Evolutional Approach for Kernel Tree Selection in SVMs

Finding a kernel mapping function is a key step towards constructing a high-performance SVM-based classifier. While some recent methods have exploited an evolutionary approach to construct a suitable multi-function kernel, most of them search randomly and diversely. In this paper, the concept of a family of identically structured kernel trees is proposed: genetic programming explores the structure space, while an evolution strategy investigates the parameter space of a given tree. To balance structure search and parameter search towards an optimal kernel, a trade-off strategy is introduced. Experiments on a number of UCI benchmark datasets and text classification datasets show that the proposed method finds better solutions than other search methods.

Ithipan Methasate, Thanaruk Theeramunkong
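To make the notion of a kernel-tree family concrete: all members of a family share one tree shape and differ only in its numeric parameters, which is what an evolution strategy tunes while genetic programming varies the shape. A hedged illustration with one arbitrary shape; the operators and parameter choices are assumptions, not the paper's trees. Sums and products of valid kernels with nonnegative weights remain valid kernels, so any such tree is itself a kernel:

```python
import numpy as np

def rbf(x, y, gamma):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def poly(x, y, degree, coef0=1.0):
    return (np.dot(x, y) + coef0) ** degree

def kernel_tree(x, y, params):
    """One fixed tree shape: (w1*RBF + w2*POLY) * RBF.
    Every family member shares this shape; only
    params = (w1, gamma1, w2, degree, gamma2) vary,
    so an evolution strategy can tune them while genetic
    programming proposes alternative shapes."""
    w1, g1, w2, d, g2 = params
    return (w1 * rbf(x, y, g1) + w2 * poly(x, y, d)) * rbf(x, y, g2)
```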
An Online Incremental Learning Vector Quantization

In this paper, we propose online incremental learning vector quantization (ILVQ) for supervised classification tasks. As a prototype-based classifier, ILVQ needs no prior knowledge of the number of prototypes in the network or their initial values, unlike most current prototype-based algorithms. It adopts a threshold-based insertion scheme to determine the number of prototypes needed for each class dynamically, according to the distribution of the training data. This insertion policy also ensures the fulfillment of the incremental learning goal, covering both between-class and within-class incremental learning. A technique for removing useless prototypes is used to eliminate noise in the input data. Unlike other LVQ-based methods, the learning result is not affected by the order in which input patterns arrive at ILVQ. Experimental results show that the proposed ILVQ can accommodate non-stationary data environments and provides good recognition performance and storage efficiency.

Ye Xu, Shen Furao, Osamu Hasegawa, Jinxi Zhao
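The core of a threshold-based insertion scheme: if no prototype of the true class lies within a threshold of the input, a new prototype is created; otherwise the nearest same-class prototype is attracted toward the input. A toy sketch of that rule alone; the fixed threshold, learning rate, and omission of prototype pruning are simplifications, not the full ILVQ:

```python
import numpy as np

class ThresholdLVQ:
    """Illustration of threshold-based prototype insertion (not full ILVQ:
    the removal of useless prototypes is omitted here)."""

    def __init__(self, threshold=1.0, lr=0.1):
        self.threshold = threshold
        self.lr = lr
        self.protos = []  # list of (vector, label)

    def partial_fit(self, x, y):
        x = np.asarray(x, dtype=float)
        same = [(np.linalg.norm(x - p), i)
                for i, (p, c) in enumerate(self.protos) if c == y]
        if not same or min(same)[0] > self.threshold:
            self.protos.append((x, y))              # insert a new prototype
        else:
            _, i = min(same)
            p, c = self.protos[i]
            self.protos[i] = (p + self.lr * (x - p), c)  # attract the winner

    def predict(self, x):
        x = np.asarray(x, dtype=float)
        _, i = min((np.linalg.norm(x - p), i)
                   for i, (p, _) in enumerate(self.protos))
        return self.protos[i][1]
```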
On Mining Rating Dependencies in Online Collaborative Rating Networks

The trend of social information processing sees e-commerce and social web applications increasingly relying on user-generated content, such as ratings, to determine the quality of objects and to generate recommendations for users. In a rating system, a set of reviewers assign to a set of objects different types of scores based on specific evaluation criteria. In this paper, we seek to determine, for each reviewer and for each object, the dependency between scores on any two given criteria. A reviewer is said to have high dependency between a pair of criteria when his or her rating scores on objects based on the two criteria exhibit strong correlation. Likewise, an object is said to have high dependency between a pair of criteria when the rating scores it receives on the two criteria exhibit strong correlation. Knowing reviewer dependency and object dependency is useful in various applications, including recommendation, customization, and score moderation. We propose a model, called Interrelated Dependency, which determines both types of dependency simultaneously, taking into account the interrelatedness between them. We verify the efficacy of this model through experiments on real-life data.

Hady W. Lauw, Ee-Peng Lim, Ke Wang
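As a baseline for intuition: treating each type of dependency in isolation reduces to a correlation between two score vectors. A minimal sketch of that naive starting point; the criteria names and data are invented, and the paper's model goes further by letting reviewer- and object-level dependencies inform each other:

```python
import numpy as np

def criterion_dependency(scores_a, scores_b):
    """Dependency between two criteria as the Pearson correlation of the
    scores given (by one reviewer, or received by one object) on criterion
    A and criterion B across the same set of ratings."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    return np.corrcoef(a, b)[0, 1]

# e.g. one reviewer's hotel scores on "cleanliness" vs "service":
print(criterion_dependency([5, 4, 2, 5], [5, 3, 2, 4]))  # strong positive
```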
Learning to Extract Relations for Relational Classification

Relational classifiers use relations between objects to predict class values. In some cases the relations are given explicitly. In other cases the dataset contains implicit relations, e.g., the relation is hidden inside noisy attribute values. To apply relational classifiers in such cases, the relations have to be extracted first. Manually extracting relations by a domain expert is an expensive and time-consuming task. In this paper we show how extracting relations from datasets with noisy attribute values can be learned. Our method, LRE, uses a regression model to learn and predict weighted binary relations. We show that LRE is able to extract both equivalence relations and non-constrained relations. We also show that relational classifiers using relations automatically extracted by LRE achieve classification quality comparable to that of classifiers using manually labeled relations.

Steffen Rendle, Christine Preisach, Lars Schmidt-Thieme
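One way to picture learning a weighted binary relation from a few labeled pairs: featurize each pair (here, absolute attribute differences, an illustrative choice) and fit a probabilistic classifier whose output probability serves as the relation weight. This is a hedged approximation of the idea, not the LRE method:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(xi, xj):
    """Features for a pair of objects with noisy attributes: absolute
    attribute differences (an illustrative, hand-picked choice)."""
    return np.abs(np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float))

def learn_relation(X, labeled_pairs):
    """Learn a weighted binary relation from labeled example pairs.
    labeled_pairs: list of (i, j, related) with related in {0, 1}.
    Returns a function scoring how strongly objects i and j are related."""
    F = np.array([pair_features(X[i], X[j]) for i, j, _ in labeled_pairs])
    y = np.array([r for _, _, r in labeled_pairs])
    model = LogisticRegression(max_iter=1000).fit(F, y)
    return lambda i, j: model.predict_proba(
        pair_features(X[i], X[j]).reshape(1, -1))[0, 1]
```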
Backmatter
Metadata
Title
Advances in Knowledge Discovery and Data Mining
Edited by
Thanaruk Theeramunkong
Boonserm Kijsirikul
Nick Cercone
Tu-Bao Ho
Copyright year
2009
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-01307-2
Print ISBN
978-3-642-01306-5
DOI
https://doi.org/10.1007/978-3-642-01307-2