
2011 | Book

Advances in Knowledge Discovery and Data Mining

15th Pacific-Asia Conference, PAKDD 2011, Shenzhen, China, May 24-27, 2011, Proceedings, Part I

Edited by: Joshua Zhexue Huang, Longbing Cao, Jaideep Srivastava

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

The two-volume set LNAI 6634 and 6635 constitutes the refereed proceedings of the 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2011, held in Shenzhen, China in May 2011. The total of 32 revised full papers and 58 revised short papers were carefully reviewed and selected from 331 submissions. The papers present new ideas, original research results, and practical development experiences from all KDD-related areas including data mining, machine learning, artificial intelligence and pattern recognition, data warehousing and databases, statistics, knowledge engineering, behavior sciences, visualization, and emerging areas such as social network analysis.

Table of contents

Frontmatter

Feature Extraction

An Instance Selection Algorithm Based on Reverse Nearest Neighbor

Data reduction extracts a subset from a dataset. Its advantages are a lower storage requirement and higher classification efficiency. Using the subset as training data can maintain classification accuracy; sometimes accuracy even improves because noise is eliminated. The key is how to choose representative samples while ignoring noise at the same time. Many instance selection algorithms are based on the nearest neighbor decision rule (NN). Some of these algorithms select samples using one of two strategies, incremental and decremental. The first type selects some instances as samples and iteratively adds to the sample set those instances that do not share the class label of their nearest sample. The second type removes instances that do not share the class label of the majority of their k nearest neighbors. In contrast, we propose an algorithm based on the Reverse Nearest Neighbor (RNN), called Reverse Nearest Neighbor Reduction (RNNR). RNNR selects samples that can represent other instances of the same class. In addition, RNNR does not need to iteratively scan the dataset, which saves considerable processing time. Experimental results show that RNNR achieves comparable accuracy while selecting fewer samples than its comparators.

Bi-Ru Dai, Shu-Ming Hsu
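
To make the selection principle concrete, here is a minimal sketch of reverse-nearest-neighbor-based instance selection: keep only the points that act as the nearest neighbor of at least one other point of the same class. This illustrates the RNN idea only, not the authors' RNNR algorithm; the function name and toy data are ours.

```python
import numpy as np

def rnn_select(X, y):
    # Pairwise squared distances, with self-distances masked out.
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)            # nearest neighbor of each instance
    keep = np.zeros(len(X), dtype=bool)
    for i, j in enumerate(nn):
        if y[i] == y[j]:             # j represents i within the same class,
            keep[j] = True           # i.e. i is a reverse nearest neighbor of j
    return np.flatnonzero(keep)

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0] > 0).astype(int)
print(len(rnn_select(X, y)), "of", len(X), "instances kept")
```

Note that selection here is a single pass over the nearest-neighbor table, mirroring the abstract's point that RNNR avoids iterative rescanning.
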
A Game Theoretic Approach for Feature Clustering and Its Application to Feature Selection

In this paper, we develop a game theoretic approach for clustering features in a learning problem. Feature clustering can serve as an important preprocessing step in many problems such as feature selection, dimensionality reduction, etc. In this approach, we view features as rational players of a coalitional game where they form coalitions (or clusters) among themselves in order to maximize their individual payoffs. We show how Nash Stable Partition (NSP), a well known concept in the coalitional game theory, provides a natural way of clustering features. Through this approach, one can obtain some desirable properties of the clusters by choosing appropriate payoff functions. For a small number of features, the NSP based clustering can be found by solving an integer linear program (ILP). However, for large number of features, the ILP based approach does not scale well and hence we propose a hierarchical approach. Interestingly, a key result that we prove on the equivalence between a k-size NSP of a coalitional game and minimum k-cut of an appropriately constructed graph comes in handy for large scale problems. In this paper, we use feature selection problem (in a classification setting) as a running example to illustrate our approach. We conduct experiments to illustrate the efficacy of our approach.

Dinesh Garg, Sellamanickam Sundararajan, Shirish Shevade
Feature Selection Strategy in Text Classification

Traditionally, the best number of features is determined by the so-called "rule of thumb", or by using a separate validation dataset. We can neither explain why these approaches lead to the best number nor rely on any formal feature selection model to obtain it. In this paper, we conduct an in-depth empirical analysis and argue that simply selecting the features with the highest scores may not be the best strategy. A highest-scores approach turns many documents into zero-length ones, so that they cannot contribute to the training process. Accordingly, we formulate the feature selection process as a dual-objective optimization problem and identify the best number of features for each document automatically. Extensive experiments are conducted to verify our claims. The encouraging results indicate that our proposed framework is effective.

Pui Cheong Gabriel Fung, Fred Morstatter, Huan Liu
Unsupervised Feature Weighting Based on Local Feature Relatedness

Feature weighting plays an important role in text clustering. Traditional feature weighting is determined by the syntactic relationship between a feature and a document (e.g. TF-IDF). In this paper, a semantically enriched feature weighting approach is proposed by introducing the semantic relationship between feature and document, implemented by taking account of local feature relatedness, that is, the relatedness between a feature and its contextual features within each individual document. Feature relatedness is measured by two methods: a document collection-based implicit relatedness measure and a Wikipedia link-based explicit relatedness measure. Experimental results on benchmark data sets show that the new feature weighting approach surpasses traditional syntactic feature weighting. Moreover, clustering quality can be further improved by linearly combining the syntactic and semantic factors. The new feature weighting approach is also compared with two existing feature relatedness-based approaches, which consider global feature relatedness (relatedness over the entire feature space) and inter-document feature relatedness (relatedness between different documents), respectively. In the experiments, the new approach outperforms these two related methods in clustering quality at a much lower computational cost.

Jiali Yun, Liping Jing, Jian Yu, Houkuan Huang
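
The abstract's final claim rests on linearly combining syntactic and semantic weights. A toy sketch of that combination step, assuming a per-feature semantic relatedness score is already available (the random matrix below is only a stand-in for the paper's collection-based or Wikipedia link-based measure):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "dogs and cats are pets"]
tfidf = TfidfVectorizer().fit_transform(docs).toarray()   # syntactic factor

# Stand-in for the semantic relatedness of each feature to its document
# context; zeroed for terms absent from the document.
rng = np.random.RandomState(0)
semantic = rng.rand(*tfidf.shape) * (tfidf > 0)

alpha = 0.7                           # mixing weight, tuned in practice
combined = alpha * tfidf + (1 - alpha) * semantic
```
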
An Effective Feature Selection Method for Text Categorization

Feature selection is an efficient strategy for reducing the dimensionality of data and removing noise in text categorization. However, most feature selection methods aim to remove non-informative features based on corpus statistics, which do not relate directly to classification accuracy. In this paper, we propose an effective feature selection method that directly targets the classification accuracy of kNN. Our experiments show that our method is better than the traditional methods, and that it is also beneficial to other classifiers, such as Support Vector Machines (SVM).

Xipeng Qiu, Jinlong Zhou, Xuanjing Huang

Machine Learning

A Subpath Kernel for Rooted Unordered Trees

Kernel methods are a promising approach to learning with tree-structured data, and various efficient tree kernels have been proposed to capture informative structures in trees. In this paper, we propose a new tree kernel function based on "subpath sets" to capture vertical structures in rooted unordered trees, since such structures are often used to encode hierarchical information in data. We also propose a simple and efficient algorithm for computing the kernel by extending the multikey quicksort algorithm used for sorting strings. The algorithm runs in O((|T1| + |T2|) log(|T1| + |T2|)) time on average, and its space complexity is O(|T1| + |T2|), where |T1| and |T2| are the numbers of nodes in the two trees T1 and T2. We apply the proposed kernel to two supervised classification tasks, XML classification in web mining and glycan classification in bioinformatics. The experimental results show that the predictive performance of the proposed kernel is competitive with that of the existing efficient tree kernel for unordered trees proposed by Vishwanathan et al. [1], and that it is also empirically faster than the existing kernel.

Daisuke Kimura, Tetsuji Kuboyama, Tetsuo Shibuya, Hisashi Kashima
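
For intuition, a brute-force version of a subpath-counting kernel is easy to write: enumerate every downward label path in each tree and count the overlap. This quadratic sketch only illustrates what is being counted; the paper's contribution is computing it in O((|T1| + |T2|) log(|T1| + |T2|)) average time via multikey quicksort, and the exact multiplicity convention used here is our assumption.

```python
from collections import Counter

def subpaths_from(tree, node):
    # All downward label paths starting at `node`; tree: id -> (label, children).
    label, children = tree[node]
    paths = [(label,)]
    for c in children:
        paths += [(label,) + p for p in subpaths_from(tree, c)]
    return paths

def subpath_kernel(t1, t2):
    # Number of common subpaths, counted with multiplicity.
    def bag(tree):
        cnt = Counter()
        for node in tree:                 # subpaths may start at any node
            cnt.update(subpaths_from(tree, node))
        return cnt
    b1, b2 = bag(t1), bag(t2)
    return sum(min(b1[p], b2[p]) for p in b1)

t1 = {0: ("a", [1, 2]), 1: ("b", []), 2: ("c", [])}
t2 = {0: ("a", [1]), 1: ("b", [2]), 2: ("c", [])}
print(subpath_kernel(t1, t2))   # 4 shared subpaths: (a), (a,b), (b), (c)
```
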
Classification Probabilistic PCA with Application in Domain Adaptation

Conventional dimensionality reduction algorithms such as principal component analysis (PCA) and non-negative matrix factorization (NMF) are unsupervised. Supervised probabilistic PCA (SPPCA) can utilize label information, but this information is usually treated as regression targets rather than as discrete nominal labels. We propose classification probabilistic PCA (CPPCA), an extension of probabilistic PCA. Unlike SPPCA, the label class information is turned into a class probability function by using a sigmoid function. As the posterior distribution of the latent variables is non-Gaussian, we use a Laplace approximation with Expectation Maximization (EM) to obtain the solution. The formulation is applied to a domain adaptation classification problem where the labeled training data and unlabeled test data come from different but related domains. Experimental results show that the proposed model achieves higher accuracy than conventional probabilistic PCA, SPPCA and its semi-supervised version. It performs comparably with popular dedicated algorithms for domain adaptation, namely structural correspondence learning (SCL) and its variants.

Victor Cheng, Chun-Hung Li
Probabilistic Matrix Factorization Leveraging Contexts for Unsupervised Relation Extraction

Clustering the semantic relations between entities extracted from a corpus is one of the main issues in unsupervised relation extraction (URE). Previous methods assume a huge corpus because they utilize entity pairs that appear frequently in the corpus. In this paper, we present a URE method that works well for a small corpus by using word sequences extracted as relations. The feature vectors of the word sequences are extremely sparse. To deal with the sparseness problem, we take two approaches: dimension reduction, and leveraging context across the whole corpus, including sentences from which no relations are extracted. The context in this case is captured by feature co-occurrences, which indicate the appearance of two features in a single sentence. The approaches are implemented by a probabilistic matrix factorization that jointly factorizes the matrix of feature vectors and the matrix of feature co-occurrences. Experimental results show that our method outperforms previously proposed methods.

Shingo Takamatsu, Issei Sato, Hiroshi Nakagawa
The Unsymmetrical-Style Co-training

Semi-supervised learning has attracted much attention over the past decade because it combines unlabeled data with labeled data to improve the learning capability of models. Co-training is a representative paradigm of semi-supervised learning. Typically, some co-training style algorithms, such as co-training and co-EM, learn two classifiers based on two views of the instance space, but they must satisfy the assumptions that these two views are sufficient and conditionally independent given the class labels. Other co-training style algorithms, such as multiple-learner, use two different underlying classifiers based on only a single view of the instance space; however, they cannot utilize the labeled data effectively and suffer from early convergence. After analyzing various co-training style algorithms, we have found that all of them have symmetrical framework structures that are tied to their constraints. In this paper, we propose a novel unsymmetrical-style method, which we call the unsymmetrical co-training algorithm. It combines the advantages of other co-training style algorithms and overcomes their disadvantages. Within our unsymmetrical structure, we apply two unsymmetrical classifiers, namely a self-training classifier and an EM classifier, and then train these two classifiers in an unsymmetrical way. The unsymmetrical co-training algorithm not only avoids the constraint of the conditional independence assumption, but also overcomes the flaws of early convergence and ineffective utilization of labeled data. We conduct experiments to compare the performance of these co-training style algorithms. The experimental results show that the unsymmetrical co-training algorithm outperforms the other co-training algorithms.

Bin Wang, Harry Zhang, Bruce Spencer, Yuanyuan Guo
Balance Support Vector Machines Locally Using the Structural Similarity Kernel

A structural similarity kernel is presented in this paper for SVM learning, especially for learning with imbalanced datasets. Kernels in SVM are usually pairwise, comparing the similarity of two examples using only their feature vectors. By building a neighborhood graph (kNN graph) over the training examples, we propose to utilize the similarity of the linking structures of two nodes as an additional similarity measure. The structural similarity measure is proven to form a positive definite kernel and is shown to be equivalent to a regularization term that encourages balanced weights in all local neighborhoods. Analogous to the unsupervised HITS algorithm, the structural similarity kernel turns hub scores into signed authority scores, and is particularly effective in dealing with imbalanced learning problems. Experimental results on several benchmark datasets show that structural similarity can help the linear and the histogram intersection kernel to match or surpass the performance of the RBF kernel in SVM learning, and can significantly improve imbalanced learning results.

Jianxin Wu
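
One simple way to see how linking structure can yield a positive definite kernel is a shared-neighbor similarity on the kNN graph. This is not the paper's hub/authority construction, just a hedged sketch of the general idea of comparing link structures:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def shared_neighbor_similarity(X, k=5):
    # Binary kNN adjacency: A[i, j] = 1 iff j is among i's k nearest neighbors.
    idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)[1][:, 1:]
    A = np.zeros((len(X), len(X)))
    A[np.arange(len(X))[:, None], idx] = 1.0
    # S = (A / sqrt(k)) (A / sqrt(k))^T is a Gram matrix, hence positive
    # semi-definite; S[i, j] is the fractional overlap of the two kNN lists.
    return (A @ A.T) / k

X = np.random.RandomState(0).randn(100, 10)
S = shared_neighbor_similarity(X)
# e.g. add to a base Gram matrix before training an SVM: K = K_base + 0.5 * S
```
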
Using Classifier-Based Nominal Imputation to Improve Machine Learning

Many learning algorithms perform poorly when the training data are incomplete. One standard approach involves first imputing the missing values, then giving the completed data to the learning algorithm. However, this is especially problematic when the features are nominal. This work presents "classifier-based nominal imputation" (CNI), an easy-to-implement and effective nominal imputation technique that views imputation as classification: it learns a classifier for each feature (mapping the other features of an instance to the predicted value of that feature), then uses that classifier to predict the feature's missing values. Our empirical results show that learners that preprocess their incomplete training data using CNI with support vector machine or decision tree learners have significantly higher predictive accuracy than learners that (1) do not use preprocessing, (2) use baseline imputation techniques, or (3) use the CNI preprocessor with other classification algorithms. This improvement is especially apparent when the base learner is instance-based. CNI is also found helpful for other base learners, such as naïve Bayes and decision trees, on incomplete nominal data.

Xiaoyuan Su, Russell Greiner, Taghi M. Khoshgoftaar, Amri Napolitano
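
A minimal sketch of the classifier-per-feature idea, with integer-coded nominal values and -1 marking a missing entry. The mode-filling of predictor columns is our simplification so each tree can train, not necessarily the paper's procedure:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cni_impute(X, missing=-1):
    X = X.copy()
    for f in range(X.shape[1]):
        obs = X[:, f] != missing
        if obs.all() or not obs.any():
            continue
        others = np.delete(X, f, axis=1).astype(float)
        for c in range(others.shape[1]):       # crude mode fill on predictor
            col = others[:, c]                 # columns that have their own gaps
            vals = col[col != missing]
            if len(vals):
                col[col == missing] = np.bincount(vals.astype(int)).argmax()
        clf = DecisionTreeClassifier(random_state=0)
        clf.fit(others[obs], X[obs, f])        # learn feature f from the rest
        X[~obs, f] = clf.predict(others[~obs]) # fill f's missing entries
    return X

X = np.array([[0, 1, -1], [0, 1, 2], [1, 0, 0], [1, -1, 0]])
print(cni_impute(X))
```
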
A Bayesian Framework for Learning Shared and Individual Subspaces from Multiple Data Sources

This paper presents a novel Bayesian formulation to exploit shared structures across multiple data sources, constructing foundations for effective mining and retrieval across disparate domains. We jointly analyze diverse data sources using a unifying piece of metadata (textual tags). We propose a method based on Bayesian Probabilistic Matrix Factorization (BPMF) which is able to explicitly model the partial knowledge common to the datasets using shared subspaces and the knowledge specific to each dataset using individual subspaces. For the proposed model, we derive an efficient algorithm for learning the joint factorization based on Gibbs sampling. The effectiveness of the model is demonstrated by social media retrieval tasks across single and multiple media. The proposed solution is applicable to a wider context, providing a formal framework suitable for exploiting individual as well as mutual knowledge present across heterogeneous data sources of many kinds.

Sunil Kumar Gupta, Dinh Phung, Brett Adams, Svetha Venkatesh
Are Tensor Decomposition Solutions Unique? On the Global Convergence of HOSVD and ParaFac Algorithms

Matrix factorizations and tensor decompositions are now widely used in machine learning and data mining. They decompose input matrix and tensor data into matrix factors by optimizing a least squares objective function with iterative updating algorithms, e.g., HOSVD (High Order Singular Value Decomposition) and ParaFac (Parallel Factors). One fundamental question about these algorithms remains unsolved: are the solutions they find globally optimal? Surprisingly, by combining theoretical analysis and experimental evidence, we provide a positive answer for HOSVD and a negative answer for ParaFac. Our discovery of this intrinsic property of HOSVD assures us that, in real-world applications, HOSVD provides repeatable and reliable results.

Dijun Luo, Chris Ding, Heng Huang
Improved Spectral Hashing

Nearest neighbor search is one of the most fundamental problems in machine learning, machine vision, clustering, information retrieval, etc. To handle datasets of a million or more records, efficient storage and retrieval techniques are needed. Binary codes are an efficient way to address both problems. Recently, the problem of finding good binary codes has been formulated and solved, resulting in a technique called spectral hashing [21]. In this work we analyze spectral hashing, its possible shortcomings, and their solutions. Experimental results are promising.

Sanparith Marukatat, Wasin Sinthupinyo

Clustering

High-Order Co-clustering Text Data on Semantics-Based Representation Model

The language modeling approach has been widely used to improve the performance of text mining in recent years because of its solid theoretical foundation and empirical effectiveness. In essence, this approach centers on estimating an accurate model by choosing appropriate language models as well as smoothing techniques. Semantic smoothing, which incorporates semantic and contextual information into the language models, is effective and potentially significant for improving the performance of text mining. In this paper, we propose a high-order structure to represent text data by incorporating background knowledge from Wikipedia. The proposed structure consists of three types of objects: terms, documents and concepts. Moreover, we combine, for the first time, a high-order co-clustering algorithm with the proposed model to simultaneously cluster documents, terms and concepts. Experimental results on benchmark data sets (20Newsgroups and Reuters-21578) show that our high-order co-clustering on the high-order structure outperforms the general co-clustering algorithm on bipartite text data, such as document-term, document-concept and document-(term+concept).

Liping Jing, Jiali Yun, Jian Yu, Joshua Huang
The Role of Hubness in Clustering High-Dimensional Data

High-dimensional data arise naturally in many domains, and have regularly presented a great challenge for traditional data-mining techniques, both in terms of effectiveness and efficiency. Clustering becomes difficult due to the increasing sparsity of such data, as well as the increasing difficulty in distinguishing distances between data points. In this paper we take a novel perspective on the problem of clustering high-dimensional data. Instead of attempting to avoid the curse of dimensionality by observing a lower-dimensional feature subspace, we embrace dimensionality by taking advantage of some inherently high-dimensional phenomena. More specifically, we show that hubness, i.e., the tendency of high-dimensional data to contain points (hubs) that frequently occur in k-nearest neighbor lists of other points, can be successfully exploited in clustering. We validate our hypothesis by proposing several hubness-based clustering algorithms and testing them on high-dimensional data. Experimental results demonstrate good performance of our algorithms in multiple settings, particularly in the presence of large quantities of noise.

Nenad Tomašev, Miloš Radovanović, Dunja Mladenić, Mirjana Ivanović
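
A hedged sketch of exploiting hubness: compute each point's k-occurrence count (how often it appears in other points' k-NN lists) and seed a clustering with the strongest hubs. The paper proposes several dedicated hubness-based algorithms; using hubs as k-means seeds is just one simple instance of the idea:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans

def hubness_scores(X, k=10):
    # N_k(x): number of k-NN lists of other points that contain x.
    idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)[1][:, 1:]
    return np.bincount(idx.ravel(), minlength=len(X))

X = np.random.RandomState(1).randn(500, 50)    # high-dimensional toy data
scores = hubness_scores(X)
seeds = X[np.argsort(scores)[-3:]]             # the three strongest hubs
labels = KMeans(n_clusters=3, init=seeds, n_init=1).fit_predict(X)
```
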
Spatial Entropy-Based Clustering for Mining Data with Spatial Correlation

Due to the inherent characteristics of spatial datasets, spatial clustering methods need to consider spatial attributes, non-spatial attributes and the spatial correlation among non-spatial attributes across space. However, most existing spatial clustering methods ignore spatial correlation, considering spatial and non-spatial attributes independently. In this paper, we first prove that spatial entropy is a monotonically decreasing function of non-spatial attribute similarity and spatial correlation. Then we propose a novel density-based spatial clustering method called SEClu, which applies spatial entropy to measuring non-spatial attribute similarity and spatial correlation during the clustering process. The experimental results from both synthetic data and a real application demonstrate that SEClu can effectively identify spatial clusters with spatially correlated patterns.

Baijie Wang, Xin Wang
Self-adjust Local Connectivity Analysis for Spectral Clustering

Spectral clustering has been applied in various applications, but some important issues remain unresolved; the two major ones are how to (1) specify the scale parameter in calculating the similarity between data objects, and (2) select proper eigenvectors to reduce data dimensionality. Though these topics have been studied extensively, existing methods do not work well in some complicated scenarios, which limits the wide deployment of spectral clustering. In this work, we revisit the above two problems and make three contributions to the field: 1) a unified framework is designed to study the impact of the scale parameter on similarity between data objects; this framework can easily accommodate various state-of-the-art spectral clustering methods in determining the scale parameter; 2) a novel approach based on local connectivity analysis is proposed to specify the scale parameter; 3) a new method is proposed for eigenvector selection. Compared with existing techniques, the proposed approach has a rigorous theoretical basis and is efficient from a practical perspective. Experimental results show the efficacy of our approach in clustering data under different scenarios.

Hui Wu, Guangzhi Qu, Xingquan Zhu
An Effective Density-Based Hierarchical Clustering Technique to Identify Coherent Patterns from Gene Expression Data

We present an effective tree-based clustering technique (GeneClusTree) for finding clusters over gene expression data. GeneClusTree attempts to find all the clusters over subspaces using a tree-based density approach, scanning the whole database in the minimum possible number of scans, and is free from the restrictions of using a normal proximity measure [1]. The effectiveness of GeneClusTree is established in terms of the well-known z-score measure and p-value over several real-life datasets. The p-value analysis shows that our technique is capable of detecting biologically relevant clusters from gene expression data.

Sauravjyoti Sarmah, Rosy Das Sarmah, Dhruba Kumar Bhattacharyya
Nonlinear Discriminative Embedding for Clustering via Spectral Regularization

In this paper, we propose a novel nonlinear discriminative dimensionality reduction method for clustering high-dimensional data. The proposed method first represents the desired low-dimensional nonlinear embedding as linear combinations of predefined smooth vectors with respect to the data manifold, which is characterized by a weighted graph. Then the optimal combination coefficients and the optimal cluster assignment matrix are computed by maximizing between-cluster scatter and minimizing within-cluster scatter simultaneously, as well as preserving the smoothness of the cluster assignment matrix with respect to the data manifold. We solve the optimization problem with an iterative algorithm, which is proved to be convergent. The contributions of the proposed method are twofold: 1) the obtained nonlinear embedding can recover the intrinsic manifold structure as well as enhance the cluster structure of the original data; 2) the cluster results can also be obtained within the dimensionality reduction procedure. Extensive experiments conducted on UCI data sets and real-world data sets have shown the effectiveness of the proposed method for both clustering and visualizing high-dimensional data sets.

Yubin Zhan, Jianping Yin
An Adaptive Fuzzy k-Nearest Neighbor Method Based on Parallel Particle Swarm Optimization for Bankruptcy Prediction

This study proposes an efficient non-parametric classifier for bankruptcy prediction using an adaptive fuzzy k-nearest neighbor (FKNN) method, where the neighborhood size k and the fuzzy strength parameter m are adaptively specified by particle swarm optimization (PSO). In addition to performing the parameter optimization for FKNN, PSO is also utilized to choose the most discriminative subset of features for prediction. Time-varying acceleration coefficients (TVAC) and inertia weight (TVIW) are employed to efficiently control the local and global search ability of PSO. Moreover, both the continuous and binary PSO are implemented in parallel on a multi-core platform. The resultant bankruptcy prediction model, named PTVPSO-FKNN, is compared with three classification methods on a real-world case. The obtained results clearly confirm the superiority of the developed model over the other three methods in terms of classification accuracy, Type I error, Type II error and AUC (area under the receiver operating characteristic (ROC) curve). It is also observed that PTVPSO-FKNN is a powerful feature selection tool which identifies a subset of the most discriminative features. Additionally, the proposed model gains a great deal of efficiency in terms of CPU time owing to the parallel implementation.

Hui-Ling Chen, Da-You Liu, Bo Yang, Jie Liu, Gang Wang, Su-Jing Wang
Semi-supervised Parameter-Free Divisive Hierarchical Clustering of Categorical Data

Semi-supervised clustering can yield considerable improvement over unsupervised clustering. Most existing semi-supervised clustering algorithms are non-hierarchical, derived from the k-means algorithm and designed for analyzing numeric data. Clustering categorical data is a challenging issue due to the lack of an inherently meaningful similarity measure, and semi-supervised clustering in the categorical domain remains untouched. In this paper, we propose a novel semi-supervised divisive hierarchical algorithm for categorical data. Our algorithm is parameter-free, fully automatic and effective in taking advantage of instance-level constraint background knowledge to improve the quality of the resultant dendrogram. Experiments on real-life data demonstrate the promising performance of our algorithm.

Tengke Xiong, Shengrui Wang, André Mayers, Ernest Monga

Classification

Identifying Hidden Contexts in Classification

In this study we investigate how to identify hidden contexts in data for classification tasks. Contexts are artifacts in the data that do not predict the class label directly. For instance, in a speech recognition task, speakers might have different accents, which do not directly discriminate between the spoken words. Identifying hidden contexts is treated as a data preprocessing task that can help to build more accurate classifiers, tailored to particular contexts, and give insight into the data structure. We present three techniques for identifying hidden contexts; they hide class label information from the input data and partition it using clustering techniques. We form a collection of performance measures to ensure that the resulting contexts are valid. We evaluate the performance of the proposed techniques on thirty real datasets and present a case study illustrating how the identified contexts can be used to build specialized, more accurate classifiers.

Indrė Žliobaitė
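
The basic recipe in the abstract (withhold the class label, partition with clustering, then specialize) fits in a few lines. This sketch uses k-means and per-context decision trees as stand-ins for the paper's three techniques:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Partition the inputs with the class label withheld; each cluster
# then plays the role of one hidden 'context'.
contexts = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Train one specialized classifier per identified context.
experts = {c: DecisionTreeClassifier(random_state=0)
               .fit(X[contexts == c], y[contexts == c])
           for c in np.unique(contexts)}
```
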
Cross-Lingual Sentiment Classification via Bi-view Non-negative Matrix Tri-Factorization

The sentiment classification problem has recently attracted researchers around the world, but most sentiment corpora are in English, which limits research progress on sentiment classification in other languages. Cross-lingual sentiment classification aims to use annotated sentiment corpora in one language (e.g. English) as training data to predict the sentiment polarity of data in another language (e.g. Chinese). In this paper, we design a bi-view non-negative matrix tri-factorization (BNMTF) model for the cross-lingual sentiment classification problem. We employ a machine translation service so that both training and test data have two representations, one in the source language and the other in the target language. Our BNMTF model is derived from non-negative matrix tri-factorization models in both languages in order to make more accurate predictions. It has three main advantages: (1) combining information from two views; (2) incorporating lexical knowledge and training document label knowledge; (3) adding information from test documents. Experimental results show the effectiveness of our BNMTF model, which outperforms other baseline approaches to cross-lingual sentiment classification.

Junfeng Pan, Gui-Rong Xue, Yong Yu, Yang Wang
A Sequential Dynamic Multi-class Model and Recursive Filtering by Variational Bayesian Methods

Adaptive classification evolving over time is an important learning task that arises in many applications. In this paper, a sequential dynamic multi-class model (SDMM) is proposed for representing the multi-class adaptive learning task, based on the polychotomous response model and dynamic logistic regression. Multiple state chains in the SDMM are coupled by the observable labels and feature vectors. Each state chain is modeled as a first-order Markov process with time-varying covariance parameters to characterize the non-stationary generating process of sequential labels. Augmented auxiliary variables are introduced to develop efficient inference procedures following the popular data augmentation strategy. Variational Bayesian methods are applied to estimate the dynamic state variables and augmented auxiliary variables recursively. From the results of the recursive filtering procedures using mean-field approximation forms, one-step-ahead predicted probabilities are calculated by marginalizing the state variables. Experimental results on both synthetic and real data show that the proposed model significantly outperforms non-sequential static methods for multi-class adaptive learning problems with missing labels. Encouraging results have also been obtained in comparison with well-known multi-class data stream algorithms.

Xiangyun Qing, Xingyu Wang
Random Ensemble Decision Trees for Learning Concept-Drifting Data Streams

Few online classification algorithms based on traditional inductive ensembling focus on handling concept-drifting data streams while performing well on noisy data. Motivated by this, an incremental algorithm based on random Ensemble Decision Trees for Concept-drifting data streams (EDTC) is proposed in this paper. Three variants of random feature selection are developed to implement the split-tests. To better track concept drifts in data streams with noisy data, an improved two-threshold-based drift detection mechanism is introduced. Extensive studies demonstrate that our algorithm performs very well compared to several known online algorithms based on single models and ensemble models. We conclude that multiple solutions are provided for learning from concept-drifting data streams with noise.

Peipei Li, Xindong Wu, Qianhui Liang, Xuegang Hu, Yuhong Zhang
Collaborative Data Cleaning for Sentiment Classification with Noisy Training Corpus

A labeled review corpus is a very valuable resource for the task of sentiment classification of product reviews. Fortunately, there are a large number of product reviews on the Web, and each review is associated with a tag assigned by users to indicate its polarity orientation. We can download such reviews with tags and use them as a training corpus for sentiment classification. However, users may assign polarity tags arbitrarily and inaccurately, and some tags are not appropriate, so the automatically constructed corpus contains many noisy instances that deteriorate classification performance. In this paper, we propose co-cleaning and tri-cleaning algorithms to collaboratively clean the corpus and thus improve sentiment classification performance. The proposed algorithms use multiple classifiers to iteratively select and remove the most confidently noisy instances from the corpus. Experimental results verify the effectiveness of our proposed algorithms; the tri-cleaning algorithm is the most effective and promising.

Xiaojun Wan
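
A minimal sketch of a collaborative cleaning loop: two different classifiers score the training corpus itself, and the instances whose given tags both models doubt most strongly are dropped each round. The thresholds, counts, and choice of classifiers are illustrative assumptions; the sketch assumes term-count features and labels coded 0..K-1 with every class present:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

def co_clean(X, y, rounds=3, per_round=20):
    keep = np.arange(len(y))
    for _ in range(rounds):
        suspicion = np.zeros(len(keep))
        for clf in (MultinomialNB(), LogisticRegression(max_iter=1000)):
            proba = clf.fit(X[keep], y[keep]).predict_proba(X[keep])
            # How strongly this model disbelieves the given polarity tag.
            suspicion += 1.0 - proba[np.arange(len(keep)), y[keep]]
        keep = np.delete(keep, np.argsort(suspicion)[-per_round:])
    return keep          # indices of the cleaned training corpus
```
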

Pattern Mining

Using Constraints to Generate and Explore Higher Order Discriminative Patterns

Discriminative pattern mining looks for association patterns that occur more frequently in one class than another and has important applications in many areas, including finding biomarkers in biomedical data. However, finding such patterns is challenging because higher-order combinations of variables may show high discrimination even when single variables or lower-order combinations show little or no discrimination. Thus, generating such patterns is important for evaluating discriminative pattern mining algorithms and for better understanding the nature of discriminative patterns. To that end, we describe how such patterns can be defined using mathematical constraints, which are then handed to widely available software that generates solutions for the resulting optimization problem. We present a basic formulation of the problem obtained from a straightforward translation of the desired pattern characteristics into mathematical constraints, and then show how the pattern generation problem can be reformulated as the selection of rows from a truth table. This formulation is more efficient and provides deeper insight into the process of creating higher-order patterns. It also makes it easy to define patterns beyond those based on the conjunctive logic used by traditional association and discriminative pattern analysis.

Michael Steinbach, Haoyu Yu, Gang Fang, Vipin Kumar
Mining Maximal Co-located Event Sets

A spatial co-location is a set of spatial events frequently observed together in nearby geographic space. A common framework for mining spatial association patterns employs a level-wise search method (like Apriori). However, Apriori-based algorithms do not scale well when discovering long co-location patterns in large or dense spatial neighborhoods and can be restricted to discovering only short patterns. To address this problem, we propose an algorithm for finding maximal co-located event sets, which concisely represent all co-location patterns. The proposed algorithm generates only the most promising candidates, traverses the pattern search space in a depth-first manner with an effective pruning scheme, and reduces expensive co-location instance search operations. Our experimental results show that the proposed algorithm is computationally effective when mining maximal co-locations.

Jin Soung Yoo, Mark Bow
Pattern Mining for a Two-Stage Information Filtering System

As information available over computer networks grows exponentially, searching for useful information becomes increasingly difficult. Accordingly, developing an effective information filtering mechanism is becoming very important to alleviate the problem of information overload. Information filtering systems often employ user profiles to represent users' information needs so as to determine the relevance of documents from an incoming data stream. This paper presents a novel two-stage information filtering model which combines the merits of term-based and pattern-based approaches to effectively filter a sheer volume of information. In particular, the first filtering stage is supported by a novel rough analysis model which efficiently removes a large number of irrelevant documents, thereby addressing the overload problem. The second filtering stage is empowered by a semantically rich pattern taxonomy mining model which effectively fetches incoming documents according to the specific information needs of a user, thereby addressing the mismatch problem. Experimental results based on the RCV1 corpus show that the proposed two-stage filtering model significantly outperforms both the term-based and pattern-based information filtering models.

Xujuan Zhou, Yuefeng Li, Peter Bruza, Yue Xu, Raymond Y. K. Lau
Efficiently Retrieving Longest Common Route Patterns of Moving Objects By Summarizing Turning Regions

The popularity of online location services provides opportunities to discover useful knowledge from the trajectories of moving objects. This paper addresses the problem of mining longest common route (LCR) patterns. As the trajectory of a moving object is generally represented by a sequence of discrete locations sampled at an interval, different trajectory instances along the same route may be denoted by different sequences of points (location, timestamp). Thus, the most challenging task in the mining process is to abstract trajectories by the right points. We propose a novel mining algorithm for LCR patterns based on turning regions (LCRTurning), which discovers a sequence of turning regions to abstract a trajectory and then maps the problem onto the traditional problem of mining longest common subsequences (LCS). The effectiveness of the LCRTurning algorithm is validated by an experimental study based on various sizes of simulated moving-object datasets.

Guangyan Huang, Yanchun Zhang, Jing He, Zhiming Ding
Automatic Assignment of Item Weights for Pattern Mining on Data Streams

Research in Weighted Association Rule Mining (WARM) has largely concentrated on mining traditional static transactional datasets. Whilst there have been a few attempts at researching WARM in a data stream environment, none has addressed the problem of assigning and adapting weights in the presence of concept drift, which often occurs in data streams. In this research we experiment with two methods of adapting weights: firstly, a simplistic method that recomputes the entire set of weights at fixed intervals, and secondly, a method that relies on a distance function that assesses the extent of change in the stream and only updates those items whose patterns of interaction have changed significantly. We show that the latter method maintains good accuracy whilst being several times faster than the former.

Yun Sing Koh, Russel Pears, Gillian Dobbie

Prediction

Predicting Private Company Exits Using Qualitative Data

Private companies backed by venture capitalists or private equity funds receive their funding in a series of rounds. Information about when each round occurred and which investors participated in each round has been compiled into different databases. Here we mine one such database to model how a private company will exit the VC/PE space. More specifically, we apply a random forest algorithm to each of nine sectors of private companies. Resampling is used to correct imbalanced class distributions. Our results show that a late-stage investor may be able to leverage purely qualitative knowledge of a company's first three rounds of funding to assess the probability that (1) the company will not go bankrupt and (2) the company will eventually make an exit of some kind (and no longer remain private). For both of these two-class classification problems, our models' out-of-sample success rate is 75% and the area under the ROC curve is 0.83, averaged across all sectors. Finally, we use the random forest classifier to rank the covariates by how predictive they are. The results indicate that the models can provide both predictive and explanatory power for business decisions.

Harish S. Bhat, Daniel Zaelit
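
A hedged sketch of the modeling pipeline described above: balance the two classes by naive oversampling, fit a random forest, and read off covariate rankings from feature importances. The features below are random stand-ins for the paper's encoded funding-round covariates, and duplicating minority rows is only one of several ways the resampling step could be realized:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def oversample(X, y, seed=0):
    # Duplicate random minority-class rows until the classes are balanced.
    rng = np.random.RandomState(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[counts.argmin()]
    extra = rng.choice(np.flatnonzero(y == minority),
                       size=counts.max() - counts.min())
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

rng = np.random.RandomState(0)
X = rng.randn(300, 8)                    # stand-in round covariates
y = (rng.rand(300) < 0.15).astype(int)   # imbalanced exit outcome
Xb, yb = oversample(X, y)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xb, yb)
print(np.argsort(rf.feature_importances_)[::-1][:3])  # top-ranked covariates
```
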
A Rule-Based Method for Customer Churn Prediction in Telecommunication Services

Rule-based classification methods, which provide an interpretation of the classification, are very useful in churn prediction. However, most rule-based methods are not able to provide the prediction probability, which is helpful for evaluating customers. This paper proposes a rule-induction-based classification algorithm, called CRL. CRL applies several heuristic methods to learn a set of rules, and then uses them to predict customers' potential behaviours. Experiments were carried out to evaluate the proposed method on four datasets from the University of California, Irvine (UCI) repository and one telecoms dataset. The experimental results show that CRL achieves high classification accuracy and outperforms existing rule-based methods in churn prediction.

Ying Huang, Bingquan Huang, M. -T. Kechadi

Text Mining

Adaptive and Effective Keyword Search for XML

Most existing methods for XML keyword search are based on the notion of Lowest Common Ancestor (LCA). However, as we show, the most fundamental flaw of those result models is that the search results are fixed once and for all and cannot be adjusted. To serve better results, we propose a novel and flexible result model which avoids these defects. Within our model, a scoring function is presented to judge the quality of each result. The metrics considered in evaluating results are weighted and can be updated as needed. Based on the result model, three heuristic algorithms are proposed. Moreover, a mechanism is employed to select the most suitable of these algorithms to generate better results. Extensive experiments show that our approach outperforms LCA-based ones with higher recall and precision.

Weidong Yang, Hao Zhu, Nan Li, Guansheng Zhu
Steering Time-Dependent Estimation of Posteriors with Hyperparameter Indexing in Bayesian Topic Models

This paper provides a new approach to topical trend analysis. Our aim is to improve the generalization power of latent Dirichlet allocation (LDA) by using document timestamps. Many previous works model topical trends by making latent topic distributions time-dependent. We propose a straightforward approach by preparing a different word multinomial distribution for each time point. Since this approach increases the number of parameters, overfitting becomes a critical issue. Our contribution to this issue is two-fold. First, we propose an effective way of defining Dirichlet priors over the word multinomials. Second, we propose a special scheduling of variational Bayesian (VB) inference. Comprehensive experiments with six datasets prove that our approach can improve LDA and also Topics over Time, a well-known variant of LDA, in terms of test data perplexity in the framework of VB inference.

Tomonari Masada, Atsuhiro Takasu, Yuichiro Shibata, Kiyoshi Oguri
Constrained LDA for Grouping Product Features in Opinion Mining

In opinion mining of product reviews, one often wants to produce a summary of opinions based on product features. However, people can express the same feature with different words and phrases. To produce an effective summary, these words and phrases, which are domain synonyms, need to be grouped under the same feature. Topic modeling is a suitable method for this task. However, instead of simply letting topic modeling find groupings freely, we believe it is possible to do better by giving it some pre-existing knowledge in the form of automatically extracted constraints. In this paper, we first extend a popular topic modeling method, Latent Dirichlet Allocation (LDA), with the ability to process large-scale constraints. Then, two novel methods are proposed to extract two types of constraints automatically. Finally, the resulting constrained-LDA and the extracted constraints are applied to group product features. Experiments show that constrained-LDA outperforms the original LDA and the latest mLSA by a large margin.

Zhongwu Zhai, Bing Liu, Hua Xu, Peifa Jia
Semantic Dependent Word Pairs Generative Model for Fine-Grained Product Feature Mining

In the field of opinion mining, the extraction of fine-grained product features is a challenging problem. Nouns are the most important features for representing product features. Generative models such as latent Dirichlet allocation (LDA) have been used for detecting keyword clusters in a document corpus. As adjectives often dominate review corpora, they are often excluded from the vocabulary in such generative models for opinion sentiment analysis. On the other hand, adjectives provide useful context for noun features, as they are often semantically related to the nouns. To take advantage of such semantic relations, a dependency tree is constructed to extract pairs of nouns and adjectives with semantic dependency relations. We propose a semantically dependent word pairs generative model for pairs of nouns and adjectives in each sentence. Product features and their corresponding adjectives are simultaneously clustered into distinct groups, which improves the accuracy of product features as well as providing clustered adjectives. Experimental results demonstrate the advantage of our models, with lower perplexity and average cluster entropy compared to baseline models based on LDA. Highly semantically cohesive, descriptive and discriminative fine-grained product features are obtained automatically.

Tian-Jie Zhan, Chun-Hung Li
Grammatical Dependency-Based Relations for Term Weighting in Text Classification

Term frequency and term co-occurrence are currently used to estimate term weightings in a document. However, these methods do not employ relations based on grammatical dependency among terms to measure dependencies between word features. In this paper, we propose a new approach that employs grammatical relations to estimate the weightings of terms in a text document, and we show how to apply the term weighting scheme to text classification. A graph model is used to encode the extracted relations. A graph centrality algorithm is then applied to calculate scores that represent the significance of the terms in the document context. Experiments performed on many corpora with an SVM classifier show that the proposed term weighting approach outperforms those based on term frequency and term co-occurrence.

Dat Huynh, Dat Tran, Wanli Ma, Dharmendra Sharma
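
A small sketch of the graph-based weighting step: treat dependency-related term pairs as edges and use a centrality score as the term weight. The dependency pairs below are hand-written stand-ins for parser output, and PageRank is one of several centrality choices such a scheme could use:

```python
import networkx as nx

# (head, dependent) pairs as a parser might emit for a short document.
dependencies = [
    ("sat", "cat"), ("sat", "mat"), ("cat", "black"),
    ("mat", "floor"), ("sat", "on"),
]
G = nx.Graph(dependencies)
weights = nx.pagerank(G)        # centrality as term significance
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(term, round(w, 3))
```
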
XML Documents Clustering Using a Tensor Space Model

The traditional Vector Space Model (VSM) is not able to represent both the structure and the content of XML documents. This paper introduces a novel method of representing XML documents in a Tensor Space Model (TSM) and then utilizing it for clustering. Empirical analysis shows that the proposed method is scalable for large-sized datasets; as well, the factorized matrices produced from the proposed method help to improve the quality of clusters through the enriched document representation of both structure and content information.

Sangeetha Kutty, Richi Nayak, Yuefeng Li
An Efficient Pre-processing Method to Identify Logical Components from PDF Documents

With the rapid growth of scientific documents in digital libraries, search demands for whole documents as well as specific components increase dramatically. Accurately detecting component boundaries is of vital importance to all further information extraction and applications. However, document component boundary detection (especially for tables, figures, and equations) is a challenging problem because there are no standardized formats and layouts across diverse documents.

This paper presents an efficient document preprocessing technique that improves component boundary detection performance by taking advantage of the nature of document lines. Our method simplifies the component boundary detection problem into a sparse-line analysis problem with much less noise. We define eight document line label types and apply machine learning techniques as well as a heuristic rule-based method to identify multiple document components. Combining different heuristic rules, we extract the multiple components in a batch by filtering out massive noise as early as possible. Our method focuses on an important untagged document format: PDF documents. The experimental results prove the effectiveness of the sparse-line analysis.

Ying Liu, Kun Bai, Liangcai Gao
Combining Proper Name-Coreference with Conditional Random Fields for Semi-supervised Named Entity Recognition in Vietnamese Text

Named entity recognition (NER) is the process of locating atomic elements in text and classifying them into predefined categories such as the names of persons, organizations and locations. Most existing NER systems are based on supervised learning. This method often requires a large amount of labelled training data, which is very time-consuming to build. To solve this problem, we introduce a semi-supervised learning method for recognizing named entities in Vietnamese text by combining proper name coreference and named-ambiguity heuristics with a powerful sequential learning model, Conditional Random Fields. Our approach inherits the idea of Liao and Veeramachaneni [6] and expands it by using proper name coreference. Starting by training the model on a small, manually annotated data set, the learning model extracts high-confidence named entities and finds low-confidence ones by using proper name coreference rules. The low-confidence named entities are put into the training set to learn new context features. The F-scores of the system for extracting "Person", "Location" and "Organization" entities are 83.36%, 69.53% and 65.71% when applying the heuristics proposed by Liao and Veeramachaneni, and 93.13%, 88.15% and 79.35%, respectively, when using our proposed heuristics. This shows that our method effectively increases system accuracy.

Rathany Chan Sam, Huong Thanh Le, Thuy Thanh Nguyen, Thien Huu Nguyen
Topic Analysis of Web User Behavior Using LDA Model on Proxy Logs

We propose a web user profiling and clustering framework based on LDA topic modeling, with an analogy to document analysis in which documents and words represent users and their actions. The main technical challenge addressed here is how to symbolize, as words, the web access actions monitored through a web proxy. We develop a hierarchical URL dictionary generated from the Yahoo! Directory and a cross-hierarchical matching method that provides automatic abstraction. We apply the proposed framework to 7500 students at Osaka University. The results include, for example, 24 topics such as "Technology Oriented", "Job Hunting", and "SNS-addict", reflecting the typical interest profiles of university students, while perplexity analysis is employed to confirm the optimality of the framework.

Hiroshi Fujimoto, Minoru Etoh, Akira Kinno, Yoshikazu Akinaga
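
The users-as-documents analogy is straightforward to prototype: each user's abstracted URL categories form one "document", and LDA yields per-user topic mixtures. The category strings below are invented stand-ins for the paper's Yahoo! Directory-based abstraction of raw proxy-log URLs:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

user_actions = {                   # user -> visited categories as 'words'
    "user1": "tech tech news programming sns",
    "user2": "jobs jobs resume interview news",
    "user3": "sns sns games video sns",
}
X = CountVectorizer().fit_transform(user_actions.values())
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))            # per-user topic proportions
```
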
SizeSpotSigs: An Effective Deduplicate Algorithm Considering the Size of Page Content

Detecting whether two Web pages are near replicas, in terms of their contents rather than their files, is of great importance in many web-information-based applications, and many deduplication algorithms have been proposed. Nevertheless, analysis and experiments show that existing algorithms usually do not work well for short Web pages, due to the relatively large portion of noisy information, such as ads and site templates, in the corresponding files. In this paper, we analyze the critical issues in deduplicating short Web pages and present an algorithm (AF_SpotSigs) that incorporates them, which works 15% better than the state-of-the-art method. We then propose an algorithm (SizeSpotSigs) that takes the size of page contents into account and can handle both short and long Web pages. The contributions of SizeSpotSigs are three-fold: 1) we provide an analysis of the relation between noise-content ratio and similarity, and propose two rules for making the methods work better; 2) based on the analysis, we propose three new features for Chinese that improve effectiveness for short Web pages; 3) we present an algorithm named SizeSpotSigs for near-duplicate detection that considers the size of the core content in a Web page. Experiments confirm that SizeSpotSigs works better than state-of-the-art approaches such as SpotSigs over a demonstrative mix of manually assessed near-duplicate news articles, which includes both short and long Web pages.

Xianling Mao, Xiaobing Liu, Nan Di, Xiaoming Li, Hongfei Yan
Knowledge Transfer across Multilingual Corpora via Latent Topics

This paper explores bridging the content of two different languages via latent topics. Specifically, we propose a unified probabilistic model to simultaneously model latent topics from bilingual corpora that discuss comparable content and use the topics as features in a cross-lingual, dictionary-less text categorization task. Experimental results on multilingual Wikipedia data show that the proposed topic model effectively discovers the topic information from the bilingual corpora, and the learned topics successfully transfer classification knowledge to other languages, for which no labeled training data are available.

Wim De Smet, Jie Tang, Marie-Francine Moens
Backmatter
Metadata
Title
Advances in Knowledge Discovery and Data Mining
Edited by
Joshua Zhexue Huang
Longbing Cao
Jaideep Srivastava
Copyright year
2011
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-20841-6
Print ISBN
978-3-642-20840-9
DOI
https://doi.org/10.1007/978-3-642-20841-6
