
2005 | Book

Advanced Data Mining and Applications

First International Conference, ADMA 2005, Wuhan, China, July 22-24, 2005. Proceedings

Editors: Xue Li, Shuliang Wang, Zhao Yang Dong

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


Table of Contents

Frontmatter

Keynote Papers

Decision Making with Uncertainty and Data Mining

Data mining is a newly developed and emerging area of computational intelligence that offers new theories, techniques, and tools for the analysis of large data sets. It is expected to offer more and more support to modern organizations, which face the serious challenge of how to make decisions from massively increased information so that they can better understand their markets, customers, suppliers, operations, and internal business processes. This paper discusses fuzzy decision-making using the Grey Related Analysis method. Fuzzy models are expected to better reflect decision-maker uncertainty, at some cost in accuracy relative to crisp models. Monte Carlo simulation, a data mining technique, is used to measure the impact of fuzzy models relative to crisp models. Fuzzy models were found to provide fits as good as crisp models in the data analyzed.

David L. Olson, Desheng Wu
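The fuzzy-versus-crisp comparison described in the abstract above can be pictured with a small Monte Carlo sketch. Everything here is hypothetical and not taken from the paper: a crisp model uses a hard cutoff, a fuzzy model uses a trapezoidal membership function, and repeated random trials measure how often each agrees with a noisy ground truth.

```python
import random

random.seed(0)

def crisp(x, cutoff=0.5):
    # Crisp model: hard threshold on the observed value.
    return 1 if x >= cutoff else 0

def fuzzy(x, lo=0.4, hi=0.6):
    # Fuzzy model: trapezoidal membership, 0 below lo, 1 above hi,
    # linear in between (parameters are illustrative).
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

def trial():
    # Ground truth is the cutoff applied to a noisy copy of x.
    x = random.random()
    true_label = 1 if x + random.gauss(0, 0.05) >= 0.5 else 0
    crisp_ok = crisp(x) == true_label
    fuzzy_ok = (fuzzy(x) >= 0.5) == (true_label == 1)
    return crisp_ok, fuzzy_ok

n = 10_000
results = [trial() for _ in range(n)]
crisp_acc = sum(c for c, _ in results) / n
fuzzy_acc = sum(f for _, f in results) / n
```

On this toy setup the two models fit the noisy data about equally well, which mirrors the abstract's finding that fuzzy models provided fits as good as crisp ones.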
Complex Networks and Networked Data Mining

With the development of science, technology, and human society, numerous and varied complex networks have emerged, such as the Internet, the World Wide Web, airline networks, large-scale electric power networks, the structure of a piece of Very Large-Scale Integration (VLSI), human social relationships, neural networks, and the spreading paths of infectious diseases. Even in the study of the semantics of human language, the relationships between synonyms can be represented and analyzed via complex networks. Most researchers widely apply the parameters of degree distribution, clustering coefficient, and average distance to efficiently analyze the uncertainty of complex networks.

Deyi Li, Guisheng Chen, Baohua Cao
In-Depth Data Mining and Its Application in Stock Market

Existing association rule mining algorithms are specifically designed to find strong patterns that have high predictive accuracy or correlation. Many useful patterns, for example, out-expectation patterns with low support, are pruned for efficiency by these mining algorithms. This talk introduces our ongoing research on developing novel theories, techniques, and methodologies for discovering hidden interactions within data, such as class-bridge rules and out-expectation patterns. These patterns are essentially different from traditional association rules, but are much more useful than traditional ones to applications such as cross-sales, trend prediction, detecting behavior changes, and recognizing rare but significant events. This delivers a paradigm shift from existing data mining techniques. In addition, a system applying these techniques to the stock market is briefly presented.

Chengqi Zhang, Shichao Zhang
Relevance of Counting in Data Mining Tasks

In many languages, the English word “computer” is literally translated as “the counting machine.” Counting is apparently the most elementary operation that a computer can do, and thus it should be trivial for a computer to count. This, however, is a misconception. The apparently simple operation of enumeration and counting is actually computationally hard. It is also one of the most important elementary operations for many data mining tasks. We show how crucial counting is for a variety of data mining applications and how this complex task can be achieved with acceptable efficiency.

Osmar R. Zaïane
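To see why counting is elementary yet expensive, consider exhaustive support counting for itemsets, a hypothetical sketch not taken from the talk: every candidate itemset requires a pass over the transactions, and the number of candidates grows combinatorially with itemset size.

```python
from itertools import combinations
from collections import Counter

# Toy transaction database (sets of items).
transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"},
                {"a", "c"}, {"a", "b", "c"}]

def support_counts(transactions, max_size):
    # Count the support of every itemset up to max_size items.
    # The candidate space is C(|items|, 1) + ... + C(|items|, max_size),
    # which is why "mere counting" becomes hard at scale.
    items = sorted(set().union(*transactions))
    counts = Counter()
    for k in range(1, max_size + 1):
        for cand in combinations(items, k):
            s = set(cand)
            counts[cand] = sum(1 for t in transactions if s <= t)
    return counts

counts = support_counts(transactions, 2)
# counts[("a", "b")] is 3: three transactions contain both a and b.
```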

Invited Papers

Term Graph Model for Text Classification

Most existing text classification methods (and text mining methods at large) are based on representing documents using the traditional vector space model. We argue that important information, such as the relationships among words, is thereby lost. We propose a term graph model to represent not only the content of a document but also the relationships among its keywords. We demonstrate that the new model enables us to define new similarity functions for classification, such as rank correlation based on PageRank-style algorithms. Our preliminary experiments show promising results for the new model.

Wei Wang, Diep Bich Do, Xuemin Lin
A Latent Usage Approach for Clustering Web Transaction and Building User Profile

Web transaction data between web visitors and web functionalities usually convey users’ task-oriented behavior patterns. Clustering web transactions may therefore capture such informative knowledge and, in turn, build user profiles associated with different navigational patterns. For advanced web applications such as web recommendation or personalization, this work is crucial for helping web users obtain their preferred information accurately. On the other hand, conventional web usage mining techniques for clustering web objects often perform clustering on usage data directly rather than taking the underlying semantic relationships among the web objects into account.

The Latent Semantic Analysis (LSA) model is a commonly used approach for capturing semantic associations among co-occurrence observations. In this paper, we propose an LSA-based approach for this purpose. We demonstrate the usability and scalability of the proposed approach through experiments on two real-world datasets. The experimental results validate the method’s effectiveness in comparison with some previous studies.

Yanchun Zhang, Guandong Xu, Xiaofang Zhou

Association Rules

Mining Quantitative Association Rules on Overlapped Intervals

Mining association rules is an important problem in data mining. Algorithms for mining boolean data have been well studied and documented, but they cannot deal with quantitative and categorical data directly. For quantitative attributes, the general idea is to partition the domain of a quantitative attribute into intervals and apply boolean algorithms to the intervals. However, there is a conflict between the minimum support problem and the minimum confidence problem, and existing partitioning methods cannot avoid this conflict. Moreover, we expect the intervals to be meaningful. Clustering in data mining is a discovery process that groups a set of data such that the intracluster similarity is maximized and the intercluster similarity is minimized. The discovered clusters are used to explain the characteristics of the data distribution. This paper proposes a novel method to find quantitative association rules by clustering the transactions of a database and projecting the clusters onto the domains of the quantitative attributes to form meaningful intervals, which may be overlapped. Experimental results show that our approach can efficiently find quantitative association rules, including important association rules which may be missed by previous algorithms.

Qiang Tong, Baoping Yan, Yuanchun Zhou
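The cluster-then-project idea in the abstract above can be sketched on a single quantitative attribute. This is a hypothetical illustration, not the paper's algorithm: a minimal 1-D k-means groups the attribute values, and each cluster's extent on the attribute domain becomes an interval (with richer data these intervals may overlap).

```python
def kmeans_1d(values, centers, iters=20):
    # Minimal 1-D k-means: assign each value to the nearest center,
    # then recompute centers as group means.
    for _ in range(iters):
        groups = {i: [] for i in range(len(centers))}
        for v in values:
            nearest = min(range(len(centers)), key=lambda c: abs(v - centers[c]))
            groups[nearest].append(v)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in groups.items()]
    return groups

# Hypothetical "age" attribute of a transaction table.
ages = [21, 22, 23, 25, 40, 42, 43, 45, 60, 62, 65]
clusters = kmeans_1d(ages, centers=[20.0, 40.0, 60.0])

# Project each cluster onto the attribute domain to get an interval.
intervals = [(min(g), max(g)) for g in clusters.values() if g]
```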
An Approach to Mining Local Causal Relationships from Databases

Mining association rules and correlation relationships has been studied in the data mining field for many years. However, the rules mined only indicate association relationships among variables in the system of interest. They do not specify the essential underlying mechanism of the system that describes causal relationships. In this paper, we present an approach for mining causal relationships among attributes and propose a potential application in the field of bioinformatics. Based on the theory of causal diagrams, we show the properties of our approach.

Yang Bo He, Zhi Geng, Xun Liang
Mining Least Relational Patterns from Multi Relational Tables

Existing association rule mining in relational tables focuses only on discovering relationships among frequent (large) data items in a database. However, association rules for significant rare items, which appear infrequently in a database but are highly related with other items, are yet to be discovered. In this paper, we propose an algorithm called the Extraction Least Pattern (ELP) algorithm, which uses a pair of predefined minimum support thresholds. Results from the implementation reveal that the algorithm is capable of mining rare items in multi-relational tables.

Siti Hairulnita Selamat, Mustafa Mat Deris, Rabiei Mamat, Zuriana Abu Bakar
Finding All Frequent Patterns Starting from the Closure

Efficient discovery of frequent patterns from large databases is an active research area in data mining with broad applications in industry and deep implications in many areas of data mining. Although many efficient frequent-pattern mining techniques have been developed in the last decade, most of them assume relatively small databases, leaving extremely large but realistic datasets out of reach. A practical and appealing direction is to mine for closed itemsets. These are subsets of all frequent patterns but good representatives since they eliminate what is known as redundant patterns. In this paper we introduce an algorithm to discover closed frequent patterns efficiently in extremely large datasets. Our implementation shows that our approach outperforms similar state-of-the-art algorithms when mining extremely large datasets by at least one order of magnitude in terms of both execution time and memory usage.

Mohammad El-Hajj, Osmar R. Zaïane
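The defining property of closed itemsets used in the abstract above can be checked by brute force on a toy database; this sketch is only an illustration of the definition, not the paper's efficient algorithm: an itemset is closed if no proper superset has the same support, and keeping only closed itemsets removes redundant frequent patterns.

```python
from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "c", "d"}, {"b", "c"}]
items = sorted(set().union(*transactions))

def support(itemset):
    # Number of transactions containing every item of the itemset.
    s = set(itemset)
    return sum(1 for t in transactions if s <= t)

def closed_frequent(minsup):
    # Enumerate all frequent itemsets, then keep those with no proper
    # superset of equal support (the closure condition).
    frequent = [frozenset(c)
                for k in range(1, len(items) + 1)
                for c in combinations(items, k)
                if support(c) >= minsup]
    return [f for f in frequent
            if not any(f < g and support(g) == support(f) for g in frequent)]

result = closed_frequent(minsup=2)
# {a} is frequent but not closed: {a, b} has the same support (3).
```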
Multiagent Association Rules Mining in Cooperative Learning Systems

Recently, multiagent systems and data mining have attracted considerable attention in the computer science community. This paper combines these two hot research areas to introduce the term multiagent association rule mining on a cooperative learning system, which investigates employing data mining in a cooperative multiagent system. Learning in a partially observable and dynamic multiagent environment still constitutes a difficult and major research problem that is worth further investigation. Reinforcement learning has been proposed as a strong method for learning in multiagent systems. So far, many researchers have proposed various methods to improve the learning ability in multiagent systems. However, reinforcement learning still has some drawbacks. One drawback is that it does not model other learning agents present in the domain as part of the state of the environment. Another drawback is that, even during learning, some state-action pairs are experienced much less than others. To handle these problems, we describe a new action selection model based on association rule mining. Experimental results obtained on a well-known pursuit domain show the applicability, robustness, and effectiveness of the proposed learning approach.

Reda Alhajj, Mehmet Kaya
VisAR : A New Technique for Visualizing Mined Association Rules

Many business organizations generate a huge amount of transaction data. Association rule mining is a powerful analysis tool for extracting useful meanings and associations from large databases, and many automated systems have been developed for mining association rules. However, most of these systems mine many association rules from large databases, and it is not easy for a user to extract meaningful rules. Visualization has become an important tool in the data mining process for extracting meaningful knowledge and information from large data sets. Though there are several techniques for visualizing mined association rules, most of them visualize the entire set of discovered association rules on a single screen. Such a dense display can overwhelm analysts and reduce their capability of interpretation. In this paper we present a novel technique called VisAR for visualizing mined association rules. VisAR consists of four major stages: managing association rules, filtering association rules of interest, visualizing selected association rules, and interacting with the visualization process. Our technique allows an analyst to view only a particular subset of association rules which contain selected items of interest. VisAR is able to display not only many-to-one but also many-to-many association rules. Moreover, our technique can overcome problems of screen clutter and occlusion.

Kesaraporn Techapichetvanich, Amitava Datta
An Efficient Algorithm for Mining Both Closed and Maximal Frequent Free Subtrees Using Canonical Forms

A large number of text files, including HTML documents and XML documents, can be organized as tree structures. One objective of data mining is to discover frequent patterns in them. In this paper, first, we introduce a canonical form of free tree, which is based on the breadth-first canonical string; secondly, we present some properties of closed frequent subtrees and maximal frequent subtrees as well as their relationships; thirdly, we study a pruning technique for frequent free subtrees and an improvement on the mining of non-closed frequent free subtrees; finally, we present an algorithm that mines all closed and maximal frequent free subtrees and prove the validity of this algorithm.

Ping Guo, Yang Zhou, Jun Zhuang, Ting Chen, Yan-Rong Kang

Classification

E-CIDIM: Ensemble of CIDIM Classifiers

An active research area in machine learning is the construction of multiple classifier systems to increase the learning accuracy of simple classifiers. In this paper we present E-CIDIM, a multiple classifier system designed to improve the performance of CIDIM, an algorithm that induces small and accurate decision trees. E-CIDIM keeps a maximum number of trees and induces new trees that may substitute for the old trees in the ensemble. The substitution process finishes when none of the new trees improves the accuracy of any of the trees in the ensemble after a pre-configured number of attempts. In this way, the accuracy obtained with a single instance of CIDIM can be improved. With reference to the accuracy of the generated ensembles, E-CIDIM competes well against bagging and boosting at statistically significant confidence levels, and it usually outperforms them in accuracy and in the average size of the trees in the ensemble.

Gonzalo Ramos-Jiménez, José del Campo-Ávila, Rafael Morales-Bueno
Partially Supervised Classification – Based on Weighted Unlabeled Samples Support Vector Machine

This paper addresses a new classification technique: partially supervised classification (PSC), which is used to identify a specific land-cover class of interest from a remotely sensed image by using only training samples belonging to a specifically selected class. This paper also presents and discusses a novel Support Vector Machine (SVM) algorithm for PSC. Its training set includes labeled samples belonging to the class of interest and unlabeled samples of all classes randomly selected from a remotely sensed image. Moreover, all unlabeled samples are assumed to be training samples of other classes, and each of them is assigned a weighting factor indicating the likelihood of this assumption; hence, the algorithm is called ‘Weighted Unlabeled Sample SVM’ (WUS-SVM). Experimental results with both simulated and real data sets indicate that the proposed PSC method is more robust than 1-SVM and has comparable accuracy to a standard SVM.

Zhigang Liu, Wenzhong Shi, Deren Li, Qianqing Qin
Mining Correlated Rules for Associative Classification

Associative classification is a well-known technique which uses association rules to predict the class label for a new data object. This model has recently been reported to achieve higher accuracy than traditional classification approaches. There are various strategies for good associative classification in its three main phases: rule generation, rule pruning, and classification. Based on a systematic study of these strategies, we propose a new framework named MCRAC, i.e., Mining Correlated Rules for Associative Classification. MCRAC integrates the advantages of the previously proposed effective strategies as well as the new strategies presented in this paper. An extensive performance study reveals that, thanks to these strategies and improvements, MCRAC outperforms other associative classification approaches in accuracy.

Jian Chen, Jian Yin, Jin Huang
A Comprehensively Sized Decision Tree Generation Method for Interactive Data Mining of Very Large Databases

For interactive data mining of very large databases, a method working with relatively small training data extracted from the target database by sampling is proposed, because generating decision trees for very large databases that contain many continuous data values takes a very long time, and the size of a decision tree tends to depend on the size of the training data. The method uses samples of confidence in proper size as the training data to generate comprehensible trees as well as to save time. For medium or small databases, direct use of the original data with some harsh pruning may be preferable, because such pruning generates trees of similar size with smaller error rates.

Hyontai Sug
Using Latent Class Models for Neighbors Selection in Collaborative Filtering

Collaborative filtering is becoming a popular technique for reducing information overload. However, most current collaborative filtering algorithms have three major limitations: accuracy, data sparsity, and scalability. In this paper, we propose a new collaborative filtering algorithm to solve the problem of data sparsity and improve prediction accuracy. If the number of items rated by a user is less than some threshold, the algorithm utilizes the output of latent class models for neighbor selection and then uses the neighborhood-based method to produce predictions for unrated items; otherwise it predicts the rating using the STIN1 method. Our experimental results show that our algorithm outperforms the conventional neighborhood-based method and the STIN1 method.

Xiaohua Sun, Fansheng Kong, Xiaobing Yang, Song Ye
A Polynomial Smooth Support Vector Machine for Classification

A new polynomial smooth method for solving the support vector machine (SVM) is presented in this paper. It is called the polynomial smooth support vector machine (PSSVM). The BFGS method and the Newton-Armijo method are applied to solve the PSSVM. Numerical experiments confirm that PSSVM is more effective than the standard SVM.

YuBo Yuan, TingZhu Huang
Reducts in Incomplete Decision Tables

Knowledge reduction is an important issue in data mining. This paper focuses on the problem of knowledge reduction in incomplete decision tables. Based on a concept of incomplete conditional entropy, a new reduct definition is presented for incomplete decision tables and its properties are analyzed. Compared with several existing reduct definitions, the new definition has a better explanation for knowledge uncertainty and is more convenient for application of the idea of approximate reduct in incomplete decision tables.

Renpu Li, Dao Huang
Learning k-Nearest Neighbor Naive Bayes for Ranking

Accurate probability-based ranking of instances is crucial in many real-world data mining applications. KNN (k-nearest neighbor) [1] has been intensively studied as an effective classification model for decades. However, its performance in ranking is unknown. In this paper, we conduct a systematic study on the ranking performance of KNN. First, we compare KNN and KNNDW (KNN with distance weighting) to decision trees and naive Bayes in ranking, measured by AUC (the area under the Receiver Operating Characteristics curve). Then, we propose to improve the ranking performance of KNN by combining KNN with naive Bayes. The idea is that a naive Bayes is learned using the k nearest neighbors of the test instance as the training data and used to classify the test instance. A critical problem in combining KNN with naive Bayes is the lack of training data when k is small. We propose to deal with it by using sampling to expand the training data: each of the k nearest neighbors is “cloned” and the clones are added to the training data. We call our new model instance cloning local naive Bayes (simply ICLNB). We conduct an extensive empirical comparison of the related algorithms in two groups in terms of AUC, using the 36 UCI datasets recommended by Weka [2]. In the first group, we compare ICLNB with other types of algorithms: C4.4 [3], naive Bayes, and NBTree [4]. In the second group, we compare ICLNB with KNN, KNNDW, and LWNB [5]. Our experimental results show that ICLNB outperforms all those algorithms significantly. From our study, we draw two conclusions. First, KNN-related algorithms perform well in ranking. Second, our new algorithm ICLNB performs best among the algorithms compared in this paper and could be used in applications in which an accurate ranking is desired.

Liangxiao Jiang, Harry Zhang, Jiang Su
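The cloning idea from the ICLNB abstract above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the data, the Hamming distance, the clone count, and the Laplace smoothing are all our assumptions. The k nearest neighbors of a test instance are duplicated to enlarge the local training set, and a naive Bayes is fit on the clones alone.

```python
import math
from collections import Counter

# Toy binary-feature training set: (features, class label).
train = [([1, 0, 1], "pos"), ([1, 1, 1], "pos"),
         ([0, 1, 0], "neg"), ([0, 0, 0], "neg"), ([1, 0, 0], "neg")]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def iclnb_predict(x, k=3, clones=3):
    # Take the k nearest neighbors, clone each of them, and train a
    # naive Bayes on the cloned local set only.
    neighbors = sorted(train, key=lambda inst: hamming(inst[0], x))[:k]
    local = neighbors * clones          # each neighbor appears `clones` times
    classes = Counter(c for _, c in local)

    def log_post(c):
        members = [f for f, cc in local if cc == c]
        lp = math.log(classes[c] / len(local))       # log prior
        for j in range(len(x)):
            match = sum(1 for f in members if f[j] == x[j])
            lp += math.log((match + 1) / (len(members) + 2))  # Laplace
        return lp

    return max(classes, key=log_post)

pred = iclnb_predict([1, 0, 1])
```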
One Dependence Augmented Naive Bayes

In real-world data mining applications, an accurate ranking is as important as an accurate classification. Naive Bayes has been widely used in data mining as a simple and effective classification and ranking algorithm. Since its conditional independence assumption is rarely true, numerous algorithms have been proposed to improve naive Bayes, for example, SBC [1] and TAN [2]. Indeed, experimental results show that SBC and TAN achieve a significant improvement in terms of classification accuracy. Unfortunately, however, our experiments also show that SBC and TAN perform even worse than naive Bayes in ranking measured by AUC [3,4] (the area under the Receiver Operating Characteristics curve). This fact raises the question of whether we can improve naive Bayes in both classification and ranking. In this paper, responding to this question, we present a new learning algorithm called One Dependence Augmented Naive Bayes (ODANB). Our motivation is to develop a new algorithm to improve naive Bayes’ performance not only in classification, measured by accuracy, but also in ranking, measured by AUC. We experimentally tested our algorithm using all 36 UCI datasets recommended by Weka [5], and compared it to naive Bayes, SBC, and TAN. The experimental results show that our algorithm outperforms all the other algorithms significantly in yielding accurate rankings, and at the same time outperforms all the other algorithms slightly in terms of classification accuracy.

Liangxiao Jiang, Harry Zhang, Zhihua Cai, Jiang Su

Clustering

A Genetic k-Modes Algorithm for Clustering Categorical Data

Many optimization-based clustering algorithms suffer from the possibility of stopping at locally optimal partitions of data sets. In this paper, we present a genetic k-Modes algorithm (GKMODE) that finds a globally optimal partition of a given categorical data set into a specified number of clusters. We introduce a k-Modes operator in place of the normal crossover operator. Our analysis shows that the clustering results produced by GKMODE are very accurate and that it performs much better than existing algorithms for clustering categorical data.

Guojun Gan, Zijiang Yang, Jianhong Wu
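The k-Modes building blocks that GKMODE relies on can be sketched briefly; the data below are hypothetical and this is only an illustration of the standard k-Modes ingredients, not the genetic operator itself: categorical records are compared by matching dissimilarity, and a cluster is represented by its mode, the most frequent value per attribute.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    # Number of attributes on which the two categorical records differ.
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    # Mode of a cluster: the most frequent category in each column.
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

cluster = [("red", "small", "round"),
           ("red", "large", "round"),
           ("blue", "small", "round")]
mode = cluster_mode(cluster)            # ("red", "small", "round")
d = matching_dissimilarity(("blue", "small", "square"), mode)  # 2
```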
A Fast Fuzzy Clustering Algorithm for Large-Scale Datasets

The transitive closure method is one of the most frequently used fuzzy clustering techniques. It has O(n³log₂n) time complexity and O(n²) space complexity for the matrix compositions used to build transitive closures. These drawbacks limit its application to large-scale databases. In this paper, we propose a fast fuzzy clustering algorithm that avoids matrix multiplications and give a principle by which the clustering results are obtained directly from the λ-cut of the fuzzy similarity relation on the objects. Moreover, it is unnecessary to compute and store the similarity matrix of the objects beforehand. The time complexity of the presented algorithm is at most O(n²) and the space complexity is O(1). Theoretical analysis and experiments demonstrate that although the new algorithm is equivalent to the transitive closure method, the former is more suitable for large-scale datasets because of its high computing efficiency.

Lukui Shi, Pilian He
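The equivalence claimed in the abstract above can be pictured with a small sketch; union-find is our illustrative realization, not necessarily the paper's algorithm: merging every pair of objects whose fuzzy similarity is at least λ yields the same clusters as taking the λ-cut of the transitive closure, with no matrix composition at all.

```python
def lambda_cut_clusters(n, similarities, lam):
    # Union-find over objects 0..n-1: merge pairs with similarity >= lam.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for (i, j), s in similarities.items():
        if s >= lam:
            parent[find(i)] = find(j)

    # Connected components are exactly the lambda-cut clusters.
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical fuzzy similarities between 5 objects.
sims = {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.3, (3, 4): 0.95}
clusters = lambda_cut_clusters(5, sims, lam=0.8)   # [[0, 1, 2], [3, 4]]
```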
Clustering with Noising Method

The minimum sum-of-squares clustering problem is a nonconvex program which possesses many locally optimal values, with the result that its solution often falls into these traps. In this article, a recent metaheuristic technique, the noising method, is introduced to explore the proper clustering of data sets under the minimum sum-of-squares criterion. Meanwhile, the K-means algorithm is integrated into the noising method as a local improvement operation to improve the performance of the clustering algorithm. Extensive computer simulations show that the proposed approach is feasible and effective.

Yongguo Liu, Yan Liu, Kefei Chen
Extracting the Representative Failure Executions via Clustering Analysis Based on Markov Profile Model

During the debugging of a program to be released, it is unnecessary and impractical for developers to check every failure execution. How to extract the typical ones from the vast set of failure executions is very important for reducing the debugging effort. In this paper, we first present a revised Markov model used to depict program behavior. Based on this model, the dissimilarity of two profile matrices is also defined. After separating the failure executions and non-failure executions into two different subsets, iterative partition clustering and a sampling strategy called priority-ranked n-per-cluster are employed to extract representative failure executions. Finally, with the assistance of our prototype CppTest, we performed experiments on five subject programs. The results show that the clustering and sampling techniques based on the revised Markov model are more effective at finding faults than Podgurski’s method.

Chengying Mao, Yansheng Lu
Improvement on the Approximation Bound for Fuzzy-Neural Networks Clustering Method with Gaussian Membership Function

A great deal of research has been devoted in recent years to designing Fuzzy-Neural Networks (FNNs) from input-output data, and some work has also been done to analyze the performance of these methods from a rigorous mathematical point of view. In this paper, a new approximation bound is established for the clustering method employed to design FNNs with the Gaussian membership function. It is an improvement on a previous result in which the related approximation bound was somewhat complex. The detailed formulas of the error bound between the nonlinear function to be approximated and the FNN system designed from the input-output data are derived.

Weimin Ma, Guoqing Chen
Optimal Fuzzy Modeling Based on Minimum Cluster Volume

This paper proposes a new fuzzy modeling method, which involves the Minimum Cluster Volume clustering algorithm. The cluster centers found are naturally taken as the centers of Gaussian membership functions, and the covariance matrix obtained from the clustering result is used to estimate the parameters σ of the Gaussian membership functions. The results of this method are compared in our simulations with published methods, which indicate that our method is powerful: it solves multi-dimensional problems more accurately, even with less complexity in the fuzzy model structure.

Can Yang, Jun Meng
An Efficient Clustering Approach for Large Document Collections

A vast amount of unstructured text data, such as scientific publications, commercial reports, and webpages, needs to be quickly categorized into different semantic groups to facilitate online information query. However, state-of-the-art clustering methods suffer from the huge size of document collections with high-dimensional text features. In this paper, we propose an efficient clustering algorithm for large document collections, which performs clustering in three stages: 1) using a permutation test, the informative topic words are identified so as to reduce feature dimension; 2) a small number of the most typical documents are selected to perform initial clustering; 3) the clustering is refined on all documents. The algorithm was tested on the 20 Newsgroups data, and experimental results showed that, compared with methods which cluster the corpus based on all document samples and full features directly, this approach reduced the time cost by an order of magnitude while slightly improving the clustering quality.

Bo Han, Lishan Kang, Huazhu Song
Clustering Categorical Data Using Coverage Density

In this paper, a new algorithm based on the idea of coverage density is proposed for clustering categorical data. It uses average coverage density as the global criterion function. Large sparse categorical databases can be clustered effectively by using this algorithm. Analysis of its time and space complexity shows that the algorithm uses less memory and time. Experiments on two real datasets are carried out to illustrate the performance of the proposed algorithm.

Hua Yan, Lei Zhang, Yi Zhang

Novel Algorithms

A New Support Vector Machine for Data Mining

This paper proposes a new support vector machine (SVM) with a robust loss function for data mining. Its dual optimization formulation is also constructed. A gradient-based algorithm is designed for fast and simple implementation of the new support vector machine. We also analyze the algorithm’s convergence condition and give a formula for selecting the learning step size. Numerical simulation results show that the new support vector machine performs significantly better than a standard support vector machine.

Haoran Zhang, Xiaodong Wang, Changjiang Zhang, Xiuling Xu
The Infinite Polynomial Kernel for Support Vector Machine

This paper develops an infinite polynomial kernel k_c for support vector machines. We also propose a mapping from the original data space into a high-dimensional feature space on which the inner product is defined by the infinite polynomial kernel k_c. Via this mapping, any two finite sets of data in the original space become linearly separable in the feature space. Numerical experiments indicate that the proposed infinite polynomial kernel possesses better properties and performance than existing finite polynomial kernels.

Degang Chen, Qiang He, Xizhao Wang
Routing Attribute Data Mining Based on Rough Set Theory

QOSPF (Quality of Service Open Shortest Path First) based on QoS routing has been recognized as a missing piece in the evolution of QoS-based service offerings in the Internet. A data mining method for QoS routing based on rough set theory is presented in this paper. The information system about the links is created from the subnet, and the rough set method can mine the best route from enormous irregular link QoS data and can classify the links with the link-status data. An instance applying the theory is presented, which verifies the feasibility of mining the best attribute set by rough set theory from a compatible data table.

Yanbing Liu, Hong Tang, Menghao Wang, Shixin Sun
A Novel Data Mining Method Based on Ant Colony Algorithm

Data mining has become of great importance owing to the ever-increasing amounts of data collected by large organizations. This paper proposes a data mining algorithm called Ant-Miner(I), which is based on an improvement of the Ant Colony System (ACS) algorithm. Experimental results show that Ant-Miner(I) has higher predictive accuracy and a much smaller rule list than the original Ant-Miner algorithm.

Weijin Jiang, Yusheng Xu, Yuhui Xu
Context-Sensitive Regression Analysis for Distributed Data

A precondition of existing ensemble-based distributed data mining techniques is the assumption that contributing data are identically and independently distributed. However, this assumption is not valid in many virtual organization contexts because contextual heterogeneity exists. Focusing on regression tasks, this paper proposes a context-based meta-learning technique for horizontally partitioned data with contextual heterogeneity. The predictive performance of our new approach and of state-of-the-art techniques is evaluated and compared on both simulated and real-world data sets.

Yan Xing, Michael G. Madden, Jim Duggan, Gerard J. Lyons
Customer Churn Prediction Using Improved One-Class Support Vector Machine

Customer churn prediction is an increasingly pressing issue in today’s ever-competitive commercial arena. Although there has been considerable research on churn prediction, the accuracy achieved, which is very important to business, is not yet high enough. Recently, Support Vector Machines (SVMs), based on statistical learning theory, have been gaining applications in data mining, machine learning, computer vision and pattern recognition because of their high accuracy and good generalization capability. However, there have been no reports on applying SVMs to customer churn prediction. Because churn data sets characteristically contain very few negative examples, we introduce an improved one-class SVM. We have tested our method on a wireless industry customer churn data set. Our method has been shown to perform very well compared with traditional methods: ANN, Decision Tree, and Naïve Bayes.

Yu Zhao, Bing Li, Xiu Li, Wenhuang Liu, Shouju Ren
The Application of Adaptive Partitioned Random Search in Feature Selection Problem

Feature selection is one of the most important and frequently used techniques in data preprocessing. It can improve the efficiency and effectiveness of data mining by reducing the dimensionality of the feature space and removing irrelevant and redundant information. Feature selection can be viewed as a global optimization problem of finding a minimum set of M relevant features that describes the dataset as well as the original N attributes do. In this paper, we apply the adaptive partitioned random search strategy in our feature selection algorithm. Under this search strategy, a partition structure and an evaluation function are proposed for the feature selection problem. The algorithm guarantees the global optimal solution in theory and avoids complete randomness in the search direction. The desirable properties of our algorithm are shown through theoretical analysis.

Xiaoyan Liu, Huaiqing Wang, Dongming Xu
Heuristic Scheduling of Concurrent Data Mining Queries

The execution cost of batched data mining queries can be reduced by integrating their I/O steps. Due to memory limitations, not all data mining queries in a batch can be executed together. In this paper we introduce a heuristic algorithm called CCFull, which suboptimally schedules the data mining queries into a number of execution phases. The algorithm is significantly faster than the optimal approach while providing very good accuracy.

Marek Wojciechowski, Maciej Zakrzewicz
Using Gap-Insensitive String Kernel to Detect Masquerading

Masquerade attacks may be among the most serious attacks in the computer security context. To avoid being detected, masqueraders sometimes intentionally insert common commands such as “ls” into their command sequences to conceal their actual purpose. This makes masquerade attacks difficult to detect. We refer to these command sequences mixed with confusable commands as gap-insensitive. To eliminate the effects of such insertions, we present a string kernel, called the gap-insensitive kernel, that disregards gaps in the command sequences, and use it to detect masquerade attacks. We test it and other kernels on a dataset of keyboard commands from a UNIX platform. We find that many users’ attacks against other users can be easily detected by our gap-insensitive kernel, which means that the command sequences of these attackers are gap-insensitive. The results reveal that the gap-insensitive kernel can determine gap-insensitivity in command sequences and efface the gaps in the sequences.

Chuanhuan Yin, Shengfeng Tian, Shaomin Mu
A New Method for Linear Ill-Posed Problems: Iteration Method by Rectifying Eigenvalue

In order to overcome the weaknesses of the Regularization Method for linear ill-posed problems, the authors suggest a new method named Iteration Method by Rectifying Eigenvalue (IMRE) in this paper. Firstly, rigorous theoretical proofs that IMRE achieves a convergent and unbiased solution are given. Then an effective method called the L-Curve method is introduced to determine the parameter α in IMRE. Thirdly, a computing program is designed. Finally, an example is given to demonstrate the advantages of IMRE using the above program.

Yugang Tian, Peijun Shi, Xinzhou Wang, Kun Qin

Text Mining

A Non-VSM kNN Algorithm for Text Classification

The text classification problem, the task of assigning natural language texts to predefined categories based on their content, has been widely studied. Traditional text classification uses the VSM (Vector Space Model), which views documents as vectors in high-dimensional spaces. In this paper, we propose a non-VSM kNN algorithm for text classification. Based on correlations between categories and features, the algorithm first gets k F-C tuples, the first k tuples in terms of correlation value, from an unlabeled document. Then the algorithm predicts the category of the unlabeled document via these tuples. We have evaluated the algorithm on two document collections and compared it against traditional kNN. Experimental results show that our algorithm outperforms traditional kNN in both efficiency and effectiveness.

Zhi-Hong Deng, Shi-Wei Tang
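As a point of reference for the traditional baseline the paper compares against, here is a minimal VSM-style kNN text classifier in Python. The toy documents, category labels and cosine-similarity choice are illustrative assumptions, not the paper's non-VSM method:

```python
import math
from collections import Counter

def cosine(a, b):
    # a, b: term-frequency dicts (the VSM representation of documents)
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query, labeled_docs, k=3):
    # labeled_docs: list of (term-frequency dict, category) pairs;
    # rank by similarity to the query and vote among the top k
    ranked = sorted(labeled_docs, key=lambda dc: cosine(query, dc[0]), reverse=True)
    votes = Counter(cat for _, cat in ranked[:k])
    return votes.most_common(1)[0][0]

docs = [
    (Counter("stock market trade market".split()), "finance"),
    (Counter("stock price fund".split()), "finance"),
    (Counter("match goal team goal".split()), "sport"),
]
print(knn_classify(Counter("market stock price".split()), docs, k=2))  # finance
```

The kNN cost here grows with the vocabulary size of every stored document, which is exactly the VSM overhead the paper's tuple-based variant aims to avoid.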
A Study on Text Clustering Algorithms Based on Frequent Term Sets

In this paper, a new text-clustering algorithm named Frequent Term Set-based Clustering (FTSC) is introduced. It uses frequent term sets to cluster texts. First, it extracts useful information from documents and inserts it into a database. Then, it uses the Apriori algorithm from association rule mining to efficiently discover the frequent term sets. Finally, it clusters the documents according to the frequent words in subsets of the frequent term sets. This algorithm can efficiently reduce the dimensionality of the text data for very large databases, thus improving the accuracy and speed of clustering. The results of clustering texts by the FTSC algorithm cannot reflect the overlap of text classes. Based on the FTSC algorithm, an improved algorithm, the Frequent Term Set-based Hierarchical Clustering algorithm (FTSHC), is given. This algorithm can determine the overlap of text classes by the overlap of the frequent word sets, and provides an understandable description of the discovered clusters by the frequent term sets. The FTSC, FTSHC and K-Means algorithms are evaluated quantitatively by experiments. The results of the experiments show that the FTSC and FTSHC algorithms are more efficient than the K-Means algorithm in clustering performance.

Xiangwei Liu, Pilian He
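The Apriori step described above can be illustrated with a minimal sketch that mines frequent term sets from documents represented as term sets. The toy corpus is an assumption, and this simplified candidate generation omits Apriori's subset-pruning optimization:

```python
from itertools import combinations

def frequent_term_sets(docs, min_support):
    # docs: list of sets of terms; returns {frozenset of terms: support count}
    frequent = {}
    items = {t for d in docs for t in d}
    level = [frozenset([t]) for t in items]   # 1-item candidates
    k = 1
    while level:
        counts = {c: sum(1 for d in docs if c <= d) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # generate (k+1)-item candidates as unions of frequent k-item sets
        keys = list(survivors)
        level = list({a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1})
        k += 1
    return frequent

docs = [{"data", "mining", "text"}, {"data", "mining"},
        {"text", "cluster"}, {"data", "mining", "cluster"}]
fts = frequent_term_sets(docs, min_support=2)
```

Clustering then groups documents by which frequent term sets they contain, which is where the dimensionality reduction claimed by FTSC comes from.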
An Improvement of Text Association Classification Using Rules Weights

Recently, categorization methods based on association rules have received much attention. In general, association classification has higher accuracy and better performance. However, classification accuracy drops rapidly when the distribution of feature words in the training set is uneven. Therefore, the text categorization algorithm Weighted Association Rules Categorization (WARC) is proposed in this paper. In this method, association rules are used to classify training samples, and rule intensity is defined according to the number of misclassified training samples. Each strong rule is multiplied by a factor less than 1 to reduce its weight, while each weak rule is multiplied by a factor greater than 1 to increase its weight. The results show that this method can remarkably improve the accuracy of association classification algorithms through the regulation of rule weights.

Xiao-Yun Chen, Yi Chen, Rong-Lu Li, Yun-Fa Hu
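The reweighting scheme described above can be sketched as follows. The threshold, damping and boosting factors are hypothetical values, since the abstract specifies only that strong rules get a factor below 1 and weak rules a factor above 1:

```python
def reweight_rules(rule_weights, errors, strong_threshold, damp=0.8, boost=1.2):
    # rule intensity is judged by the number of misclassified training samples:
    # strong rules (few errors) are damped, weak rules (many errors) are boosted
    new_weights = {}
    for rule, w in rule_weights.items():
        if errors.get(rule, 0) <= strong_threshold:
            new_weights[rule] = w * damp    # strong rule: factor < 1
        else:
            new_weights[rule] = w * boost   # weak rule: factor > 1
    return new_weights

weights = reweight_rules({"r1": 1.0, "r2": 1.0}, {"r1": 0, "r2": 5},
                         strong_threshold=2)
```

A classifier would then sum the adjusted weights of all rules that fire for a document and pick the category with the largest total.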
Word Segmentation and POS Tagging for Chinese Keyphrase Extraction

Keyphrases are essential for many text mining applications. In order to automatically extract keyphrases from Chinese text, an extraction system is proposed in this paper. To address a particular problem of Chinese information processing, a lexicon-based word segmentation approach is presented. For this purpose, a verb lexicon, a functional word lexicon and a stop word lexicon are constructed. A predefined keyphrase lexicon is applied to improve the performance of extraction. The approach uses a small Part-Of-Speech (POS) tagset to index phrases simply according to these lexicons. It is especially effective for identifying phrases formed from combinations of nouns, adjectives and verbs. Keyphrases are sifted by their weighted TF-IDF (Term Frequency-Inverse Document Frequency) values. New keyphrases are added to the keyphrase lexicon.

Xiaochun Huang, Jian Chen, Puliu Yan, Xin Luo
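The TF-IDF sifting mentioned above can be sketched in a few lines. The toy corpus is illustrative, and this plain version omits the paper's lexicon-based weighting:

```python
import math
from collections import Counter

def tfidf_scores(doc_terms, corpus):
    # doc_terms: list of candidate phrases in one document
    # corpus: list of term lists, one per document (includes doc_terms)
    n = len(corpus)
    tf = Counter(doc_terms)
    scores = {}
    for term, f in tf.items():
        df = sum(1 for d in corpus if term in d)   # document frequency
        scores[term] = (f / len(doc_terms)) * math.log(n / df)
    return scores

corpus = [["wavelet", "image", "image"],
          ["image", "retrieval"],
          ["wavelet", "transform"]]
scores = tfidf_scores(corpus[0], corpus)
```

Candidates are then ranked by score and the top few kept as keyphrases; the paper additionally boosts candidates found in its keyphrase lexicon.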
Learning User Profiles from Text in e-Commerce

Exploring digital collections to find information relevant to a user’s interests is a challenging task. Algorithms designed to solve this relevant information problem base their relevance computations on user profiles, in which representations of the users’ interests are maintained. This paper presents a new method, based on the classical Rocchio algorithm for text categorization, able to discover user preferences from the analysis of textual descriptions of items in online catalogues of e-commerce Web sites. Experiments have been carried out on a dataset of real users, and the results have been compared with those obtained using an Inductive Logic Programming (ILP) approach and a probabilistic one.

M. Degemmis, P. Lops, S. Ferilli, N. Di Mauro, T. M. A. Basile, G. Semeraro
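The classical Rocchio update underlying the method can be sketched as follows. The item descriptions and the beta/gamma coefficients are illustrative assumptions (0.75/0.25 are common defaults, not the paper's values):

```python
from collections import defaultdict

def rocchio_profile(liked, disliked, beta=0.75, gamma=0.25):
    # liked/disliked: lists of term-frequency dicts for catalogue item texts;
    # the profile is a weighted sum of liked-item terms minus disliked-item terms
    profile = defaultdict(float)
    for doc in liked:
        for t, w in doc.items():
            profile[t] += beta * w / len(liked)
    for doc in disliked:
        for t, w in doc.items():
            profile[t] -= gamma * w / len(disliked)
    return dict(profile)

profile = rocchio_profile([{"jazz": 2, "vinyl": 1}], [{"pop": 3}])
```

New catalogue items can then be scored against the profile (e.g. by dot product) to rank them by predicted user interest.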

Multimedia Mining

Data Mining Based on Objects in Video Flow with Dynamic Background

This paper presents a model, OMDB, for mining the region information of non-rigid foreground objects in video flow with dynamic background. The model constructs an RDM algorithm and optimizes the region-matching strategy using Q-learning to obtain better motion information for regions. Moreover, OMDB utilizes an NEA algorithm to gradually detect and merge foreground object regions, based on the characteristics that foreground and background differ in motion and that the regions of an object maintain their integrity while moving. Experimental results on extracting region information of foreground objects and tracking the objects are presented to demonstrate the efficacy of the proposed model.

Cheng Zeng, JiaHeng Cao, Ying Fang, Pei Du
An Approach to Compressed Image Retrieval Based on JPEG2000 Framework

As the latest effort by JPEG in the international standardization of still image compression, JPEG2000 contains a range of important functionalities superior to its earlier DCT-based versions. In the expectation that the compression standard will become an important digital format for many images and photographs, we present in this paper our recent work on image indexing and retrieval directly in the wavelet domain, which is suitable for JPEG2000 compressed image retrieval without full decompression. Our method extracts histogram features from the significant wavelet coefficients according to the EBCOT of JPEG2000 for compressed image retrieval. While our method gains the advantage of eliminating decompression, the experiments also show that its retrieval accuracy is better than that of existing counterparts.

Jianguo Tang, Wenyin Zhang, Chao Li
Target Segmentation and Feature Extraction for Undersea Image Based on Function Transformation

Because of the special nature of the undersea channel and the complexity of the undersea environment, many uncertain factors affect the quality of undersea images. Consequently, segmenting and identifying targets in undersea images is a difficult problem. In this paper, a novel target segmentation and feature extraction approach for undersea images based on function transformation is presented. The approach effectively overcomes the influence of the complex environment and uneven illumination. Experimental results demonstrate that the approach is valid for target segmentation and feature extraction on undersea hydrothermal vent images.

Fuyuan Peng, Yan Tian, Xi Yu, Guohua Xu, Qian Xia
ART in Image Reconstruction with Narrow Fan-Beam Based on Data Mining

Image reconstruction is one of the key technologies of industrial computed tomography. Algebraic methods have an irreplaceable advantage for data mining when the data is incomplete or the noise level is high. However, the use of algebraic methods has been highly limited by their low reconstruction speed. In this paper, a new iterative method, the algebraic reconstruction technique, is introduced to accelerate the iteration process and increase reconstruction speed. Moreover, algebraic reconstruction methods will be used more widely with the development of computer technology and the increase in computing speed. Experimental results clearly demonstrate that the algebraic reconstruction technique can efficiently improve the quality of reconstructed images when processing incomplete or noisy projection data based on data mining.

Zhong Qu, Junhao Wen, Dan Yang, Ling Xu, Yu Wu
Digits Speech Recognition Based on Geometrical Learning

We investigate the use of independent component analysis (ICA) for speech feature extraction in digit speech recognition systems. We observe that this may hold for recognition tasks based on geometrical learning with little training data. In contrast to image processing, phase information is not essential for digit speech recognition. We therefore propose a new scheme that shows how the phase sensitivity can be removed by using an analytical description of the ICA-adapted basis functions via the Hilbert transform. Furthermore, since the basis functions are not shift invariant, we extend the method to include a frequency-based ICA stage that removes redundant time-shift information. The digit speech recognition results show promising accuracy; experiments show that the method based on ICA and geometrical learning outperforms HMMs for different numbers of training samples.

Wenming Cao, Xiaoxia Pan, Shoujue Wang, Jing Hu
A Novel Information Hiding Technique for Remote Sensing Image

In this paper, we introduce an information hiding technique into the remote sensing area, extending its connotation so that the secret information is hidden in the original remote sensing image itself. We propose a practical information hiding technique and a novel wavelet information hiding algorithm able to adapt to the features of a remote sensing image. The technique is based on an embedding strategy using the Discrete Wavelet Transform and HVS (Human Visual System) characteristics. The algorithm is blind and has no influence on the applied value of a remote sensing image.

Xianmin Wang, Zequn Guan, Chenhan Wu
Content-Based News Video Mining

It is a challenging issue to analyze video content for video mining due to the difficulty of video representation. A hierarchical model of video representation is proposed, with a schema for content-based analysis of news video, in this paper. The research problem targeted in this paper is to mine a massive video database to retrieve a specific clip based on content defined by users, a task frequently encountered in entertainment and video editing. A novel solution to this problem is developed, in which the consecutive news video is segmented into shots, scenes and news items using multimodal features based on the hierarchical model. To summarize the content of the video, a video abstract is developed. The experimental evaluation demonstrates the effectiveness of the approaches discussed in this paper.

Junqing Yu, Yunfeng He, Shijun Li
Automatic Image Registration via Clustering and Convex Hull Vertices Matching

A coarse-to-fine automatic point-based image registration method is proposed in this paper. At the first stage, clustering is used to determine the scale parameter and the rotational parameter candidates between images. Convex hull vertices correlation is applied subsequently to determine the correct rotational parameter. With the coordinates of matched point pairs and the above parameters, the translational parameter and the coarse registration result can be determined. At the second stage, control point pairs, which determine parameters of mapping polynomial, are formed by iterative convex hull vertices matching. Thus the registration result is refined. Experiments indicate that this approach can automatically align images in different resolutions.

Xiangyu Yu, Hong Sun
Fingerprint Image Segmentation Based on Gaussian-Hermite Moments

An important step in automatic fingerprint recognition systems is the segmentation of fingerprint images. In this paper, we present an adaptive algorithm based on Gaussian-Hermite moments for non-uniform background removing in fingerprint image segmentation. Gaussian-Hermite moments can better separate image features based on different modes. We use Gaussian-Hermite moments of different orders to separate background and foreground of fingerprint image. Experimental results show that the use of Gaussian-Hermite moments makes a significant improvement for the segmentation of fingerprint images.

Lin Wang, Hongmin Suo, Mo Dai

Sequential Data Mining and Time Series Mining

HVSM: A New Sequential Pattern Mining Algorithm Using Bitmap Representation

Sequential pattern mining is an important data mining problem with broad applications. This paper presents a first-Horizontal-last-Vertical scanning database Sequential pattern Mining algorithm (HVSM), which represents the database as a vertical bitmap. The algorithm first extends itemsets horizontally and digs out all one-large-sequence itemsets. It then extends the sequences vertically and generates candidate large sequences by taking brother-nodes as child-nodes. Support is counted by recording the first TID mark (1st-TID). Experiments show that the HVSM algorithm finds frequent sequences faster than the SPAM algorithm when mining large transaction databases.

Shijie Song, Huaping Hu, Shiyao Jin
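The vertical-bitmap idea underlying HVSM can be illustrated at the itemset level. This is a simplification for illustration: the full algorithm additionally tracks sequence order via the first-TID marks, which this sketch omits:

```python
def vertical_bitmaps(transactions, items):
    # one bit per transaction; bit i is set if the item occurs in transaction i
    bitmaps = {}
    for item in items:
        bm = 0
        for i, t in enumerate(transactions):
            if item in t:
                bm |= 1 << i
        bitmaps[item] = bm
    return bitmaps

def support(bitmaps, itemset):
    # support of an itemset = popcount of the AND of its item bitmaps
    items = list(itemset)
    bm = bitmaps[items[0]]
    for item in items[1:]:
        bm &= bitmaps[item]
    return bin(bm).count("1")

tx = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "b", "c"}]
bm = vertical_bitmaps(tx, ["a", "b", "c"])
```

The attraction of the representation is that candidate support counting becomes a bitwise AND plus a popcount, rather than a scan of the database.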
HGA-COFFEE : Aligning Multiple Sequences by Hybrid Genetic Algorithm

For the multiple sequence alignment problem in molecular biological sequence analysis, a hybrid genetic algorithm and an associated software package called HGA-COFFEE are presented. The COFFEE function is used to measure individual fitness, and five novel genetic operators are designed: a selection operator, two crossover operators and two mutation operators. One of the mutation operators is designed based on the COFFEE consistency information, which improves the global search ability, and the other is realized by a dynamic programming method that improves individuals locally. Experimental results on the 144 benchmarks from BAliBASE show that the proposed algorithm is feasible, and for datasets in the twilight zone and those comprising N/C terminal extensions, HGA-COFFEE generates better alignments compared to the other methods studied in this paper.

Li-fang Liu, Hong-wei Huo, Bao-shu Wang
Independent Component Analysis for Clustering Multivariate Time Series Data

Independent Component Analysis (ICA) is a useful statistical method for separating mixed data sources into statistically independent patterns. In this paper, we apply ICA to transform multivariate time series data into independent components (ICs), and then propose a clustering algorithm called ICACLUS to group underlying data series according to the ICs found. This clustering algorithm can be used to identify stocks with similar stock price movement. The experiments show that this method is effective and efficient, which also outperforms other comparable clustering methods, such as K-means.

Edmond H. C. Wu, Philip L. H. Yu
Applying Fuzzy Neural Network to Intrusion Detection Based on Sequences of System Calls

Short sequences of system calls have been proven to be a good signature description for anomalous intrusion detection. The signature provides clear separation between different kinds of programs. This paper extends these works by applying fuzzy neural network (FNN) to solve the sharp boundary problem and decide whether a sequence is “normal” or “abnormal”. By using threat level of system calls to label the sequences the proposed FNN improves the accuracy of anomaly detection.

Guiling Zhang, Jizhou Sun

Web Mining

Design and Implementation of Web Mining System Based on Multi-agent

Challenges for website designers include providing correct and useful information to individual users with different backgrounds and interests, as well as increasing user satisfaction. Most existing Web search tools work only with individual users and do not help a user benefit from the previous search experience of others. In this paper, a collaborative Web mining system, the Collector Engine System, is presented: a multi-agent system designed to provide post-retrieval analysis and enable cross-user collaboration in Web search and mining. The system allows the user to annotate search sessions and share them with other users. The prototype system and its components are discussed and described, with particular attention to the design of the Web agent, whose knowledge discovery is based on a combination of Web usage mining and machine learning. The system model is established and realized with J2EE technology. The system’s application shows that subjects’ search performance improves, compared to individual search scenarios in which users have no access to previous searches, when they have access to a limited number of earlier search sessions done by other users.

Wenbin Hu, Bo Meng
A Novel Framework for Web Page Classification Using Two-Stage Neural Network

Web page classification is one of the essential techniques for Web mining. This paper presents a framework for Web page classification. It is a hybrid neural network architecture of PCA (principal component analysis) and SOFM (self-organizing feature map). To perform the classification, a Web page is first represented by a vector of features with different weights according to term frequency and the importance of each sentence in the page. As the number of features is large, PCA is used to select the relevant features. Finally, the output of PCA is sent to the SOFM for classification. To compare with the proposed framework, two conventional classifiers are used in our experiments: k-NN and Naïve Bayes. Our new method makes a significant improvement in classification on both data sets compared with the two conventional methods.

Yunfeng Li, Yukun Cao, Qingsheng Zhu, Zhengyu Zhu
Fuzzy Evaluation of Hotel Websites

Prior studies on hotel website performance have primarily concentrated on frequency counting, content analysis or user behavioral approaches. These studies, however, failed to offer any insight that can accurately evaluate hotel website quality based on users’ assessment of attribute weights and performance ratings. This research proposes a fuzzy multicriteria analysis model which systematically integrates hotel guests’ preferences and fuzzy assessments of website attributes. The fuzzy evaluation of linguistic values by respondents offers a comprehensive approach for handling incomplete and imprecise user preferences to capture the realistic evaluation process. The research output will be a set of overall performance indices representing the cumulative effect of website features, which will offer a benchmark enabling hotels to evaluate their websites’ relative performance and ranking based on all relevant weighted attributes.

Rob Law
Querying Web Images by Topic and Example Specification Methods

Ever since the advent of the Internet, there has been an immense growth in the amount of image data available on the World Wide Web. With such a magnitude of available images, an efficient and effective image retrieval system is required to make use of this information. This research presents an image matching and indexing technique that improves on existing integrated image retrieval methods. The proposed system integrates query by topic and query by example specification methods. The topic-based image retrieval uses the structured format of HTML documents to retrieve relevant pages and potential match images. The query by example specification performs a content-based image match for the retrieval of smaller and relatively closer results for the example image. The main goal is to develop a functional image search and indexing system without using a database, and to demonstrate that better retrieval results can be achieved with this proposed hybrid search technique.

Ching-Cheng Lee, Rashmi Prabhakara
The Research on Fuzzy Data Mining Applied on Browser Records

With technological advances, the Internet has become an important part of everyday life. Governmental institutions and enterprises tend to advertise and market through the Internet. With the traversal records of browsers, one can analyze Web page preferences, better understand the demands of consumers, and improve advertising and marketing. In this study, we use the Maximum Forward Reference (MFR) algorithm to find the travel patterns of browsers from Web logs. Simultaneously, experts are asked to evaluate fuzzy importance weightings for different Web pages. Finally, we employ a fuzzy data mining technique that combines the Apriori algorithm with fuzzy weights to determine association rules. From the yielded association rules, one can accurately learn what information consumers need and which Web pages they prefer. This is important to governmental institutions and enterprises: enterprises can find commercial opportunities and improve Web page design by means of this study, and governmental institutions can learn the needs of the people from the obtained association rules, promote policy more efficiently, and provide better services.

Qingzhan Chen, Jianghong Han, Yungang Lai, Wenxiu He, Keji Mao
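The Maximum Forward Reference extraction used above can be sketched as follows: a revisited page ends the current forward path, which is emitted as a traversal pattern. The page names are hypothetical, and details such as duplicate handling follow one common reading of the MFR algorithm:

```python
def maximal_forward_references(click_path):
    # a backward reference (revisit) terminates the current forward path;
    # a path is emitted only if it was extended since the last backtrack
    refs, path, extended = [], [], False
    for page in click_path:
        if page in path:
            if extended:
                refs.append(path[:])
            path = path[:path.index(page) + 1]  # backtrack to the revisited page
            extended = False
        else:
            path.append(page)
            extended = True
    if extended:
        refs.append(path[:])
    return refs

print(maximal_forward_references(["A", "B", "C", "B", "D"]))
# [['A', 'B', 'C'], ['A', 'B', 'D']]
```

The emitted forward paths are what the fuzzy-weighted Apriori step then mines for association rules.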
Discovering Conceptual Page Hierarchy of a Web Site from User Traversal History

A Web site generally contains a wide range of topics which provide information for users with different access interests and goals. This information is not randomly scattered, but well organized under a hierarchy encoded in the hyperlink structure of the Web site, intended to mold the user’s mental model of how the information is organized. On the other hand, user traversals over hyperlinks between Web pages can reveal semantic relationships between these pages. Unfortunately, the link structure of a Web site, which represents the Web designer’s expectations about visitors, may be quite different from the organization expected by visitors to the site. Discovering the conceptual page hierarchy from a user’s angle can help Web masters gain insight into the real relationships among the Web pages and refine the link structure of the Web site to facilitate effective user navigation. In this paper, we propose a method to generate a conceptual page hierarchy of a Web site on the basis of user traversal history. We use maximal forward references to model users’ traversal behavior over the underlying link hierarchy of a Web site. We then build a weighted directed graph to represent the inter-relationships between Web pages. Finally, we apply a Maximum Spanning Tree (MST) algorithm to generate a conceptual page hierarchy of the Web site. We demonstrate the effectiveness of our approach through a preliminary experiment based on real-world Web data.

Xia Chen, Minqiang Li, Wei Zhao, Ding-Yi Chen
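The final step above, deriving a hierarchy from a weighted graph of page relationships, can be sketched with a maximum spanning tree computed by Kruskal's algorithm on edges sorted by descending weight. The page names and traversal counts are illustrative:

```python
def maximum_spanning_tree(nodes, edges):
    # edges: (weight, u, v) tuples; Kruskal's algorithm with edges taken
    # in descending weight order, using union-find to avoid cycles
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

nodes = ["home", "news", "sport", "archive"]
edges = [(10, "home", "news"), (8, "home", "sport"),
         (5, "news", "archive"), (3, "news", "sport")]
tree = maximum_spanning_tree(nodes, edges)
```

Rooting the resulting tree (e.g. at the site's entry page) turns the strongest traversal relationships into a conceptual page hierarchy.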

Biomedical Mining

Bayesian Neural Networks for Prediction of Protein Secondary Structure

A novel approach is developed for protein secondary structure prediction based on Bayesian Neural Networks (BNN). BNN usually outperforms traditional Back-Propagation Neural Networks (BPNN) due to its excellent ability to control model complexity. Results indicate that BNN achieves an average overall three-state accuracy (Q3) increase of 3.65% and 4.01% on the 4-fold cross-validation data sets and the TEST data set respectively, compared with the traditional BPNN. Meanwhile, a so-called cross-validation choice of starting values is presented, which substantially shortens the burn-in phase during the MCMC (Markov Chain Monte Carlo) simulation.

Jianlin Shao, Dong Xu, Lanzhou Wang, Yifei Wang
PromPredictor: A Hybrid Machine Learning System for Recognition and Location of Transcription Start Sites in Human Genome

In this paper we present a novel hybrid machine learning system for the recognition of gene starts in the human genome. The system predicts gene starts by extracting compositional features and CpG island information from promoter regions. It combines a new promoter recognition model, coding theory, feature selection and dimensionality reduction with a machine learning algorithm. Evaluation on human chromosomes 4, 21 and 22 yielded 64.47% sensitivity and 82.20% specificity. Comparison with three other systems revealed that our system had superior sensitivity and specificity in predicting gene starts. PromPredictor is written in MATLAB and requires MATLAB to run. PromPredictor is freely available at www.whtelecom.com/Prompredictor.htm.

Tao Li, Chuanbo Chen
Robust Ensemble Learning for Cancer Diagnosis Based on Microarray Data Classification

DNA microarray technology has been demonstrated to be an effective methodology for the diagnosis of cancers by means of microarray data classification. Although much research has been conducted in recent years on applying machine learning techniques to microarray data classification, two important issues hinder the use of conventional machine learning techniques, namely the limited availability of training samples and the existence of various uncertainties (e.g. biological variability and experimental variability). This paper presents a new ensemble machine learning approach to address these issues in order to achieve robust microarray data classification. Ensemble learning combines a set of base classifiers as a committee to make appropriate decisions when classifying new data instances. To enhance the performance of the ensemble learning process, the presented approach includes a procedure to select optimal ensemble members that maximize behavioural diversity. The proposed approach has been verified on three microarray datasets for cancer diagnosis. Experimental results have demonstrated that the classifier constructed by the proposed method outperforms not only the classifiers generated by conventional machine learning techniques, but also the classifiers generated by the widely used Bagging and Boosting ensemble learning methods.

Yonghong Peng
A Comprehensive Benchmark of the Artificial Immune Recognition System (AIRS)

Artificial Immune Systems are a new class of algorithms inspired by how the immune system recognizes, attacks and remembers intruders. This is a fascinating idea, but to be accepted for mainstream data mining applications, extensive benchmarking is needed to demonstrate the reliability and accuracy of these algorithms. In our research we focus on the AIRS classification algorithm. It has been claimed previously that AIRS consistently outperforms other algorithms; however, in those papers AIRS was compared against benchmark results from the literature. To ensure consistent conditions, we carried out benchmark tests on all algorithms using exactly the same setup. Our findings show that AIRS is a stable and robust classifier that produces roughly average results. This contrasts with earlier claims but shows AIRS is mature enough to be used for mainstream data mining.

Lingjun Meng, Peter van der Putten, Haiyang Wang
An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset

It is well accepted that many real-life datasets are full of missing data. In this paper we introduce, analyze and compare several well known treatment methods for missing data handling and propose new methods based on Naive Bayesian classifier to estimate and replace missing data. We conduct extensive experiments on datasets from UCI to compare these methods. Finally we apply these models to a geriatric hospital dataset in order to assess their effectiveness on a real-life dataset.

Peng Liu, Elia El-Darzi, Lei Lei, Christos Vasilakis, Panagiotis Chountas, Wei Huang
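As an illustration of the naive Bayes imputation idea mentioned in the abstract, a missing categorical value can be replaced by the class value maximizing the naive Bayes posterior given the record's observed attributes. This is a hedged sketch of the general technique, not the authors' code; the toy data and smoothing choices below are invented.

```python
import math
from collections import Counter, defaultdict

def nb_impute(rows, target, laplace=1.0):
    """Fill missing values (None) in column `target` with the naive Bayes
    maximum-posterior estimate built from the complete rows."""
    complete = [r for r in rows if r[target] is not None]
    prior = Counter(r[target] for r in complete)
    classes = list(prior)
    cond = defaultdict(Counter)        # (attr, attr_value) -> class counts
    attr_values = defaultdict(set)     # attr -> distinct values seen
    for r in complete:
        for a, v in r.items():
            if a != target:
                cond[(a, v)][r[target]] += 1
                attr_values[a].add(v)
    filled = []
    for r in rows:
        if r[target] is not None:
            filled.append(dict(r))
            continue
        def log_posterior(c):
            # unnormalized log-posterior; the shared normalizer cancels
            score = math.log(prior[c] / len(complete))
            for a, v in r.items():
                if a == target or v is None:
                    continue
                score += math.log((cond[(a, v)][c] + laplace) /
                                  (prior[c] + laplace * len(attr_values[a])))
            return score
        filled.append({**r, target: max(classes, key=log_posterior)})
    return filled
```

A record whose `target` field is `None` inherits the class most probable under the learned conditional counts, with Laplace smoothing guarding against zero counts.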
Parallel Genetic Algorithm and Parallel Simulated Annealing Algorithm for the Closest String Problem

In this paper, we design a genetic algorithm and a simulated annealing algorithm, together with their parallel versions, to solve the Closest String Problem. Our implementation and experiments show the usefulness of the parallel GA and SA algorithms.

Xuan Liu, Hongmei He, Ondrej Sýkora
Mining Interesting Association Rules in Medical Images

Image mining is not just an extension of data mining to the image domain but an interdisciplinary endeavor, and very few people have investigated this field systematically. Mining association rules in medical images is an important part of domain-specific image mining because several technical aspects make this problem challenging. In this paper, we extend the concept of association rule based on objects and images in medical images, and propose two algorithms to discover frequent itemsets and mine interesting association rules from medical images. We describe how to incorporate domain knowledge into the algorithms to enhance interestingness. Some interesting results are obtained by our program, and we believe many of the problems we came across are likely to appear in other domains.

Haiwei Pan, Jianzhong Li, Zhang Wei
Hybrid Feature Ranking for Proteins Classification

Hybrid feature ranking is a feature selection method which combines the speed of the filter approach and the accuracy of the wrapper approach. The main idea is a two-step procedure: first, building a sequence of feature subsets using an informational criterion, independently of the learning method; second, selecting the best subset with a cross-validation error rate evaluation that uses the learning method explicitly. In this paper, we show that in the protein discrimination domain, with few examples but numerous descriptors, taking account of descriptor redundancy when constructing candidate feature subsets, compared to a traditional approach where each descriptor is evaluated separately in the first step, reduces the size of the optimal subset and, in certain cases, improves accuracy.

Ricco Rakotomalala, Faouzi Mhamdi, Mourad Elloumi
Predicting Subcellular Localization of Proteins Using Support Vector Machine with N-Terminal Amino Composition

Prediction of protein subcellular localization is one of the hot research topics in bioinformatics. In this paper, several support vector machines (SVMs) with a newly presented coding scheme based on N-terminal amino compositions are first trained to discriminate between proteins destined for the mitochondrion, the chloroplast, the secretory pathway, and ‘other’ localizations. Then a decision unit makes the final prediction based on the outputs of the several SVMs. Tested on redundancy-reduced sets, the proposed method reached 89.6% (plant) and 91.9% (non-plant) total accuracies, which, to the best of our knowledge, are the highest ever reported on the same data sets.

Yan-fu Li, Juan Liu

Advanced Applications

The Dynamic Character Curve Adjusting Model of Electric Load Based on Data Mining Theory

The load database produced by the SCADA system contains a considerable amount of dirty data. Consequently, the data must be adjusted carefully and reasonably before being used for electric load forecasting or power system analysis. This paper proposes a dynamic and intelligent curve-adjusting model based on data mining theory. First, the Kohonen neural network is improved according to a fuzzy soft clustering algorithm which realizes the parallel computation of the fuzzy c-means soft clustering algorithm. The proposed dynamic algorithm can automatically find the new clustering center, namely the characteristic curve of the data, as the sample data are updated. Combining an RBF neural network with this dynamic algorithm, the intelligent adjusting model is introduced to identify the dirty data. The speed and dynamic performance of the model make it suitable for real-time calculation. Test results using actual data from the Jiangbei power supply bureau in Chongqing demonstrate the effectiveness and feasibility of the model.

Xiaoxing Zhang, Haijun Ren, Yuming Liu, Qiyun Cheng, Caixin Sun
Using Boosting Learning Method for Intrusion Detection

Improving the detection rate and reducing the false positive rate of detection models is an important research topic in the field of intrusion detection. This paper adopts an improved boosting method to enhance the generalization performance of an intrusion detection model based on a rule learning algorithm, and presents a boosting intrusion detection rule learning algorithm (BIDRLA). Experimental results on the standard intrusion detection dataset validate the effectiveness of BIDRLA.

Wu Yang, Xiao-Chun Yun, Yong-Tian Yang
RoleOf Relationship and Its Meta Model for Design Pattern Instantiation

This paper states that the RoleOf relationship can provide a general approach to resolving the instantiation problems of design patterns. These problems come from the fact that pattern logic is scattered across multiple business classes (classes specific to each application). This causes problems such as decreased reusability of pattern logic and loss of pattern instantiation information (the traceability and overlapping problems). To resolve these problems at the design level, an approach to design pattern instantiation based on the RoleOf relationship is proposed. In our approach, the roles of a pattern are treated as independent modeling elements, and the RoleOf relationship is used to associate a role with a business class. The meta model of the RoleOf relationship for pattern instantiation and its semantics are proposed as well. Examples are used to illustrate the approach. The implementation and behavior description of the RoleOf relationship are also presented.

Chengwan He, Fei He, Keqing He, Jin Liu, Wenjie Tu
Automatic Inspection of Industrial Sheetmetal Parts with Single Non-metric CCD Camera

A novel approach for three-dimensional reconstruction and inspection of industrial parts from an image sequence acquired with a single non-metric CCD camera is proposed. The purpose of the approach is to reconstruct and thereby inspect the manufacturing imprecision (or deformation) of industrial sheetmetal parts. A planar control grid, a non-metric image sequence and CAD-designed data are used as information sources. Principles of least-squares template matching to extract lines and points from the imagery are presented. Hybrid point-line photogrammetry is adopted to obtain an accurate wireframe model. Circles, connected arcs and lines on the part are reconstructed with a direct object-space solution. The reconstructed CAD model can be used for inspection or quality control. Experimental results are very satisfying.

Yongjun Zhang
An Advanced Implementation of a Distributed Control Scheme Based on LonWorks System over IP Networks

In this paper, an advanced distributed control scheme that connects control networks to IP networks, based on LonWorks technology (one of the control network technologies), is presented. The proposed approach is implemented using a simple programmable basic Lon node (BLN) and an IBM PS/2 (PC) compatible computer as an Internet server. The BLN is an electronic board developed in this research; it physically contains a transceiver, a Neuron chip and some memory devices. To perform the various functions of an Internet server on the PC, control software for the Lon-on-Internet system (LOIS), written in C for the GNU/Linux environment, is also developed. Our approach enables system designers to easily implement their various specific applications, simply by downloading a control program through the serial port (RS-232) of the PC.

Il-Joo Shim, Kyung-Bae Chang, Ki-Hyung Yu, Dong-Woo Cho, Kyoo-Dong Song, Gwi-Tae Park
Structural Damage Detection by Integrating Independent Component Analysis and Support Vector Machine

Structural damage detection is very important for identifying and diagnosing the nature of damage at an early stage so as to reduce catastrophic failures and prolong the service life of structures. In this paper, a novel approach is presented that integrates independent component analysis (ICA) and support vector machines (SVM). The procedure involves extracting independent components from measured sensor data through ICA and then using these signals as input data for an SVM classifier. The experiment presented employs the benchmark data from the University of British Columbia to examine the effectiveness of the method. Results showed that the accuracy of damage detection using the proposed method is significantly better than that of an approach integrating ICA and ANN. Furthermore, the prediction output can be used to identify different types and levels of structural damage.

Huazhu Song, Luo Zhong, Bo Han
An LZ78 Based String Kernel

We have shown [8] that LZ78 parse length can be used effectively for a music classification task. The parse length is used to compute a normalized information distance [6,7] which is then used to drive a simple classifier. In this paper we explore a more subtle use of the LZ78 parsing algorithm. Instead of simply counting the parse length of a string, we use the coding dictionary constructed by LZ78 to derive a valid string kernel for a Support Vector Machine (SVM). The kernel is defined over a feature space indexed by all the phrases identified by our (modified) LZ78 compression algorithm. We report experiments with our kernel approach on two datasets: (i) a collection of MIDI files and (ii) Reuters-21578. We compare our technique with an n-gram based kernel. Our results indicate that the LZ78 kernel technique has a performance similar to that obtained with the best n-gram kernel, but with significantly lower computational overhead, and without requiring a search for the optimal value of n.

Ming Li, Ronan Sleep
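The kernel construction can be sketched as follows: LZ78 parses each string into a phrase dictionary, strings become phrase-count vectors, and the kernel is the inner product of those vectors. This is an illustrative reading of the abstract, not the authors' (modified) implementation.

```python
from collections import Counter

def lz78_phrases(s):
    """Multiset of phrases produced by an LZ78 parse of s."""
    dictionary = {""}              # phrases seen so far
    phrases = Counter()
    w = ""
    for c in s:
        if w + c in dictionary:
            w += c                 # extend the current match
        else:
            dictionary.add(w + c)  # new phrase = longest match + next char
            phrases[w + c] += 1
            w = ""                 # start a new phrase
    if w:
        phrases[w] += 1            # flush the trailing partial phrase
    return phrases

def lz78_kernel(s, t):
    """Inner product of the LZ78 phrase-count feature vectors of s and t."""
    ps, pt = lz78_phrases(s), lz78_phrases(t)
    return sum(n * pt[p] for p, n in ps.items())
```

Because the kernel is an explicit dot product in a phrase-indexed feature space, it is a valid (positive semi-definite) kernel and can be plugged into any SVM library directly.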
Classifying Class and Finding Community in UML Metamodel Network

Big software, composed of many classes or modules, can be represented with a network model. By extracting the topology of the UML metamodel from the UML metamodel specification, its scale-free, small-world network properties are revealed. Based on this observation, we propose algorithms that classify all classes in the UML metamodel into three kinds: core, general and leaf. Our algorithm categorizes all classes into several subgroups by three factors, i.e., degree, betweenness and weak links. A case study illustrates that this algorithm is effective at mining community structure in large software systems.

Bin Liu, Deyi Li, Jin Liu, Fei He
An Adaptive Network Intrusion Detection Method Based on PCA and Support Vector Machines

Network intrusion detection is an important technique in computer security. However, the performance of existing intrusion detection systems (IDSs) is unsatisfactory since new attacks are constantly developed and network traffic volumes increase rapidly. To improve the performance of IDSs in both accuracy and speed, this paper proposes a novel adaptive intrusion detection method based on principal component analysis (PCA) and support vector machines (SVMs). By making use of PCA, the dimension of network data patterns is reduced significantly. Multi-class SVMs are employed to construct classification models based on training data processed by PCA. Due to the generalization ability of SVMs, the proposed method has good classification performance without tedious parameter tuning, and dimension reduction using PCA may improve accuracy further. The method is also superior to SVMs without PCA in training and detection speed. Experimental results on the KDD-Cup99 intrusion detection data illustrate the effectiveness of the proposed method.

Xin Xu, Xuening Wang
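The PCA preprocessing stage of such a pipeline can be sketched with plain NumPy. This is an illustrative reduction step under our own assumptions, not the authors' code; in their setting the projected data would then be fed to the multi-class SVMs.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto its k principal components."""
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = np.cov(Xc, rowvar=False)           # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # k leading directions
    return Xc @ top                          # n x k reduced representation
```

Reducing the dimension of the traffic feature vectors before SVM training is what yields the speed gains the abstract reports: both training and per-packet detection operate on k features instead of the full pattern dimension.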
Improved Grid Information Service Using the Idea of File-Parted Replication

The infrastructure of grid information service, constituted of highly distributed information providers and an aggregate directory, is put forward in this paper on the basis of the characteristics of grid information resources. The Lightweight Directory Access Protocol (LDAP), one of the base protocols, is also analyzed: LDAP is effectively a distributed database, and dynamic updating and replication of the LDAP directory tree happen frequently. To address this, we propose that the strategies of fast spread and cascading spread can boost the efficiency of the grid information service system. Moreover, we use a file-parted replication approach to divide the LDAP database file into several blocks that are then replicated in parallel between LDAP server points; in this way, the efficiency of parallel processing can be boosted by a large margin. In addition, based on this idea, we put forward the technical infrastructure and an upload-controlling algorithm, both of which are proven effective in improving system efficiency.

Jingwei Huang, Qingfeng Fan, Qiongli Wu, Yanxiang He
Dynamic Shape Modeling of Consumers’ Daily Load Based on Data Mining

The shape characteristic of the daily power consumption of consumers can be applied to guide their power consumption behaviors and improve the load structure of the power system. It is also the basis for obtaining the shape characteristic of the daily power consumption of a trade and for research into the state estimation of distribution networks. Traditional analytical approaches are limited to qualitative analysis with only small coverage. To overcome these drawbacks, we propose a model which performs in-depth analysis of customer power consumption behaviors by data mining through similar-sequence analysis. The model uses real-time sampling of consumers' energy data to form the shape characteristic curves. The application and testing of the model on a real instance are analyzed in this paper.

Lianmei Zhang, Shihong Chen, Qiping Hu
A Study on the Mechanism of Virtual SAN-Based Spatial Data Storage with Double-Thoroughfare in Grid

Combining the advantages of network and virtual SAN technology, this paper focuses on the characteristics of spatial data storage and proposes a virtual SAN-based architecture for grid GIS data storage. A corresponding experimental system has been designed to verify this proposal. For the storage facilities, FC- and iSCSI-based data double-thoroughfares have been designed in the proposed system to fulfill the efficient storage and retrieval of huge amounts of spatial data. This solution has practical and applicable value, providing a technological foundation for exposed access to and open interoperability of spatial data.

Jinsong Gao, Wen Zhang, Zequn Guan
A BP Neural Network Predictor Model for Desulfurizing Molten Iron

Desulfurization of molten iron is one stage of the steel production process. A back-propagation (BP) artificial neural network (ANN) model is developed in this paper to predict the operation parameters for the desulfurization process. The primary objective of the BP neural network predictor model is to assign the operation parameters on the basis of an intelligent algorithm instead of the experience of operators. This paper presents a mathematical model and development methodology for predicting the three main operation parameters and optimizing the consumption of desulfurizer. Furthermore, a software package is developed based on this BP ANN predictor model. Finally, the feasibility of using neural networks to model the complex relationship between the parameters is investigated.

Zhijun Rong, Binbin Dan, Jiangang Yi
A Flexible Report Architecture Based on Association Rules Mining

This paper proposes a flexible report architecture based on association rules data mining. A three-layer architecture is proposed, namely an origin-data layer, a data-processing layer, and a format layer. These three layers are linked by a data variant tree in a power information management system. Users can modify report formats as well as data whenever needed. In the origin-data layer, a data warehouse is used to provide data from multiple databases. In the data-processing layer, on-line analytical processing (OLAP) and association rules are used to enhance template-making for reports. A smart solution to the problem of fixed report templates is provided, and information in a power information management system can be shared. In some sense it can serve as an all-purpose tool to generate reports with great flexibility.

Qiping Hu

Security and Privacy Issues

Privacy Preserving Naive Bayes Classification

Privacy-preserving data mining aims to discover accurate patterns without precise access to the original data. In this paper, we combine the two strategies of data transformation and data hiding to propose a new randomization method, Randomized Response with Partial Hiding (RRPH), for distorting the original data. Then, an effective naive Bayes classifier is presented to predict the class labels for unknown samples from the data distorted by RRPH. As shown in the analytical and experimental results, our method obtains significant improvements in terms of privacy, accuracy, and applicability.

Peng Zhang, Yunhai Tong, Shiwei Tang, Dongqing Yang
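The randomized-response idea behind RRPH can be illustrated for a single binary attribute with the classic scheme it builds on. This is a sketch of plain randomized response only; the partial-hiding extension described in the paper is more involved.

```python
import random

def distort(bit, p):
    """Report the true bit with probability p, the flipped bit otherwise."""
    return bit if random.random() < p else 1 - bit

def estimate_true_frequency(reported, p):
    """Unbiased estimate of P(bit = 1) recovered from distorted reports:
    E[observed] = p*true + (1 - p)*(1 - true), so invert the linear map."""
    observed = sum(reported) / len(reported)
    return (observed - (1 - p)) / (2 * p - 1)
```

Each individual record stays private (any single report may be a lie), yet the miner can still recover aggregate frequencies, which is exactly what a naive Bayes classifier needs.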
A Further Study on Inverse Frequent Set Mining

Frequent itemset mining is a common task in data mining from which association rules are derived. As frequent itemsets can be considered a kind of summary of the original database, the inverse frequent set mining problem has recently received more attention because of its potential threat to the privacy of the original dataset. Since this inverse problem has been proven to be NP-complete, people ask “Are there reasonably efficient search strategies to find a compatible data set in practice?” [1]. This paper describes our effort towards finding a feasible solution to this problem.

Xia Chen, Maria Orlowska

Spatial Data Mining

Mathematical Analysis of Classifying Convex Clusters Based on Support Functionals

Classification is one of the core topics in data mining. This paper studies the geometry of classifying convex clusters based on support functionals in the dual spaces. For the convex clusters that are to be classified, a combination of linear discriminant functions can solve the problem. The geometrical depiction of linear discriminant functions and supporting hyperplanes for the convex clusters helps to characterize the relations between the convex clusters, while the distances to the convex clusters and to the complements of convex clusters calibrate the measures between the support functionals and the convex clusters. Examples are given.

Xun Liang
Linear Belts Mining from Spatial Database with Mathematical Morphological Operators

In order to mine one typical kind of non-spherical cluster, linear belts, in a spatial database, a method based on mathematical morphological operators is proposed in this paper. The method can be divided into two basic steps: first, the most suitable re-segmenting scale is found by our clustering algorithm MSCMO, which is based on mathematical morphological scale space; second, the segmented result at this scale is re-segmented to obtain the final linear belts. The method is robust to semi-linear clusters and noise, which is validated by the successful extraction of seismic belts.

Min Wang, Jiancheng Luo, Chenghu Zhou
Spatial Information Multi-grid for Data Mining

Driven by the issue of geo-spatial data mining in a grid computing environment, a new representation method called spatial information multi-grid (SIMG) for depicting spatial data and spatial information is presented in this paper. A strategy of dividing the globe with a multi-level spatial grid is proposed. Through further study of the precision of description of features of detail, this paper divides the globe with SIMG and constructs the framework of SIMG in China. Based on SIMG, this paper attempts to realize data mining of different thematic information at the same geographical position, data mining of dynamic spatial-temporal information of the same thematic content, and data mining and transformation across different coordinate systems, making SIMG a foundation for further development of spatial information sharing and data mining.

Zhenfeng Shao, Deren Li
A Uniform Framework of 3D Spatial Data Model and Data Mining from the Model

A uniform framework of 3D spatial data model (UF-3DSDM) is proposed in this paper. It is a union of various 3D spatial data models, and any actual 3D spatial data model is its subset. Based on the UF-3DSDM, a stratum modeling method using QTPV is further developed on real borehole sample data. In the context of the 3D QTPV model, a data mining method is given on the surface DEM of the 3D data.

Peng-gen Cheng
Mining Standard Land Price with Tension Spline Function

Standard land price is an economic indicator for measuring land value. In this paper, we propose to use the tension spline interpolation function to mine standard land price. First, we extend the definition of standard land price to be based on a land region composed of several neighboring land parcels with the same or similar features. Second, the regional factors that affect the standard land price are classified into the geometric features of point, line and area according to quantitative rules. Third, a tension spline interpolation function, determined by the influential factors, is proposed to mine the standard land price. Finally, as a case study, the proposed method is applied to mine land prices for Nanning City in China. The case study shows that the proposed method is practical and satisfactory.

Hanning Yuan, Wenzhong Shi, Jiabing Sun

Streaming Data Mining

Mining Recent Frequent Itemsets in Data Streams by Radioactively Attenuating Strategy

We propose a novel approach for mining recent frequent itemsets, with three key contributions. First, it is a single-scan algorithm which utilizes the special property of suffix-trees to guarantee that all frequent itemsets are mined; during the itemset-growth phase it is unnecessary to traverse the suffix-trees, which are the data structure storing the summary information of the data. Second, our algorithm adopts a novel method for itemset growth which includes two special kinds of itemset-growth operations to avoid generating any candidate itemset. Third, we devise a new attenuating strategy inspired by the decay of radioactive elements in nature, and apply it in the algorithm to distinguish the influence of the latest transactions from that of obsolete transactions. Detailed experiments confirm that the new method has excellent scalability and delivers better quality and efficiency.

Lifeng Jia, Zhe Wang, Chunguang Zhou, Xiujuan Xu
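The radioactive-attenuation strategy can be sketched as exponential decay of itemset counts, so obsolete transactions fade like a decaying isotope. This is an illustrative sketch only; the `half_life` parameter and the lazy-decay bookkeeping are our own assumptions, not details from the paper.

```python
class DecayingCounter:
    """Itemset counts that attenuate like a radioactive element:
    a count left untouched for one half-life loses half its weight."""
    def __init__(self, half_life):
        self.half_life = half_life
        self.counts = {}                  # itemset -> (weight, last_update)

    def add(self, itemset, now):
        weight, last = self.counts.get(itemset, (0.0, now))
        weight *= 2.0 ** (-(now - last) / self.half_life)   # decay old mass
        self.counts[itemset] = (weight + 1.0, now)

    def support(self, itemset, now):
        weight, last = self.counts.get(itemset, (0.0, now))
        return weight * 2.0 ** (-(now - last) / self.half_life)
```

Because decay is applied lazily at update and query time, the structure needs no periodic sweep over all itemsets, which suits the one-pass stream setting.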
User Subjectivity in Change Modeling of Streaming Itemsets

Online mining of changes from data streams is an important problem in view of the growing number of applications such as network flow analysis, e-business, and stock market analysis. Monitoring these changes is a challenging task because of the high-speed, high-volume, only-one-look characteristics of data streams. User subjectivity in monitoring and modeling the changes adds to the complexity of the problem.

This paper addresses the problems of (i) capturing user subjectivity and (ii) change modeling in applications that monitor the frequency behavior of itemsets. We propose a three-stage strategy for focusing on itemsets which are of current interest to the user, and introduce metrics that model changes in their frequency (support) behavior.

Vasudha Bhatnagar, Sarabjeet Kaur Kochhar
A Grid-Based Clustering Algorithm for High-Dimensional Data Streams

The three main requirements for clustering data streams on-line are a single pass over the data, high processing speed, and a small memory footprint. We propose an algorithm that fulfills these requirements by introducing an incremental grid data structure to summarize the data streams on-line. To deal with high-dimensional problems, the algorithm adopts a simple heuristic method to select a subset of dimensions on which all clustering operations are performed. Our performance study with a real network intrusion detection stream dataset demonstrates the efficiency and effectiveness of the proposed algorithm.

Yansheng Lu, Yufen Sun, Guiping Xu, Gang Liu
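The incremental grid summary can be sketched as a hash map from cell coordinates to point counts, maintained in a single pass. This is a simplification of the abstract's idea; the paper additionally selects a subset of dimensions for high-dimensional streams, which is omitted here.

```python
from collections import defaultdict

class StreamGrid:
    """One-pass grid summary: each arriving point increments its cell."""
    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(int)     # cell coordinates -> point count

    def insert(self, point):
        key = tuple(int(x // self.cell_size) for x in point)
        self.cells[key] += 1

    def dense_cells(self, min_count):
        """Cells with enough points to seed a cluster."""
        return {k: c for k, c in self.cells.items() if c >= min_count}
```

Each point is touched exactly once and memory grows only with the number of occupied cells, matching the one-pass, low-memory requirements listed in the abstract; clusters would then be formed by merging adjacent dense cells.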
Backmatter
Metadata
Title
Advanced Data Mining and Applications
Editors
Xue Li
Shuliang Wang
Zhao Yang Dong
Copyright Year
2005
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-31877-4
Print ISBN
978-3-540-27894-8
DOI
https://doi.org/10.1007/b11111
