
About This Book

With the ever-growing power of generating, transmitting, and collecting huge amounts of data, information overload is now an imminent problem for mankind. The overwhelming demand for information processing is not just about a better understanding of data, but also a better usage of data in a timely fashion. Data mining, or knowledge discovery from databases, is proposed to gain insight into aspects of data and to help people make informed, sensible, and better decisions. At present, growing attention has been paid to the study, development, and application of data mining. As a result there is an urgent need for sophisticated techniques and tools that can handle new fields of data mining, e.g., spatial data mining, biomedical data mining, and mining on high-speed and time-variant data streams. The knowledge of data mining should also be expanded to new applications. The 6th International Conference on Advanced Data Mining and Applications (ADMA 2010) aimed to bring together the experts on data mining throughout the world. It provided a leading international forum for the dissemination of original research results in advanced data mining techniques, applications, algorithms, software and systems, and different applied disciplines. The conference attracted 361 online submissions from 34 different countries and areas. All full papers were peer reviewed by at least three members of the Program Committee composed of international experts in data mining fields. A total of 118 papers were accepted for the conference. Amongst them, 63 papers were selected as regular papers and 55 papers were selected as short papers.

Table of Contents

Frontmatter

III Data Mining Methodologies and Processes

Incremental Learning by Heterogeneous Bagging Ensemble

Classifier ensembles are a main direction of incremental learning research, and many ensemble-based incremental learning methods have been presented. Among them, Learn++, which is derived from the famous ensemble algorithm AdaBoost, is special: Learn++ can work with any type of classifier, whether or not it is specially designed for incremental learning, which means Learn++ potentially supports heterogeneous base classifiers. Based on extensive experiments we analyze the advantages and disadvantages of Learn++. We then present a new ensemble incremental learning method, Bagging++, which is based on another famous ensemble method, Bagging. The experimental results show that Bagging ensembles are a promising approach to incremental learning, and that heterogeneous Bagging++ achieves better generalization and learning speed than the other compared methods such as Learn++ and NCL.

Qiang Li Zhao, Yan Huang Jiang, Ming Xu
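The Bagging-based incremental scheme described above can be illustrated with a minimal sketch (this is not the authors' implementation; all names and the toy base learner are invented for the example): each incoming batch yields a small bagged sub-ensemble of bootstrap-trained base learners, here a 1-nearest-neighbour rule on 1-D points standing in for arbitrary heterogeneous classifiers, and the final prediction is a majority vote over all members.

```python
import random

def knn1(train, x):
    # toy base learner: 1-nearest neighbour over (value, label) pairs
    return min(train, key=lambda p: abs(p[0] - x))[1]

def add_batch(ensemble, batch, rng, members=9):
    # Bagging++-style increment: train a fresh bagged sub-ensemble on the
    # new batch and append it; earlier members are never retrained
    for _ in range(members):
        ensemble.append([rng.choice(batch) for _ in batch])  # bootstrap sample

def predict(ensemble, x):
    votes = [knn1(member, x) for member in ensemble]
    return max(set(votes), key=votes.count)  # majority vote across batches

rng = random.Random(0)
ensemble = []
add_batch(ensemble, [(0.1, 'a'), (0.2, 'a'), (0.9, 'b'), (1.0, 'b')], rng)
add_batch(ensemble, [(0.15, 'a'), (1.1, 'b')], rng)  # incremental batch
print(predict(ensemble, 0.12))
```

Because old members are kept and only new sub-ensembles are added, previously learned knowledge is retained, which is the property Learn++ and Bagging++ share.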

CPLDP: An Efficient Large Dataset Processing System Built on Cloud Platform

Data-intensive applications are widespread, such as massive data mining, search engines, and high-throughput computing in bioinformatics. Data processing becomes a bottleneck as the scale keeps growing, and the cost of processing large-scale datasets increases dramatically in traditional relational databases, because the traditional technology inclines toward high-performance computers. The rise of cloud computing brings a new solution for data processing due to its easy scalability, robustness, large-scale storage, and high performance. It provides a cost-effective platform on which to implement distributed parallel data processing algorithms. In this paper, we propose CPLDP (Cloud based Parallel Large Data Processing System), an innovative MapReduce-based parallel data processing system developed to satisfy the urgent requirements of large data processing. In CPLDP, we propose a new method called operation dependency analysis to model data processing workflows and, furthermore, to reorder and combine operations when possible. Such optimization reduces intermediate file reads and writes. Performance tests show that this workflow optimization reduces both processing time and intermediate results.

Zhiyong Zhong, Mark Li, Jin Chang, Le Zhou, Joshua Zhexue Huang, Shengzhong Feng

A General Multi-relational Classification Approach Using Feature Generation and Selection

Multi-relational classification is an important data mining task, since much real-world data is organized in multiple relations. The major challenges come from, firstly, the large, high-dimensional search spaces due to the many attributes in multiple relations and, secondly, the high computational cost of feature selection and classifier construction due to the high structural complexity of multiple relations. The existing approaches mainly use inductive logic programming (ILP) techniques to derive hypotheses or extract features for classification. However, those methods are often slow and sometimes cannot provide enough information to build effective classifiers. In this paper, we develop a general approach for accurate and fast multi-relational classification using feature generation and selection. Moreover, we propose a novel similarity-based feature selection method for multi-relational classification. An extensive performance study on several benchmark data sets indicates that our approach is accurate, fast, and highly scalable.

Miao Zou, Tengjiao Wang, Hongyan Li, Dongqing Yang

A Unified Approach to the Extraction of Rules from Artificial Neural Networks and Support Vector Machines

Support Vector Machines (SVM) are believed to be as powerful as Artificial Neural Networks (ANN) in modeling complex problems while avoiding some of the drawbacks of the latter, such as local minima or reliance on architecture. However, a question that remains to be answered is whether SVM users may expect improvements in the interpretability of their models, namely by using rule extraction methods already available to ANN users. This study successfully applies the Orthogonal Search-based Rule Extraction algorithm (OSRE) to Support Vector Machines. The study demonstrates the portability of rules extracted using OSRE, showing that, in the case of SVM, extracted rules are as accurate and consistent as those from equivalent ANN models. Importantly, the study also shows that the OSRE method benefits from SVM-specific characteristics, being able to extract fewer rules from SVM than from equivalent ANN models.

João Guerreiro, Duarte Trigueiros

A Clustering-Based Data Reduction for Very Large Spatio-Temporal Datasets

Today, huge amounts of data with spatial and temporal components are being collected from sources such as meteorological measurements and satellite imagery. Efficient visualisation, as well as discovery of useful knowledge from these datasets, is therefore very challenging and is becoming a massive economic need. Data mining has emerged as the technology to discover hidden knowledge in very large amounts of data. Furthermore, data mining techniques can be applied to reduce the size of raw data by retrieving representative useful knowledge. Consequently, instead of dealing with a large volume of raw data, we can use these representatives for visualisation or analysis without losing important information. This paper presents a new approach based on different clustering techniques for data reduction to help analyse very large spatio-temporal data. We also present and discuss preliminary results of this approach.

Nhien-An Le-Khac, Martin Bue, Michael Whelan, M-Tahar Kechadi
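The idea of replacing raw data by cluster representatives can be sketched as follows (an illustration only; the paper combines several clustering techniques, and the function name here is invented). A plain one-dimensional k-means pass reduces the dataset to its cluster centroids, which then stand in for the raw points during visualisation or analysis.

```python
def kmeans_reduce(points, centers, iters=10):
    # Lloyd's algorithm in one dimension; the returned centroids serve
    # as compact representatives of the raw data
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centers = [sum(m) / len(m) for m in clusters.values() if m]
    return sorted(centers)

# four raw readings collapse to two representatives
reps = kmeans_reduce([1.0, 2.0, 9.0, 10.0], centers=[0.0, 5.0])
print(reps)  # [1.5, 9.5]
```

Downstream analysis then operates on `reps` instead of the full point set, which is the data-reduction step the abstract describes.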

Change a Sequence into a Fuzzy Number

In the literature, practitioners generally transform sequences of real numbers into fuzzy numbers using the median or the average, thereby following the probabilistic path. Theoreticians, however, do not investigate transformations of real numbers into fuzzy numbers when they analyse fuzzy numbers; they usually operate only on fuzzy data. In this paper we describe an algorithm for transforming a sequence of real numbers into a fuzzy number. The algorithms presented are used to transform multidimensional matrices constructed from time series into fuzzy matrices. They were created for a special fuzzy number, and using it as an example we show how to proceed. The algorithms were used in one of the stages of a model for forecasting pollution concentrations with the help of fuzzy numbers. The data used in the computations came from the Institute of Meteorology and Water Management (IMGW).

Diana Domańska, Marek Wojtylak

Multiple Kernel Learning Improved by MMD

When training and testing data are drawn from different distributions, the performance of the classification model will be low. Such a problem usually arises from sample selection bias or transfer learning scenarios. In this paper, we propose a novel multiple kernel learning framework improved by Maximum Mean Discrepancy (MMD) to solve the problem. This new model not only utilizes the capacity of kernel learning to construct a nonlinear hyperplane which maximizes the separation margin, but also simultaneously reduces the distribution discrepancy between training and testing data, measured by MMD. The approach is formulated as a bi-objective optimization problem, and an efficient optimization algorithm based on gradient descent and quadratic programming [13] is adopted to solve it. Extensive experiments on UCI and text datasets show that the proposed model outperforms the traditional multiple kernel learning model in sample selection bias and transfer learning scenarios.

Jiangtao Ren, Zhou Liang, Shaofeng Hu
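As a concrete aside, the MMD term the model minimizes can be estimated directly from samples. A minimal (biased) estimator with an RBF kernel on scalar data might look like this, as a generic sketch rather than the paper's code:

```python
import math

def rbf(x, y, gamma=1.0):
    # Gaussian (RBF) kernel on scalars
    return math.exp(-gamma * (x - y) ** 2)

def mmd2(xs, ys, gamma=1.0):
    # biased sample estimate of squared Maximum Mean Discrepancy:
    # mean k(x,x') + mean k(y,y') - 2 * mean k(x,y)
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy

# overlapping samples give a small MMD; shifted samples give a large one
close = mmd2([0.0, 0.1, 0.2], [0.05, 0.15, 0.25])
far = mmd2([0.0, 0.1, 0.2], [2.0, 2.1, 2.2])
print(close, far)
```

A small MMD between training and test samples indicates similar distributions, which is exactly the quantity the bi-objective formulation trades off against the separation margin.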

A Refinement Approach to Handling Model Misfit in Semi-supervised Learning

Semi-supervised learning has been the focus of machine learning and data mining research in the past few years. Various algorithms and techniques have been proposed, from generative models to graph-based algorithms. In this work, we focus on the cluster-and-label approaches for semi-supervised classification. Existing cluster-and-label algorithms are based on some underlying models and/or assumptions. When the data fits the model well, the classification accuracy will be high; otherwise, the accuracy will be low. In this paper, we propose a refinement approach to address the model misfit problem in semi-supervised classification. We show that we do not need to change the cluster-and-label technique itself to make it more flexible. Instead, we propose to use successive refinement clustering of the dataset to correct the model misfit. A series of experiments on UCI benchmark data sets has shown that the proposed approach outperforms existing cluster-and-label algorithms, as well as traditional semi-supervised classification techniques including self-training and tri-training.

Hanjing Su, Ling Chen, Yunming Ye, Zhaocai Sun, Qingyao Wu

Soft Set Approach for Selecting Decision Attribute in Data Clustering

This paper presents the applicability of soft set theory for discovering a decision attribute in information systems, based on the notion of mapping inclusion in soft set theory. The proposed technique is implemented on an example test case and one UCI benchmark dataset, the US Census 1990 dataset. The results on the test case show that the selected decision attribute is equivalent to that obtained under rough set theory.

Mohd Isa Awang, Ahmad Nazari Mohd Rose, Tutut Herawan, Mustafa Mat Deris

Comparison of BEKK GARCH and DCC GARCH Models: An Empirical Study

Modeling the volatility and co-volatility of a few zero-coupon bonds is a fundamental element of fixed-income risk evaluation. The multivariate GARCH model (MGARCH), an extension of the well-known univariate GARCH, is one of the most useful tools for modeling the co-movement of multivariate time series with a time-varying covariance matrix. Grounded in a review of various formulations of the multivariate GARCH model, this paper estimates two MGARCH models, in BEKK and DCC form respectively, based on data for three AAA-rated Euro zero-coupon bonds with different maturities (6 months/1 year/2 years). Post-model diagnostics indicate satisfactory fitting performance of these estimated MGARCH models. Moreover, this paper compares the goodness of fit and forecasting performance of these forms using the mean absolute error (MAE) criterion. Throughout this application, the conclusion can be drawn that significant fitting and forecasting performance originates from the trade-off between parsimony and flexibility of the MGARCH models.

Yiyu Huang, Wenjing Su, Xiang Li

Adapt the mRMR Criterion for Unsupervised Feature Selection

Feature selection is an important task in data analysis. mRMR is an equivalent form of the maximal statistical dependency criterion based on mutual information for first-order incremental supervised feature selection. This paper presents a novel feature selection criterion which can be considered the unsupervised version of mRMR. The concepts of relevance and redundancy are both incorporated in the criterion. The effectiveness of the new unsupervised feature selection criterion is confirmed by theoretical proof. Experimental validation is also conducted on several popular data sets, and the results show that the new criterion can select features highly correlated with the latent class variable.

Junling Xu
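The relevance/redundancy trade-off at the heart of mRMR rests on mutual information, which for discrete features can be computed directly from co-occurrence counts. The sketch below shows the classical supervised-style step only; the paper's unsupervised criterion replaces the class variable, and the helper names are invented for illustration.

```python
from collections import Counter
import math

def mutual_info(xs, ys):
    # mutual information between two discrete sequences, in nats
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def mrmr_step(candidates, selected, target):
    # greedy mRMR step: pick the candidate feature maximizing
    # relevance (MI with target) minus mean redundancy (MI with selected)
    def score(f):
        red = (sum(mutual_info(f, s) for s in selected) / len(selected)
               if selected else 0.0)
        return mutual_info(f, target) - red
    return max(candidates, key=score)
```

For example, a feature identical to the target has MI equal to its entropy, while an independent feature has MI zero, so `mrmr_step` prefers the former when nothing is selected yet.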

Evaluating the Distance between Two Uncertain Categorical Objects

Evaluating distances between uncertain objects is needed by some distance-based uncertain data mining techniques. An uncertain object can be described by uncertain numerical or categorical attributes. However, most uncertain data mining algorithms mainly discuss methods for evaluating distances between uncertain numerical objects. In this paper, an efficient method for evaluating distances between uncertain categorical objects is presented and used in nearest-neighbour classification. Experiments with datasets based on UCI datasets and the plant dataset of the “Three Parallel Rivers of Yunnan Protected Areas” verify that the method is efficient.

Hongmei Chen, Lizhen Wang, Weiyi Liu, Qing Xiao
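For intuition, one common way to define a distance between uncertain categorical values, each given as a probability distribution over categories, is the probability that two independent draws disagree. This is an illustrative choice, not necessarily the paper's exact measure:

```python
def uncertain_distance(p, q):
    # p, q: dicts mapping category -> probability
    # distance = P(independent draws from p and q differ)
    shared = set(p) & set(q)
    return 1.0 - sum(p[c] * q[c] for c in shared)

# identical certain values are at distance 0, disjoint ones at distance 1
assert uncertain_distance({'oak': 1.0}, {'oak': 1.0}) == 0.0
assert uncertain_distance({'oak': 1.0}, {'pine': 1.0}) == 1.0
print(uncertain_distance({'oak': 0.5, 'pine': 0.5}, {'oak': 0.5, 'pine': 0.5}))
```

Such a distance plugs directly into a nearest-neighbour classifier, which is the use case the abstract reports.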

Construction Cosine Radial Basic Function Neural Networks Based on Artificial Immune Networks

In this paper, we propose a novel intrusion detection algorithm utilizing both an artificial immune network and an RBF neural network. The proposed anomaly detection method first uses a multiple-granularity artificial immune network algorithm to obtain the candidate hidden neurons, and then trains a cosine RBF neural network with a gradient descent learning process. The principal interest of this work is to benchmark the performance of the proposed algorithm on the KDD Cup 99 data set, the benchmark dataset used by IDS researchers. It is observed that the proposed approach gives better performance than some traditional approaches.

YongJin Zeng, JianDong Zhuang

Spatial Filter Selection with LASSO for EEG Classification

Spatial filtering is an important preprocessing step for electroencephalogram (EEG) signals. Extreme energy ratio (EER) is a recently proposed method to learn spatial filters for EEG classification. It selects several eigenvectors from the top and bottom of the eigenvalue spectrum resulting from a spectral decomposition to construct a group of spatial filters as a filter bank. However, that strategy has some limitations, and the spatial filters in the group are often selected improperly. As a result, the energy features produced by the filter bank do not contain enough discriminative information, or severely overfit on small training samples. This paper utilizes one of the penalized feature selection strategies, LASSO, to construct a spatial filter bank termed the LASSO spatial filter bank, which learns a better selection of spatial filters. Two different classification methods are then presented to evaluate the LASSO spatial filter bank. Their excellent performance in the experimental results demonstrates the strong generalization ability of the LASSO spatial filter bank.

Wenting Tu, Shiliang Sun
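The l1 penalty behind LASSO performs selection through soft-thresholding: coefficients whose correlation with the target falls below the regularization level are set exactly to zero, and the corresponding spatial filters would simply be dropped from the bank. A minimal sketch of the operator follows; it is a generic illustration of the mechanism, not the paper's solver, and the example numbers are invented.

```python
def soft_threshold(z, lam):
    # proximal operator of the l1 penalty: shrink toward zero, clip to zero
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# with orthonormal features, the LASSO coefficient of each feature is the
# soft-thresholded correlation with the target: weak filters drop out exactly
correlations = [0.9, 0.3, -0.05]
weights = [soft_threshold(c, 0.1) for c in correlations]
print(weights)
```

The exact zeros are what make LASSO a selection method rather than a mere shrinkage method, which is why it suits picking a subset of spatial filters.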

Boolean Algebra and Compression Technique for Association Rule Mining

Association rule mining is a promising technique for finding hidden patterns in a database, and the main issue is mining association rules from large databases. One of the most famous association rule learning algorithms is Apriori, which generates association rules from frequent itemsets. The drawback of the Apriori algorithm is that the database must be re-read as many times as candidates are generated, and many research papers have been published trying to reduce the amount of time spent reading data from the database. In this paper, we propose a new algorithm that works rapidly: Boolean Algebra and Compression technique for Association rule Mining (B-Compress) compresses the database and tremendously reduces the number of database scans. Boolean algebra combines, compresses, generates candidate itemsets, and counts the number of candidates. In our experiments, B-Compress achieves ten times higher mining efficiency in execution time than Apriori.

Somboon Anekritmongkol, M. L. Kulthon Kasamsan
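The gain from a Boolean representation is easy to see in miniature: encode each item once as a bitmask over transactions, and the support of any candidate itemset becomes a bitwise AND plus a popcount, with no further database scans. This is a generic sketch of the idea with invented data, not the B-Compress implementation:

```python
transactions = [{'a', 'b', 'c'}, {'a', 'c'}, {'a', 'd'}, {'b', 'c'}]

# one bitmask per item: bit t is set iff transaction t contains the item
masks = {}
for t, tr in enumerate(transactions):
    for item in tr:
        masks[item] = masks.get(item, 0) | (1 << t)

def support(itemset):
    # AND the item bitmasks instead of rescanning the database
    m = (1 << len(transactions)) - 1
    for item in itemset:
        m &= masks[item]
    return bin(m).count('1')

print(support({'a', 'c'}))  # transactions 0 and 1 contain both -> 2
```

Each candidate's support check costs a few machine-word operations, which is where the order-of-magnitude speedup over repeated database scans comes from.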

Cluster Based Symbolic Representation and Feature Selection for Text Classification

In this paper, we propose a new method of representing documents based on clustering of term frequency vectors. For each class of documents we propose to create multiple clusters to preserve the intra-class variations. Term frequency vectors of each cluster are used to form a symbolic representation by the use of interval-valued features. Subsequently we propose a novel symbolic method for feature selection. The corresponding symbolic text classification is also presented. To corroborate the efficacy of the proposed model we conducted experiments on various datasets. Experimental results reveal that the proposed method gives better results when compared to state-of-the-art techniques. In addition, as the method is based on a simple matching scheme, it requires negligible time.

B. S. Harish, D. S. Guru, S. Manjunath, R. Dinesh

SimRate: Improve Collaborative Recommendation Based on Rating Graph for Sparsity

Collaborative filtering is a widely used recommendation method, but it suffers from the sparsity problem: when rating data are too few to compute the similarity of users, it fails and may even produce erroneous recommendations. In this paper, the notion of SimRank is used to overcome the problem. In particular, a novel weighted SimRank for the rating bipartite graph, SimRate, is proposed to compute similarity between users and to determine the neighbour users. SimRate still works well for very sparse rating data. The experiments show that SimRate has an advantage over state-of-the-art methods.

Li Yu, Zhaoxin Shu, Xiaoping Yang

Logistic Regression for Transductive Transfer Learning from Multiple Sources

Recent years have witnessed increasing interest in transfer learning, and transductive transfer learning from multiple source domains is one of its important topics. In this paper, we address this issue with a new method, TTLRM (Transductive Transfer based on Logistic Regression from Multi-sources), for transductive transfer learning from multiple sources to one target domain. In terms of logistic regression, TTLRM estimates the difference in data distribution between domains to adjust the weights of instances, and then builds a model using these re-weighted data, which is beneficial for adapting to the target domain. Experimental results demonstrate that our method outperforms traditional supervised learning methods and some transfer learning methods.

Yuhong Zhang, Xuegang Hu, Yucheng Fang

Double Table Switch: An Efficient Partitioning Algorithm for Bottom-Up Computation of Data Cubes

Bottom-up computation of data cubes is an efficient approach which is adopted and developed by many other cubing algorithms, such as H-Cubing, Quotient Cube, and Closed Cube. The main cost of bottom-up computation is recursively sorting and partitioning the base table in a wasteful way, where large amounts of auxiliary space are frequently allocated and released. This paper proposes a new partitioning algorithm called Double Table Switch (DTS). It sets up two table spaces in memory at the beginning, and the partitioned results in one table are copied into the other table alternately during the bottom-up computation. DTS thus avoids costly space management and achieves constant memory usage. Further, we improve the DTS algorithm by adjusting the dimension order. The experimental results demonstrate the efficiency of DTS.

Jinguo You, Lianying Jia, Jianhua Hu, Qingsong Huang, Jianqing Xi
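The double-buffer idea behind DTS can be sketched in a few lines: two tables are allocated once, each partitioning step writes its grouped output into the spare table, and the two tables then swap roles, so no per-step table allocation or release occurs. This illustrates the principle only, with invented names, not the paper's algorithm:

```python
def partition_into(src, dst, dim):
    # stable partition of src by dimension `dim`, written into the spare table
    del dst[:]                      # empty the spare table, don't allocate a new one
    groups = {}
    for row in src:
        groups.setdefault(row[dim], []).append(row)
    for key in sorted(groups):
        dst.extend(groups[key])
    return dst, src                 # swap roles: dst becomes the next source

table_a = [('b', 2), ('a', 1), ('b', 3), ('a', 4)]
table_b = []
cur, spare = partition_into(table_a, table_b, dim=0)
print(cur)    # rows grouped by the first dimension
```

Recursive bottom-up cubing would call `partition_into` again with `cur` and `spare` exchanged, keeping memory usage constant across recursion levels.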

IV Data Mining Applications and Systems

Tag Recommendation Based on Bayesian Principle

Social tagging systems have become an increasingly popular way to organize heterogeneous online resources, and tag recommendation is a key feature of such systems. Much work has been done on this hard tag recommendation problem in recent years, with some good results, but, taking into account the complexity of tagging actions, many limitations still exist. In this paper, we propose a probabilistic model for the tag recommendation problem. The model is based on the Bayesian principle, and it is very robust and efficient. To evaluate the proposed method, we conducted experiments on a real dataset extracted from BibSonomy, an online social bookmark and publication sharing system. Our performance study shows that our method achieves good performance compared with classical approaches.

Zhonghui Wang, Zhihong Deng
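A Bayesian view of tag recommendation can be made concrete with a tiny naive-Bayes sketch: score each candidate tag by its prior times the likelihood of the resource's words under that tag, with Laplace smoothing. The toy data and names are invented, and the paper's model is certainly more elaborate; this only illustrates the Bayesian principle.

```python
from collections import Counter, defaultdict
import math

# toy tagging history: (words of the resource, tag the user chose)
history = [({'python', 'code'}, 'programming'),
           ({'java', 'code'}, 'programming'),
           ({'recipe', 'cake'}, 'cooking')]

tag_counts = Counter(tag for _, tag in history)
word_counts = defaultdict(Counter)
for words, tag in history:
    word_counts[tag].update(words)
vocab = {w for words, _ in history for w in words}

def score(tag, words, alpha=1.0):
    # log P(tag) + sum of log P(word | tag), Laplace-smoothed
    total = sum(word_counts[tag].values())
    s = math.log(tag_counts[tag] / len(history))
    for w in words:
        s += math.log((word_counts[tag][w] + alpha) / (total + alpha * len(vocab)))
    return s

def recommend(words):
    # recommend the maximum-a-posteriori tag for a new resource
    return max(tag_counts, key=lambda t: score(t, words))

print(recommend({'code'}))
```

Working in log space keeps the product of small probabilities numerically stable, a standard choice for this kind of scoring.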

Comparison of Different Methods to Fuse Theos Images

With the development of remote sensing, an increasing number of applications such as land cover classification, feature detection, and urban analysis require both high spatial and high spectral resolution. On the other hand, a satellite cannot provide high spatial and high spectral resolution at the same time, because of the incoming radiation energy to the sensor and the data volume collected by the sensor. Image fusion is an effective approach to integrating the disparate and complementary information of multi-source images. As a new remote sensing data source, the recently launched Theos can be widely used in many applications, so the fusion of its high spatial resolution image and its multi-spectral image is important. This paper selects several widely used methods for fusing data of high spatial resolution and high spectral resolution. The result of each approach is evaluated by qualitative and quantitative comparison and analysis.

Silong Zhang, Guojin He

Using Genetic K-Means Algorithm for PCA Regression Data in Customer Churn Prediction

The imbalanced distribution of samples between churners and non-churners can hugely affect churn prediction results in the telecommunication services field. One method to address this is an over-sampling approach based on PCA regression. However, PCA regression may not generate good churn samples if a dataset is nonlinearly discriminant. We employed the Genetic K-means Algorithm to cluster the dataset into locally optimal small datasets to overcome this problem. The experiments were carried out on a real-world telecommunication dataset and assessed on a churn prediction task. They showed that the Genetic K-means Algorithm can improve prediction results for PCA regression and performs as well as SMOTE.

Bingquan Huang, T. Satoh, Y. Huang, M. -T. Kechadi, B. Buckley

Time-Constrained Test Selection for Regression Testing

The strategy of regression test selection is critical for a new version of a software product. Although several strategies have been proposed, one issue remains open: how to select test cases that not only can detect faults with high probability but also can be executed within a limited period of test time. This paper proposes a data-mining approach to select test cases, and a dynamic programming approach to find, from the selected test cases, the optimal test case set that can detect the most faults and meet the testing deadline. The models have been applied to a large financial management system with a history of 11 releases over 5 years.

Lian Yu, Lei Xu, Wei-Tek Tsai

Chinese New Word Detection from Query Logs

Existing works in the literature mostly resort to web pages or other author-centric resources to detect new words, which requires highly complex text processing. This paper exploits visitor-centric resources, specifically query logs from a commercial search engine, to detect new words. Since query logs are generated by search engine users and are naturally segmented, the complex text processing can be avoided. Using dynamic time warping, a new word detection algorithm based on trajectory similarity is proposed to distinguish new words in the query logs. Experiments based on real-world data sets show the effectiveness and efficiency of the proposed algorithm.

Yan Zhang, Maosong Sun, Yang Zhang
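Dynamic time warping, the core of the trajectory comparison above, aligns two sequences of possibly different lengths by dynamic programming. A standard minimal implementation looks like this (a generic textbook version, not the authors' code):

```python
def dtw(a, b):
    # dynamic time warping distance between two numeric sequences,
    # e.g. daily query-frequency trajectories of two terms
    inf = float('inf')
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: insertion, deletion, match
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[-1][-1]

# a trajectory that differs only by a stretched plateau aligns perfectly
print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```

Because DTW tolerates stretching along the time axis, two query terms with similarly shaped but differently paced popularity curves still score as similar, which is what makes it suitable for grouping new-word candidates.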

Exploiting Concept Clumping for Efficient Incremental E-Mail Categorization

We introduce a novel approach to incremental e-mail categorization based on identifying and exploiting “clumps” of messages that are classified similarly. Clumping reflects the local coherence of a classification scheme and is particularly important in a setting where the classification scheme is dynamically changing, such as in e-mail categorization. We propose a number of metrics to quantify the degree of clumping in a series of messages. We then present a number of fast, incremental methods to categorize messages and compare the performance of these methods with measures of the clumping in the datasets to show how clumping is being exploited by these methods. The methods are tested on 7 large real-world e-mail datasets of 7 users from the Enron corpus, where each message is classified into one folder. We show that our methods perform well and provide accuracy comparable to several common machine learning algorithms, but with much greater computational efficiency.

Alfred Krzywicki, Wayne Wobcke

Topic-Based User Segmentation for Online Advertising with Latent Dirichlet Allocation

Behavioral targeting (BT), a technique that delivers the most appropriate advertisements to the most interested users by analyzing user behavior patterns, has gained considerable attention in the online advertising market in recent years. A main task of BT is automatically segmenting web users for ad delivery; good user segmentation may greatly improve the effectiveness of campaigns and increase the ad click-through rate (CTR). Classical user segmentation methods, however, rarely take the semantics of user behaviors into consideration and cannot mine user behavioral patterns as well as should be expected. In this paper, we propose an innovative approach based on the effective semantic analysis algorithm Latent Dirichlet Allocation (LDA) to attack this problem. Comparisons with three baseline algorithms through experiments confirm that the proposed approach can increase the effectiveness of user segmentation significantly.

Songgao Tu, Chaojun Lu

Applying Multi-objective Evolutionary Algorithms to QoS-Aware Web Service Composition

Finding optimal solutions for QoS-aware Web service composition with conflicting objectives and various restrictions on quality metrics is an NP-hard problem. This paper proposes the use of multi-objective evolutionary algorithms (MOEAs for short) for QoS-aware service composition optimisation. More specifically, SPEA2 is introduced to achieve this goal, as the algorithm is good at dealing with multi-objective combinatorial optimisation problems. Experimental results reveal that SPEA2 is able to approach the Pareto-optimal front with a well-spread distribution. The Pareto front approximations provide different trade-offs, from which end-users may select the best one based on their preferences.

Li Li, Peng Cheng, Ling Ou, Zili Zhang

Real-Time Hand Detection and Tracking Using LBP Features

In this paper, a robust real-time method for hand detection and tracking is proposed, based on the AdaBoost learning algorithm and local binary pattern (LBP) features. The hand is detected by a cascade of classifiers with LBP features, and a detailed study was conducted to select the parameters of the hand detection classifiers. When tracking the hand, a region of interest (ROI) is defined based on the hand region detected in the last frame, and, to improve robustness to rotation, an affine transformation is applied to the ROI. The experimental results demonstrate that this method can successfully detect the hand and track it in real time.

Bin Xiao, Xiang-min Xu, Qian-pei Mai

Modeling DNS Activities Based on Probabilistic Latent Semantic Analysis

Traditional Web usage mining techniques aim at discovering usage patterns from Web data at the page level, while little work addresses upper levels. In this paper, we propose a novel approach to characterizing Internet users’ preferences and interests at the domain name level. By summarizing Internet users’ domain name access behaviors as co-occurrences of users and targeted domain names, an aspect model is introduced to classify users and domain names into various groups according to their co-occurrences. Meanwhile, each group is characterized by extracting the properties of characteristic users and domain names. Experimental results on real-world data sets show that our approach is effective and identifies meaningful groups. Thus, our approach could be used for detecting unusual behaviors on the Internet at the domain name level, which can alleviate the work of searching the joint space of users and domain names.

Xuebiao Yuchi, Xiaodong Lee, Jian Jin, Baoping Yan

A New Statistical Approach to DNS Traffic Anomaly Detection

In this paper, we describe a new statistical approach to detecting traffic anomalies in the Domain Name System (DNS). By analyzing real-world DNS traffic data collected at several large DNS servers, both authoritative and local, we find that normal DNS traffic follows Heaps’ law in two ways. We utilize these findings to characterize DNS traffic properties under normal network conditions and, based on these properties, to estimate the forthcoming traffic. If the actual forthcoming traffic deviates significantly from our estimates, we can infer that some anomaly is happening. Our approach is simple and works in real time. Experiments on both real and simulated DNS traffic anomalies show that our approach can effectively detect most common anomalies in DNS traffic.

Xuebiao Yuchi, Xin Wang, Xiaodong Lee, Baoping Yan

Managing Power Conservation in Wireless Networks

A major project is investigating methods for conserving power in wireless networks. A component of this project addresses methods for predicting whether the user demand load in each zone of a network is increasing, decreasing, or approximately constant. These predictions are then fed into the power regulation system. This paper describes a real-time predictive model of network traffic load derived from experiments on real data. It combines a linear-regression-based model and a highly reactive model, applied to real-time data aggregated at two levels of granularity. The model gives excellent performance predictions when applied to network traffic load data.

Kongluan Lin, John Debenham, Simeon Simoff

Using PCA to Predict Customer Churn in Telecommunication Dataset

Failure to identify potential churners significantly affects a company's revenues and the services it can provide. The imbalanced distribution of instances between churners and non-churners and the size of the customer dataset are the main concerns when building a churn prediction model. This paper presents a local PCA classifier approach that avoids these problems by comparing the eigenvalues of the best principal components. The experiments were carried out on a large real-world telecommunication dataset and assessed on a churn prediction task. The experimental results showed that the local PCA classifier generally outperformed Naive Bayes, logistic regression, SVM, and decision tree C4.5 in terms of true churn rate.

T. Sato, B. Q. Huang, Y. Huang, M. -T. Kechadi, B. Buckley

Hierarchical Classification with Dynamic-Threshold SVM Ensemble for Gene Function Prediction

The paper proposes a novel hierarchical classification approach with a dynamic-threshold SVM ensemble. In the training phase, the hierarchical structure is exploited to select suitable positive and negative examples as the training set in order to obtain better SVM classifiers. When predicting an unseen example, it is classified over all label classes in a top-down fashion through the hierarchy. In particular, two strategies are proposed to determine a dynamic prediction threshold for each label class, with the hierarchical structure being utilized again. Experiments on four genomic data sets show that the training-set selection policies outperform two existing ones, and that the two dynamic-threshold strategies achieve better performance than fixed thresholds.

Yiming Chen, Zhoujun Li, Xiaohua Hu, Junwan Liu
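The top-down prediction step can be sketched generically as follows; the per-class scoring functions and thresholds are placeholders (the paper's SVM ensemble and its two dynamic-threshold strategies are not reproduced). Starting at the root, a class and its subtree are explored only if the class's score clears that class's threshold.

```python
def topdown_classify(x, root, classifiers, children, thresholds):
    """Top-down hierarchical prediction: descend from the root,
    keeping a child class only if its classifier score meets
    that class's (possibly dynamic) threshold."""
    predicted = []
    frontier = [root]
    while frontier:
        node = frontier.pop()
        for child in children.get(node, []):
            if classifiers[child](x) >= thresholds[child]:
                predicted.append(child)
                frontier.append(child)  # descend into accepted subtree
    return predicted
```

Subtrees below a rejected class are never scored, which is what makes the top-down scheme cheap on deep hierarchies.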

Personalized Tag Recommendation Based on User Preference and Content

With the wide use of collaborative tagging systems nowadays, users can tag their favorite resources with free keywords, and tag recommendation technology has been developed to help users in the tagging process. However, most tag recommendation methods are based merely on the content of the tagged resource. In this paper, it is argued that tags depend not only on the content of a resource but also on user preference. Accordingly, a hybrid personalized tag recommendation method based on user preference and content is proposed. The experimental results show that the proposed method has advantages over traditional content-based methods.

Zhaoxin Shu, Li Yu, Xiaoping Yang

Predicting Defect Priority Based on Neural Networks

Existing defect management tools provide little information on how important or urgent it is for developers to fix reported defects, and manually prioritizing defects is time-consuming and inconsistent across people. To improve the efficiency of troubleshooting, the paper proposes to employ neural network techniques to predict the priorities of defects, to adopt an evolutionary training process to handle errors associated with new features, and to reuse data sets from similar software systems to speed up the convergence of training. A framework is built for model evaluation, and a series of experiments on five different software products of an international healthcare company demonstrates the feasibility and effectiveness of the approach.

Lian Yu, Wei-Tek Tsai, Wei Zhao, Fang Wu

Personalized Context-Aware QoS Prediction for Web Services Based on Collaborative Filtering

The emergence of abundant Web Services has driven the rapid evolution of the Service-Oriented Architecture (SOA). To help users select and recommend the services appropriate to their needs, both functional and non-functional quality-of-service (QoS) attributes should be taken into account, so the quality of candidate Web Services must be predicted before selection. Collaborative filtering (CF)-based recommendation systems have been introduced to address this problem. However, existing CF approaches generally do not consider context, which is an important factor in both recommender systems and QoS prediction. Motivated by this, the paper proposes a personalized context-aware QoS prediction method for Web Service recommendation based on the Slope One approach. Experimental results demonstrate that the proposed approach provides better QoS prediction.

Qi Xie, Kaigui Wu, Jie Xu, Pan He, Min Chen
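For illustration, a minimal weighted Slope One predictor — the base scheme the paper builds on; the context-aware personalization itself is not shown — looks roughly like this: the predicted QoS value for a target item is the count-weighted combination of the user's known values shifted by average item-to-item differences.

```python
def slope_one_predict(ratings, user, target):
    """Weighted Slope One: predict user's value for target from
    average rating differences to items the user has rated.
    ratings maps user -> {item: value}."""
    num = den = 0.0
    for other, r in ratings[user].items():
        if other == target:
            continue
        # Differences target-other over users who rated both items.
        diffs = [u[target] - u[other] for u in ratings.values()
                 if target in u and other in u]
        if diffs:
            avg = sum(diffs) / len(diffs)
            num += (r + avg) * len(diffs)   # weight by support count
            den += len(diffs)
    return num / den if den else None
```

Items with more co-ratings contribute more to the prediction, which is the "weighted" part of the scheme.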

Hybrid Semantic Analysis System – ATIS Data Evaluation

In this article we present a novel method of semantic parsing. The method addresses two main issues. First, it is designed to be reliable and easy to use: it uses a simple tree-based semantic annotation and it learns from data. Second, it is designed for practical applications by incorporating a method for data formalization into the system. The system uses a novel parser that extends a general probabilistic context-free parser by using context for better probability estimation. The semantic parser was originally developed for Czech data and for written questions. In this article we evaluate the method on a very different domain, the ATIS corpus. The achieved results are very encouraging considering the difficulties associated with the ATIS corpus.

Ivan Habernal, Miloslav Konopík

Click Prediction for Product Search on C2C Web Sites

Millions of dollars in turnover are generated every day on popular e-commerce web sites; in China, more than 30 billion dollars in transactions were generated in the online C2C market in 2009. With the booming of this market, predicting the click probability of search results is crucial for user experience as well as conversion probability. The objective of this paper is to propose a click prediction framework for product search on C2C web sites. Click prediction has been researched in depth for sponsored search; however, few studies have addressed the domain of online product search. We validate the performance of state-of-the-art techniques from sponsored search for predicting click probability on C2C web sites. In addition, significant features are developed based on the characteristics of product search, and a combined model is trained. Extensive experiments demonstrate that the combined model improves both precision and recall significantly.

Xiangzhi Wang, Chunyang Liu, Guirong Xue, Yong Yu

Finding Potential Research Collaborators in Four Degrees of Separation

This paper proposes a methodology for finding potential research collaborators based on a structural approach underlying the co-authorship network and a semantic approach extended from the author-topic model. We propose valuable features for identifying the closeness between researchers in the co-authorship network, and we show that the combination of the structural and semantic approaches works well. Our methodology is able to suggest researchers who appear within four degrees of separation from a given researcher and who have never collaborated with that researcher in past periods. The experimental results are discussed from various aspects, for instance, the top-n retrieved researchers and the researcher's community. The results show that our proposed idea is an applicable method for the collaborator suggestion task.

Paweena Chaiwanarom, Ryutaro Ichise, Chidchanok Lursinsap

Predicting Product Duration for Adaptive Advertisement

Whether or not C2C customers click an advertisement heavily depends on the relevance of the advertisement content and on the customers' search progress. For example, when starting a purchasing task, customers are more likely to click advertisements for their target products, while near the end, advertisements for accessories of the target products may interest them more. Therefore, understanding search progress on target products is very important for improving adaptive advertisement strategies. Search progress can be estimated from the time already spent on the target product and the total time that will be spent on it. To provide this information, we formulate a product duration prediction problem. Owing to the similarities between product duration prediction and user preference prediction (e.g., a large number of users, a history of past behaviors and ratings), the present work relies on collaborative filtering to estimate the search duration of a purchasing task. Comparing neighbor-based, singular value decomposition (SVD) and biased SVD methods, we find that biased SVD is superior to the others.

Zhongqi Guo, Yongqiang Wang, Gui-rong Xue, Yong Yu
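A minimal biased SVD learner of the kind compared in the paper can be sketched as follows; the hyperparameters (factor count k, epochs, learning rate, regularization) are illustrative assumptions, and the model is the standard r̂ = μ + b_u + b_i + p_u·q_i trained by stochastic gradient descent.

```python
import random

def train_biased_svd(ratings, n_users, n_items, k=2, steps=200,
                     lr=0.01, reg=0.02, seed=0):
    """SGD for biased SVD: r_hat = mu + b_u + b_i + p_u . q_i.
    ratings is a list of (user, item, value) triples."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu = [0.0] * n_users
    bi = [0.0] * n_items
    p = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            pred = mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))
            e = r - pred
            bu[u] += lr * (e - reg * bu[u])
            bi[i] += lr * (e - reg * bi[i])
            for f in range(k):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (e * qi - reg * pu)
                q[i][f] += lr * (e * pu - reg * qi)
    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))
    return predict
```

The bias terms absorb per-user and per-item offsets (e.g., habitually slow searchers, inherently time-consuming products), while the latent factors capture the residual interaction.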

An Algorithm for Available Bandwidth Estimation of IPv6 Network

Based on an analysis of the measurement principles for IPv4 network bandwidth, and in combination with the next-generation network protocol IPv6, we put forward a one-way, different-length packet-pair subtraction algorithm for estimating the available bandwidth of IPv6 networks. A prototype system for IPv6 available bandwidth estimation was designed and implemented, using the flow label field of IPv6 packets to control the path taken by the probe packets. The test results show that the algorithm is feasible for IPv6 networks, with an estimation error of less than 0.1 Mbps. The estimates are stable and reflect the real-time correlation between available bandwidth and delay, providing a useful means of network monitoring and performance estimation while effectively reducing network congestion.

Quanjie Qiu, Zhiguo Li, Zhongfu Wu

A Structure-Based XML Storage Method in YAFFS File System

The design of the flash file system YAFFS aims at problems of start-up time, memory consumption, wear leveling, etc. It cannot manage large amounts of complex, structured and semi-structured data, chiefly because YAFFS treats data as a bit stream without semantics and relies on special applications to operate on the inner structure and contents of files, which pushes file control down to an unsuitable level of granularity. We propose a structure-based XML storage method for the YAFFS file system. Experiments on embedded Linux show that, with our method, XML structural information can be stored and managed effectively.

Ji Liu, Shuyu Chen, Haozhang Liu

A Multi-dimensional Trustworthy Behavior Monitoring Method Based on Discriminant Locality Preserving Projections

Trustworthy decision making is a key step in trustworthy computing, and system behavior monitoring is the basis of trustworthy decisions. Traditional anomaly monitoring methods describe a system using a single behavior feature, so they cannot capture the overall status of the system. A new method, called discriminant locality preserving projections (DLPP), is proposed in this paper to monitor multi-dimensional trustworthy behaviors. DLPP combines the idea of Fisher discriminant analysis (FDA) with that of locality preserving projections (LPP). The method is validated by event injection, and the experimental results show that DLPP is correct and effective.

Guanghui Chang, Shuyu Chen, Huawei Lu, Xiaoqin Zhang

NN-SA Based Dynamic Failure Detector for Services Composition in Distributed Environment

A neural network (NN) and a simulated annealing algorithm (SA), combined with an adaptive heartbeat mechanism, are integrated to implement an adaptive failure detector for service composition in a distributed environment. Since simulated annealing has a strong global optimization ability, this article proposes an NN-SA model, which couples simulated annealing with the BP neural network algorithm, to predict heartbeat arrival times dynamically. It overcomes the BP neural network's tendency to fall into local minima. Experimental results show the availability and validity of the failure detector in detail.

Changze Wu, Kaigui Wu, Li Feng, Dong Tian

Two-Fold Spatiotemporal Regression Modeling in Wireless Sensor Networks

Distributed data and the restricted capabilities of sensor nodes make regression difficult in a wireless sensor network. In conventional methods, gradient descent and Nelder-Mead simplex optimization techniques are employed to build the model incrementally over a Hamiltonian path among the nodes. Although Nelder-Mead simplex approaches work better than gradient-based ones, their accuracy should be improved further relative to the central approach, and both suffer from high latency because all the network nodes must be traversed node by node. In this paper, we propose a two-fold distributed cluster-based approach for spatiotemporal regression over sensor networks. First, the regressor of each cluster is obtained, with the spatial and temporal parts learned separately: within a cluster, the nodes collaborate to compute the temporal part, and the cluster head then uses particle swarm optimization to learn the spatial part. Second, the cluster heads collaborate to apply a weighted combination rule distributively to learn the global model. The evaluation and experimental results show that the proposed approach achieves lower latency and greater energy efficiency than its counterparts, while its prediction accuracy remains acceptable in comparison with the central approach.

Hadi Shakibian, Nasrollah Moghadam Charkari

Generating Tags for Service Reviews

This paper proposes an approach to generating tags for service reviews. We extract candidate service aspects from reviews, score candidate opinion words and weight the extracted candidate service aspects. Tags are automatically generated for reviews by combining aspect weights, aspect ratings and aspect opinion words. Experimental results show our approach is effective in extracting, ranking, and rating service aspects.

Suke Li, Jinmei Hao, Zhong Chen

Developing Treatment Plan Support in Outpatient Health Care Delivery with Decision Trees Technique

This paper presents the development of treatment plan support (TPS), with the aim of supporting physicians' treatment decisions during outpatient care. Evidence-based clinical data from the system database were used. The TPS predictive model was generated with the decision tree technique, incorporating the predictor variables patient age, gender, race, marital status, occupation, visit complaint, clinical diagnosis and final diagnosed disease, and the dependent variable treatment by drug, laboratory, imaging and/or procedure. Six common diseases, coded as J02.9, J03.9, J06.9, J30.4, M62.6 and N39.0 in the International Classification of Diseases 10th Revision (ICD-10) by the World Health Organization, were selected as prototypes for this study. The good performance scores in the experimental results indicate that this study can serve as guidance for developing treatment plan support in outpatient health care delivery.

Shahriyah Nyak Saad Ali, Ahmad Mahir Razali, Azuraliza Abu Bakar, Nur Riza Suradi

Factor Analysis of E-business in Skill-Based Strategic Collaboration

Based on a comprehensive analysis of skill-based business collaboration in China and abroad, this paper discusses indicators for e-business in skill-based strategic collaboration using field investigation, the exploratory factor analysis method and structural equation modeling (SEM). We designed a questionnaire that fits China's actual conditions; 1000 copies were distributed randomly and 105 valid copies were returned. Cross validation shows that there are three dimensions in e-business: network technology, network management and network marketing. The structural equation model fits well and shows good reliability and validity of the questionnaire.

Daijiang Chen, Juanjuan Chen

Increasing the Meaningful Use of Electronic Medical Records: A Localized Health Level 7 Clinical Document Architecture System

The health information systems of most medical institutions in China are isolated. Communication across these systems is generally realized through point-to-point interfaces at the database level, which tend to lack interoperability, extensibility, and security. In the present study, we developed localized, document-oriented and data-focused clinical document architecture (CDA) templates based on Health Level 7 (HL7) CDA. Then, by combining these templates with a service-oriented architecture for HL7 middleware, we achieved interoperability across multiple heterogeneous systems. Modules of our system have been put into trial use in six medical institutions, including the Second University Hospital (main and regional campuses) of the School of Medicine, Zhejiang University, and other collaborating hospitals.

Jun Liang, Mei Fang Xu, Lan Juan Li, Sheng Li Yang, Bao Luo Li, De Ren Cheng, Ou Jin, Li Zhong Zhang, Long Wei Yang, Jun Xiang Sun

Corpus-Based Analysis of the Co-occurrence of Chinese Antonym Pairs

Chinese antonym pairs pattern differently from antonyms in English. The opposites in a Chinese antonym pair often co-occur in a sentence, and, in a phenomenon unique to Chinese, most antonym pairs provide a good basis for the formation of four-character phrases. The opposites in a pair can usually be interpolated with an arbitrary number of characters to form new, larger collocations, in such a way that the same number of characters precedes or follows each element of the antonym pair, keeping the new collocation symmetric. All compounds and phrases with Chinese antonym pairs have the expanded form a+X+b+!X, where !X denotes the opposite of X, and a and b are interpolated elements of the same character length. In this work, we characterized the patterning of the interpolated elements, analyzed typical one-character interpolations in canonical Chinese antonym pairs, and identified the patterns involved in the separation and linkage of antonym pairs.

Xingfu Wang, Zhongfu Wu, Yan Li, Qian Huang, Jinglu Hui
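The expanded form a+X+b+!X with equal-length interpolations can be checked mechanically. The helper below is an illustrative sketch, not the authors' extraction pipeline: given a phrase and a known antonym pair (X, !X), it verifies that X precedes !X and that the characters before X and between the pair have the same length, with nothing after !X.

```python
def matches_pattern(phrase, x, not_x):
    """Check whether phrase has the form a + X + b + !X with the
    interpolated parts a and b of equal character length."""
    i = phrase.find(x)
    if i < 0:
        return False
    rest = phrase[i + len(x):]
    j = rest.find(not_x)
    if j < 0:
        return False
    a = phrase[:i]          # characters before X
    b = rest[:j]            # characters between X and !X
    tail = rest[j + len(not_x):]
    return len(a) == len(b) and not tail
```

For example, 声东击西 matches for the pair 东/西 (a=声, b=击), and 大同小异 matches for 同/异, while 东奔西走 does not fit the a+X+b+!X template because nothing precedes 东.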

Application of Decision-Tree Based on Prediction Model for Project Management

In recent years, with the rapid development of China's space program and aerospace technology, massive project management data have been collected. However, given these massive data, there is plenty of room to improve project management standards and to extract inheritable, applicable knowledge from the data. This paper reports a methodology based on data mining techniques to tackle these issues, in order to promote project management standards and technical reform.

Xin-ying Tu, Tao Fu

Management Policies Analysis for Multi-core Shared Caches

To improve the performance and fairness of the last-level cache (LLC) shared among multiple cores, the recent Promotion/Insertion Pseudo-Partitioning (PIPP) policy combines dynamic insertion and promotion in cache management. Building on PIPP, in this work we propose a new Homologous Promotion Insertion Policy (HPIP), which can determine the insertion position when a new core situation occurs and balance the cache resource allocation simultaneously. HPIP relies on the existing cache structure and requires negligible change overhead. In addition, we analyze the Dynamic Insertion Policy (DIP) and maintain that the selection of sampling sets for Set Dueling Monitors (SDM) should be made according to the number of processor cores rather than the running applications. Finally, our experiments with multi-programmed workloads for 2-core and 4-core CMPs on the M5 simulator show that the performance of HPIP approximates that of PIPP while its adaptive capability is enhanced.

Jianjun Du, Yixing Zhang, Zhongfu Wu, Xinwen Wang

Multi-core Architecture Cache Performance Analysis and Optimization Based on Distributed Method

With the rapid growth of computing performance in the multi-core era, the capacity of shared caches has been increasing, and system architects need to make maximum use of shared resources to improve system performance. This paper rebuilds free lists based on page coloring to privatize them via a distributed method, which achieves page-level parallelism at the operating system level and decreases cache thrashing among applications. Experimental results show that, with matrix computation as the workload, the L2 cache miss rate is reduced by about 12% and IPC is increased by 10%.

Kefei Cheng, Kewen Pan, Jun Feng, Yong Bai

The Research on the User Experience of E-Commercial Website Based on User Subdivision

Aiming at the problem that it is difficult to accurately assess the quality of the user experience, we analyzed, based on usability testing, the elements of the user experience during the use of an e-commerce website and divided users into two types (planned users and impulsive users). Furthermore, we constructed a user evaluation system from the three aspects of behavior, cognition and emotion, quantified the indicators, and established a comprehensive user experience evaluation model. Finally, we verified the validity of the model through several cases.

Wei Liu, Lijuan Lv, Daoli Huang, Yan Zhang

An Ontology-Based Framework Model for Trustworthy Software Evolution

In this paper, a framework model for trustworthy software evolution is constructed. It not only solves semantic problems but also guides the dynamic evolution of service composition. First, it adopts an ontology-space method to solve the interaction problem among users, the system and the environment. Then, a set of pre-defined rules is used to evaluate the credibility of software behavior and the necessity of self-adjustment. According to these results, we adjust, reconfigure and revise the software over its life cycle, from rule guidance at the micro level to man-machine cooperation at the macro level. Finally, instances and test results show the proposed framework model to be effective and feasible.

Ji Li, Chunmei Liu, Zhiguo Li

Multi-level Log-Based Relevance Feedback Scheme for Image Retrieval

Relevance feedback has been shown to be a powerful tool for improving the retrieval performance of content-based image retrieval (CBIR). However, the feedback iteration process is tedious and time-consuming. History logs contain valuable information about previous users' perception of image content, and such information can be used to accelerate the feedback iterations and enhance retrieval performance. In this paper, a novel algorithm to collect and compute the log-based relevance of images is proposed. We utilize the multi-level structure of log-based relevance and fully mine previous users' perception of image content recorded in the log. Experimental results show that our algorithm is effective and outperforms previous schemes.

Huanchen Zhang, Weifeng Sun, Shichao Dong, Long Chen, Chuang Lin

A Distributed Node Clustering Mechanism in P2P Networks

A P2P network is an important computing model because of its scalability, adaptability, self-organization, etc. How to organize the nodes in a P2P network effectively is an important research issue, and node clustering aims to provide an effective method for doing so. This paper proposes a distributed node clustering mechanism based on the nodes' queries in P2P networks. Within this mechanism, we propose three algorithms: maintaining node clusters, merging node clusters and splitting node clusters. Theoretical analysis shows that the time and communication complexity of the clustering mechanism is low, and simulation results show that its clustering accuracy is high.

Mo Hai, Shuhang Guo

Exploratory Factor Analysis Approach for Understanding Consumer Behavior toward Using Chongqing City Card

This paper examines attitudes and consumer behavior toward using the Chongqing City Card by surveying 202 respondents with a self-administered questionnaire. Exploratory factor analysis identifies seven major factors, defined as the general, marketing, use cost, technology, utility, convenience and e-commerce dimensions. Some suggestions for better operation of the Chongqing City Card are then provided according to these findings. Limitations and future research directions are also given.

Juanjuan Chen, Chengliang Wang

Backmatter
