Skip to main content
main-content

Über dieses Buch

This book constitutes the thoroughly refereed post-conference proceedings of five international workshops held in conjunction with PAKDD 2011 in Shenzhen, China, in May 2011: the International Workshop on Behavior Informatics (BI 2011), the Workshop on Quality Issues, Measures of Interestingness and Evaluation of Data Mining Models (QIMIE 2011), the Workshop on Biologically Inspired Techniques for Data Mining (BDM 2011), the Workshop on Advances and Issues in Traditional Chinese Medicine Clinical Data Mining (AI-TCM 2011), and the Second Workshop on Data Mining for Healthcare Management (DMGHM 2011). The book also includes papers from the First PAKDD Doctoral Symposium on Data Mining (DSDM 2011). The 42 papers were carefully reviewed and selected from numerous submissions. The papers cover a wide range of topics discussing emerging techniques in the field of knowledge discovery in databases and their application domains extending to previously unexplored areas such as data mining based on optimization techniques from biological behavior of animals and applications in Traditional Chinese Medicine clinical research and health care management.

Inhaltsverzeichnis

Frontmatter

International Workshop on Behavior Informatics (BI 2011)

Frontmatter

Evaluating the Regularity of Human Behavior from Mobile Phone Usage Logs

This paper investigated the relationship between incrementally logged phone logs and self-reported survey data to derive regularity and predictability from mobile phone usage logs. First, we extracted information not from a single value such as location or call logs, but from multivariate contextual logs. Then we considered the changing pattern of the incrementally logged information over time. To evaluate the patterns of human behavior, we applied entropy changes and the duplicated instances ratios from the stream of mobile phone usage logs. By applying the Hidden Markov Models to the patterns, the accumulated log patterns were classified according to the self-reported survey data. This research confirmed that regularity and predictability of human behavior can be evaluated by mobile phone usages.

Hyoungnyoun Kim, Ji-Hyung Park

Explicit and Implicit User Preferences in Online Dating

In this paper we study user behavior in online dating, in particular the differences between the implicit and explicit user preferences. The explicit preferences are stated by the user while the implicit preferences are inferred based on the user behavior on the website. We first show that the explicit preferences are not a good predictor of the success of user interactions. We then propose to learn the implicit preferences from both successful and unsuccessful interactions using a probabilistic machine learning method and show that the learned implicit preferences are a very good predictor of the success of user interactions. We also propose an approach that uses the explicit and implicit preferences to rank the candidates in our recommender system. The results show that the implicit ranking method is significantly more accurate than the explicit and that for a small number of recommendations it is comparable to the performance of the best method that is not based on user preferences.

Joshua Akehurst, Irena Koprinska, Kalina Yacef, Luiz Pizzato, Judy Kay, Tomasz Rej

Blogger-Link-Topic Model for Blog Mining

Blog mining is an important area of behavior informatics because produces effective techniques for analyzing and understanding human behaviors from social media. In this paper, we propose the blogger-link-topic model for blog mining based on the multiple attributes of blog content, bloggers, and links. In addition, we present a unique blog classification framework that computes the normalized document-topic matrix, which is applied our model to retrieve the classification results. After comparing the results for blog classification on real-world blog data, we find that our blogger-link-topic model outperforms the other techniques in terms of overall precision and recall. This demonstrates that additional information contained in blog-specific attributes can help improve blog classification and retrieval results.

Flora S. Tsai

A Random Indexing Approach for Web User Clustering and Web Prefetching

In this paper we present a novel technique to capture Web users’ behaviour based on their interest-oriented actions. In our approach we utilise the vector space model Random Indexing to identify the latent factors or hidden relationships among Web users’ navigational behaviour. Random Indexing is an incremental vector space technique that allows for continuous Web usage mining. User requests are modelled by Random Indexing for individual users’ navigational pattern clustering and common user profile creation. Clustering Web users’ access patterns may capture common user interests and, in turn, build user profiles for advanced Web applications, such as Web caching and prefetching. We present results from the Web user clustering approach through experiments on a real Web log file with promising results. We also apply our data to a prefetching task and compare that with previous approaches. The results show that Random Indexing provides more accurate prefetchings.

Miao Wan, Arne Jönsson, Cong Wang, Lixiang Li, Yixian Yang

Emotional Reactions to Real-World Events in Social Networks

A convergence of emotions among people in social networks is potentially resulted by the occurrence of an unprecedented event in real world. E.g., a majority of bloggers would react

angrily

at the September 11 terrorist attacks. Based on this observation, we introduce a

sentiment index

, computed from the

current mood

tags in a collection of blog posts utilizing an affective lexicon, potentially revealing subtle events discussed in the blogosphere. We then develop a method for extracting events based on this index and its distribution. Our second contribution is establishment of a new bursty structure in text streams termed a

sentiment burst

. We employ a stochastic model to detect bursty periods of moods and the events associated. Our results on a dataset of more than 12 million mood-tagged blog posts over a 4-year period have shown that our sentiment-based bursty events are indeed meaningful, in several ways.

Thin Nguyen, Dinh Phung, Brett Adams, Svetha Venkatesh

Constructing Personal Knowledge Base: Automatic Key-Phrase Extraction from Multiple-Domain Web Pages

In the paper, we proposed a general framework that could automatically extract key-phrases from a collection of web pages concerning a specific topic with the help of The Free Dictionary and then construct a personal knowledge base. Both the base and visual feature in a web page are used to calculate the weight of each candidate phrase. The system extracts top p% key-phrases for each web page based on these two features and then generates a term set using union operators. Next, the system builds the relationships between terms in the term set by referencing The Free Dictionary, and then generates a list of terms sorted by weights. With the top q terms specified by users, a semantic graph can be constructed to present the part of a personal knowledge base, which shows the relationships between terms from the same domain. Finally, the experimental results show that the key-phrases generated by the proposed extractor are with good quality and acceptable for humans.

Yin-Fu Huang, Cin-Siang Ciou

Discovering Valuable User Behavior Patterns in Mobile Commerce Environments

Mining user behavior patterns in mobile environments is an emerging topic in data mining fields with wide applications. By integrating moving paths with purchasing transactions, one can find the sequential purchasing patterns with the moving paths, which are called

mobile sequential patterns

of the mobile users. Mobile sequential patterns can be applied not only for planning mobile commerce environments but also analyzing and managing online shopping websites. However, unit profits and purchased numbers of the items are not considered in traditional framework of mobile sequential pattern mining. Thus, the patterns with high utility (i.e., profit here) cannot be found. In view of this, we aim at integrating mobile data mining with utility mining for finding high utility mobile sequential patterns in this study. A novel algorithm called

UMSP

L

(

high Utility Mobile Sequential Pattern mining by a Level-wised method

) is proposed to efficiently find high utility mobile sequential patterns. The experimental results show that the proposed algorithm has excellent performance under various system conditions.

Bai-En Shie, Hui-Fang Hsiao, Philip S. Yu, Vincent S. Tseng

A Novel Method for Community Detection in Complex Network Using New Representation for Communities

During the recent years, community detection in complex network has become a hot research topic in various research fields including mathematics, physics and biology. Identifying communities in complex networks can help us to understand and exploit the networks more clearly and efficiently. In this paper, we investigate the topological structure of complex networks and propose a novel method for community detection in complex network, which owns several outstanding properties, such as efficiency, robustness, broad applicability and semantic. The method is based on partitioning vertex and degree entropy, which are both proposed in this paper. Partitioning vertex is a novel efficient representation for communities and degree entropy is a new measure for the results of community detection. We apply our method to several large-scale data-sets which are up to millions of edges, and the experimental results show that our method has good performance and can find the community structure hidden in complex networks.

Wang Yiwen, Yao Min

Link Prediction on Evolving Data Using Tensor Factorization

Within the last few years a lot of research has been done on large social and information networks. One of the principal challenges concerning complex networks is link prediction. Most link prediction algorithms are based on the underlying network structure in terms of traditional graph theory. In order to design efficient algorithms for large scale networks, researchers increasingly adapt methods from advanced matrix and tensor computations.

This paper proposes a novel approach of link prediction for complex networks by means of multi-way tensors. In addition to structural data we furthermore consider temporal evolution of a network. Our approach applies the canonical

Parafac

decomposition to reduce tensor dimensionality and to retrieve latent trends.

For the development and evaluation of our proposed link prediction algorithm we employed various popular datasets of online social networks like

Facebook

and

Wikipedia

. Our results show significant improvements for evolutionary networks in terms of prediction accuracy measured through mean average precision.

Stephan Spiegel, Jan Clausen, Sahin Albayrak, Jérôme Kunegis

Permutation Anonymization: Improving Anatomy for Privacy Preservation in Data Publication

Anatomy is a popular technique for privacy preserving in data publication. However, anatomy is fragile under background knowledge attack and can only be applied into limited applications. To overcome these drawbacks, we develop an improved version of anatomy: permutation anonymization, a new anonymization technique that is more effective than anatomy in privacy protection, and meanwhile is able to retain significantly more information in the microdata. We present the detail of the technique and build the underlying theory of the technique. Extensive experiments on real data are conducted, showing that our technique allows highly effective data analysis, while offering strong privacy guarantees.

Xianmang He, Yanghua Xiao, Yujia Li, Qing Wang, Wei Wang, Baile Shi

Efficient Mining Top-k Regular-Frequent Itemset Using Compressed Tidsets

Association rule discovery based on support-confidence framework is an important task in data mining. However, the occurrence frequency (support) of a pattern (itemset) may not be a sufficient criterion for discovering interesting patterns. Temporal regularity, which can be a trace of behavior, with frequency behavior can be revealed as an important key in several applications. A pattern can be regarded as a regular pattern if it occurs regularly in a user-given period. In this paper, we consider the problem of mining top-

k

regular-frequent itemsets from transactional databases without support threshold. A new concise representation, called

compressed transaction-ids set (compressed tidset)

, and a single pass algorithm, called

TR-CT (Top-k Regular frequent itemset mining based on Compressed Tidsets)

, are proposed to maintain occurrence information of patterns and discover

k

regular itemsets with highest supports, respectively. Experimental results show that the use of the compressed tidset representation achieves highly efficiency in terms of execution time and memory consumption, especially on dense datasets.

Komate Amphawan, Philippe Lenca, Athasit Surarerks

A Method of Similarity Measure and Visualization for Long Time Series Using Binary Patterns

Similarity measure and visualization are two of the most interesting tasks in time series data mining and attract much attention in the last decade. Some representations have been proposed to reduce high dimensionality of time series and the corresponding distance functions have been used to measure their similarity. Moreover, visualization techniques are often based on such representations. One of the most popular time series visualization is time series bitmaps using chaos-game algorithm. In this paper, we propose an alternative version of the long time series bitmaps of which the number of the alphabets is not restricted to four. Simultaneously, the corresponding distance function is also proposed to measure the similarity between long time series. Our approach transforms long time series into SAX symbolic strings and constructs a non-sparse matrix which stores the frequency of binary patterns. The matrix can be used to calculate the similarity and visualize the long time series. The experiments demonstrate that our approach not only can measure the long time series as well as the “bag of pattern” (BOP), but also can obtain better visual effects of the long time series visualization than the chaos-game based time series bitmaps (CGB). Especially, the computation cost of pattern matrix construction in our approach is lower than that in CGB.

Hailin Li, Chonghui Guo, Libin Yang

A BIRCH-Based Clustering Method for Large Time Series Databases

This paper presents a novel approach for time series clustering which is based on BIRCH algorithm. Our BIRCH-based approach performs clustering of time series data with a multi-resolution transform used as feature extraction technique. Our approach hinges on the use of cluster feature (CF) tree that helps to resolve the dilemma associated with the choices of initial centers and significantly improves the execution time and clustering quality. Our BIRCH-based approach not only takes full advantages of BIRCH algorithm in the capacity of handling large databases but also can be viewed as a flexible clustering framework in which we can apply any selected clustering algorithm in Phase 3 of the framework. Experimental results show that our proposed approach performs better than k-Means in terms of clustering quality and running time, and better than I-k-Means in terms of clustering quality with nearly the same running time.

Vo Le Quy Nhon, Duong Tuan Anh

Visualizing Cluster Structures and Their Changes over Time by Two-Step Application of Self-Organizing Maps

In this paper, a novel method for visualizing cluster structures and their changes over time is proposed. Clustering is achieved by two-step application of self-organizing maps (SOMs). By two-step application of SOMs, each cluster is assigned an angle and a color. Similar clusters are assigned similar ones. By using colors and angles, cluster structures are visualized in several fashions. In those visualizations, it is easy to identify similar clusters and to see degrees of cluster separations. Thus, we can visually decide whether some clusters should be grouped or separated. Colors and angles are also used to make clusters in multiple datasets from different time periods comparable. Even if they belong to different periods, similar clusters are assigned similar colors and angles, thus it is easy to recognize that which cluster has grown or which one has diminished in time. As an example, the proposed method is applied to a collection of Japanese news articles. Experimental results show that the proposed method can clearly visualize cluster structures and their changes over time, even when multiple datasets from different time periods are concerned.

Masahiro Ishikawa

Analysis of Cluster Migrations Using Self-Organizing Maps

Discovering cluster changes in real-life data is important in many contexts, such as fraud detection and customer attrition analysis. Organizations can use such knowledge of change to adapt business strategies in response to changing circumstances. This paper is aimed at the visual exploration of migrations of cluster entities over time using Self-Organizing Maps. The contribution is a method for analyzing and visualizing entity migration between clusters in two or more snapshot datasets. Existing research on temporal clustering primarily focuses on either time-series clustering, clustering of sequences, or data stream clustering. There is a lack of work on clustering snapshot datasets collected at different points in time. This paper explores cluster changes between such snapshot data. Besides analyzing structural cluster changes, analysts often desire deeper insight into changes at the entity level, such as identifying which attributes changed most significantly in the members of a disappearing cluster. This paper presents a method to visualize migration paths and a framework to rank attributes based on the extent of change among selected entities. The method is evaluated using synthetic and real-life datasets, including data from the World Bank.

Denny, Peter Christen, Graham J. Williams

Quality Issues, Measures of Interestingness and Evaluation of Data Mining Models Workshop (QIMIE 2011)

Frontmatter

ClasSi: Measuring Ranking Quality in the Presence of Object Classes with Similarity Information

The quality of rankings can be evaluated by computing their correlation to an optimal ranking. State of the art ranking correlation coefficients like Kendall’s

τ

and Spearman’s

ρ

do not allow for the user to specify similarities between differing object classes and thus treat the transposition of objects from similar classes the same way as that of objects from dissimilar classes. We propose

ClasSi

, a new ranking correlation coefficient which deals with class label rankings and employs a class distance function to model the similarities between the classes. We also introduce a graphical representation of

ClasSi

akin to the

ROC

curve

which describes how the correlation evolves throughout the ranking.

Anca Maria Ivanescu, Marc Wichterich, Thomas Seidl

The Instance Easiness of Supervised Learning for Cluster Validity

“The statistical problem of testing cluster validity is essentially unsolved” [5]. We translate the issue of gaining credibility on the output of un-supervised learning algorithms to the supervised learning case. We introduce a notion of instance easiness to supervised learning and link the validity of a clustering to how its output constitutes an easy instance for supervised learning. Our notion of instance easiness for supervised learning extends the notion of stability to perturbations (used earlier for measuring clusterability in the un-supervised setting). We follow the axiomatic and generic formulations for cluster-quality measures. As a result, we inform the trust we can place in a clustering result using standard validity methods for supervised learning, like cross validation.

Vladimir Estivill-Castro

A New Efficient and Unbiased Approach for Clustering Quality Evaluation

Traditional quality indexes (Inertia, DB, …) are known to be method-dependent indexes that do not allow to properly estimate the quality of the clustering in several cases, as in that one of complex data, like textual data. We thus propose an alternative approach for clustering quality evaluation based on unsupervised measures of Recall, Precision and F-measure exploiting the descriptors of the data associated with the obtained clusters. Two categories of index are proposed, that are Macro and Micro indexes. This paper also focuses on the construction of a new cumulative Micro precision index that makes it possible to evaluate the overall quality of a clustering result while clearly distinguishing between homogeneous and heterogeneous, or degenerated results. The experimental comparison of the behavior of the classical indexes with our new approach is performed on a polythematic dataset of bibliographical references issued from the PASCAL database.

Jean-Charles Lamirel, Pascal Cuxac, Raghvendra Mall, Ghada Safi

A Structure Preserving Flat Data Format Representation for Tree-Structured Data

Mining of semi-structured data such as XML is a popular research topic due to many useful applications. The initial work focused mainly on values associated with tags, while most of recent developments focus on discovering association rules among tree structured data objects to preserve the structural information. Other data mining techniques have had limited use in tree-structured data analysis as they were mainly designed to process flat data format with no need to capture the structural properties of data objects. This paper proposes a novel structure-preserving way for representing tree-structured document instances as records in a standard flat data structure to enable applicability of a wider range of data analysis techniques. The experiments using synthetic and real world data demonstrate the effectiveness of the proposed approach.

Fedja Hadzic

A Fusion of Algorithms in Near Duplicate Document Detection

With the rapid development of the World Wide Web, there are a huge number of fully or fragmentally duplicated pages in the Internet. Return of these near duplicated results to the users greatly affects user experiences. In the process of deploying digital libraries, the protection of intellectual property and removal of duplicate contents needs to be considered. This paper fuses some “state of the art” algorithms to reach a better performance. We first introduce the three major algorithms (shingling, I-match, simhash) in duplicate document detection and their developments in the following days. We take sequences of words (shingles) as the feature of simhash algorithm. We then import the random lexicons based multi fingerprints generation method into shingling base simhash algorithm and named it shingling based multi fingerprints simhash algorithm. We did some preliminary experiments on the synthetic dataset based on the “China-US Million Book Digital Library Project”. The experiment result proves the efficiency of these algorithms.

Jun Fan, Tiejun Huang

Searching Interesting Association Rules Based on Evolutionary Computation

In this paper, we propose an evolutionary method to search interesting association rules. Most of the association rule mining methods give a large number of rules, and it is difficult for human beings to deal with them. We study this problem by borrowing the style of a search engine, that is, searching association rules by keywords. Whether a rule is interesting or not is decided by its relation to the keywords, and we introduce both semantic and statistical methods to measure such relation. The mining process is built on an evolutionary approach, Genetic Network Programming (GNP). Different from the conventional GNP based association rule mining method, the proposed method pays more attention to generate the GNP individuals carefully, which will mine interesting association rules efficiently.

Guangfei Yang, Yanzhong Dang, Shingo Mabu, Kaoru Shimada, Kotaro Hirasawa

An Efficient Approach to Mine Periodic-Frequent Patterns in Transactional Databases

Recently, temporal occurrences of the frequent patterns in a transactional database has been exploited as an interestingness criterion to discover a class of user-interest-based frequent patterns, called periodic-frequent patterns. Informally, a frequent pattern is said to be periodic-frequent if it occurs at regular intervals specified by the user throughout the database. The basic model of periodic-frequent patterns is based on the notion of “single constraints.” The use of this model to mine periodic-frequent patterns containing both frequent and rare items leads to a dilemma called the “rare item problem.” To confront the problem, an alternative model based on the notion of “multiple constraints” has been proposed in the literature. The periodic-frequent patterns discovered with this model do not satisfy

downward closure property

. As a result, it is computationally expensive to mine periodic-frequent patterns with the model. Furthermore, it has been observed that this model still generates some uninteresting patterns as periodic-frequent patterns. With this motivation, we propose an efficient model based on the notion of “multiple constraints.” The periodic-frequent patterns discovered with this model satisfy

downward closure property

. Hence, periodic-frequent patterns can be efficiently discovered. A pattern-growth algorithm has also been discussed for the proposed model. Experimental results show that the proposed model is effective.

Akshat Surana, R. Uday Kiran, P. Krishna Reddy

Algorithms to Discover Complete Frequent Episodes in Sequences

Serial episode is a type of temporal frequent pattern in sequence data. In this paper we compare the performance of serial episode discovering algorithms. Many different algorithms have been proposed to discover different types of episodes for different applications. However, it is unclear which algorithm is more efficient for discovering different types of episodes. We compare Minepi and WinMiner which discover serial episodes defined by minimal occurrence of subsequence. We find Minepi cannot discover all minimal occurrences of serial episodes as the literature, which proposed it, claimed. We also propose an algorithm Ap-epi to discover minimal occurrences of serial episode, which is a complement of Minepi. We propose an algorithm NOE-WinMiner which discovers non-overlapping episodes and compare it with an existing algorithm. Extensive experiments demonstrate that Ap-epi outperforms Minepi(fixed) when the minimum support is large and NOE-WinMiner beats the existing algorithm which discovers non-overlapping episodes with constraints between the two adjacent events.

Jianjun Wu, Li Wan, Zeren Xu

Certainty upon Empirical Distributions

We address the problem of assessing the information conveyed by a finite discrete probability distribution, within the context of knowledge discovery. Our approach is based on two main axiomatic intuitions: (i) the minimum information is given in the case of a uniform distribution, and (ii) knowledge is akin to a notion of richness, related to the dimension of the distribution. From this perspective, we define a statistic that has a clear interpretation in terms of a

measure of certainty

, and we build up a plausible hypothesis, which offers a comprehensible insight of knowledge, with a consistent algebraic structure. This includes a native value for the uncertainty related to unseen events. Our approach is then faced up with entropy based measures. Finally, by implementing our measure in a decision tree induction algorithm, we show an empirical validation of the behavior of our measure with respect to entropy. Our conclusion is that the contributions of our measure are significant, and should definitely lead to more robust models.

Joan Garriga

Workshop on Biologically Inspired Techniques for Data Mining (BDM 2011)

Frontmatter

A Measure Oriented Training Scheme for Imbalanced Classification Problems

Since the overall prediction error of a classifier on imbalanced problems can be potentially misleading and biased, it is commonly evaluated by measures such as G-mean and ROC (Receiver Operating Characteristic) curves. However, for many classifiers, the learning process is still largely driven by error based objective functions. As a result, there is clearly a gap between the measure according to which the classifier is to be evaluated and how the classifier is trained. This paper investigates the possibility of directly using the measure itself to search the hypothesis space to improve the performance of classifiers. Experimental results on three standard benchmark problems and a real-world problem show that the proposed method is effective in comparison with commonly used sampling techniques.

Bo Yuan, Wenhuang Liu

An SVM-Based Approach to Discover MicroRNA Precursors in Plant Genomes

MicroRNAs (miRNAs) are noncoding RNAs of ~22 nucleotides that play versatile regulatory roles in multicelluler organisms. Since the cloning methods for miRNAs identification are biased towards abundant miRNAs, the computational approaches provide useful complements to identify miRNAs which are highly constrained by tissue- and time-specifically expression manners. In this paper, we propose a novel Support Vector Machine (SVM) based detector, named MiR-PD, to identify pre-miRNAs in plants. The classifier is constructed based on twelve features of pre-miRNAs, inclusive of five global features and seven sub-structure features. Trained on 790 plant pre-miRNAs and 7,900 pseudo pre-miRNAs, MiR-PD achieves 96.43% five-fold cross-validation accuracy. Tested on the newly identified 441 plant pre-miRNAs and 62,883 pseudo pre-miRNAs, MiR-PD reports an accuracy of 99.71% with 77.55% sensitivity and 99.87% specificity, suggesting a feasible genome-wide application of this miRNAs detector so as to identify novel miRNAs (especially for those species-specific miRNAs) in plants without relying on phylogenetical conservation.

Yi Wang, Cheqing Jin, Minqi Zhou, Aoying Zhou

Towards Recommender System Using Particle Swarm Optimization Based Web Usage Clustering

Efficiency and quality of the product of data mining process is a challenging question for the researchers. Different methods have been proposed in the literature to tackle these problems. Optimization based methods are a way to address this issue. We addressed the problem of data clustering by implementing swarm intelligence based optimization technique called Particle Swarm Optimization (PSO). We scaled the approach to implement it in a hierarchical way using Hierarchical Particle Swarm (HPSO) clustering. The paper also aims to outline our novel outlier detection technique. The research will lead us to provide a benchmark for web usage mining and propose a collective intelligence based recommender system for the usage of Java API documentation.

Shafiq Alam, Gillian Dobbie, Patricia Riddle

Weighted Association Rule Mining Using Particle Swarm Optimization

Association rule mining is an important data mining task that discovers relationships among items in a transaction database. Most approaches to association rule mining assume that the items within the dataset have a uniform distribution. Therefore, weighted association rule mining (WARM) was introduced to provide a notion of importance to individual items. In previous work most of these approaches require users to assign weights for each item. This is infeasible when we have millions of items in a dataset. In this paper we propose a novel method,

Weighted Association Rule Mining using Particle Swarm Optimization (WARM SWARM)

, which uses particle swarm optimization to assign meaningful item weights for association rule mining.

Russel Pears, Yun Sing Koh

An Unsupervised Feature Selection Framework Based on Clustering

Feature selection plays an important part in improving the quality of learning algorithms in machine learning and data mining. It has been widely studied in supervised learning, whereas it is still relatively rare researched in unsupervised learning. In this work, a clustering-based framework formed by an unsupervised feature selection algorithm is proposed. The proposed framework is mainly concerned with the problem of determining and choosing important features, which are selected by ranking the features according to the importance measure scores, from the original feature set without class information. Theory analyzed indicates that the time complexity of each algorithm is nearly linear with the size and the number of features of dataset. Experimental results on UCI datasets show that algorithm with different scores in the framework are able to identify the important features with clustering, and the proposed algorithm have obtained competitive results in terms of classification error rate and the degree of dimensionality reduction when compared with the state-of-the-art supervised and unsupervised feature selection approaches.

Sheng-yi Jiang, Lian-xi Wang

Workshop on Advances and Issues in Traditional Chinese Medicine Clinical Data Mining (AI-TCM 2011)

Frontmatter

Discovery of Regularities in the Use of Herbs in Traditional Chinese Medicine Prescriptions

Traditional Chinese medicine (TCM) is a discipline with its own distinct methodologies and philosophical principles. The main method of treatment in TCM is to use herb prescriptions. Typically, a number of herbs are combined to form a formula and different formulae are prescribed for different patients. Regularities on the mixture of herbs in the prescriptions are important for both clinical treatment and novel patent medicine development. In this study, we analyze TCM formula data using latent tree (LT) models. Interesting regularities are discovered. Those regularities are of interest to students of TCM as well as pharmaceutical companies that manufacture medicine using Chinese herbs.

Nevin L. Zhang, Runsun Zhang, Tao Chen

COW: A Co-evolving Memetic Wrapper for Herb-Herb Interaction Analysis in TCM Informatics

Traditional Chinese Medicine (TCM) relies heavily on interactions between herbs within prescribed formulae. However, given the combinatorial explosion due to the vast number of herbs available for treatment, the study of herb-herb interactions by pure human analysis is impractical, with computeraided analysis computationally expensive. Thus feature selection is crucial as a pre-processing step prior to herb-herb interaction analysis. In accord with this goal, a new feature selection algorithm known as a Co-evolving Memetic Wrapper (COW) is proposed: COW takes advantage of recent developments in genetic algorithms (GAs) and memetic algorithms (MAs), evolving appropriate feature subsets for a given domain. As part of preliminary research, COW is demonstrated to be effective in selecting herbs in the TCM insomnia datatset. Finally, possible future applications of COW are examined, both within TCM research and in broader data mining contexts.

Dion Detterer, Paul Kwan

Selecting an Appropriate Interestingness Measure to Evaluate the Correlation between Syndrome Elements and Symptoms

In order to select the best interestingness measure appropriate for evaluating the correlation between syndrome elements and symptoms, 60 objective interestingness measures were selected from different subjects. Firstly, a hypothesis for a good measure was proposed. Based on the hypothesis, an experiment was designed to evaluate the measures. The experiment was based on the clinical record database of past dynasties including 51,186 clinical cases. The selected dataset in this study had 44,600 records. Han and Re were selected as the experimental syndrome elements. Three indicators calculated according to the distances between two syndrome elements were obtained in the experiment and were combined into one indicator. The

Z score, φ-coefficient

and

Kappa

were selected from 60 measures after the experiment. The

Z score

and

φ- coefficient

were selected according to subjective interestingness. Finally, the

φ- coefficient

was selected as the best measure for its low computational complexity. The method introduced in this paper may be used in other similar territories. Further research of traditional Chinese medicine can be made based on the conclusion made in this paper.

Lei Zhang, Qi-ming Zhang, Yi-guo Wang, Dong-lin Yu

The Impact of Feature Representation to the Biclustering of Symptoms-Herbs in TCM

Traditional Chinese Medicine (TCM) is a holistic approach to medical treatment. Analysis and decision cannot be made in isolation, hence, the extraction of symptoms-herbs relationship is a crucial step to the research of the underlying TCM principle. Since this kind of relationship bears a lot of similarity with the gene-expression study in the microarray analysis, where the use of biclustering algorithms is common, it is logical to apply biclustering algorithms to the study of symptom-herb relationship. However, the choice of feature representation is a dominant factor in the success of any machine learning problem. This paper aims to understand the impact of different representation schemes in the biclustering of symptoms-herbs relationship. A bicluster is not helpful if the number of features is too large or too small. In order to get a desirable size for the biclusters, modified relative success ratio is considered to be the most appropriate one among the other four schemes. Some of the biclusters (using modified relative success ratio) do follow the therapeutic principle of TCM, while some biclusters with interesting feature combination that are worthwhile for clinical evaluation.

Simon Poon, Zhe Luo, Runshun Zhang

Second Workshop on Data Mining for Healthcare Management (DMHM 2011)

Frontmatter

Usage of Mobile Phones for Personalized Healthcare Solutions

One of the greatest hurdles in providing the appropriate healthcare is the availability of proper information at the point of individual’s care. Mobile phone-based health solutions can bridge this gap and can support with right information at the right time. In order to overcome some of the prevalent issues and hence provide the necessary healthcare, we here introduce one such mobile based system known as the Personalized Mobile Health Service System for Individual’s Healthcare which caters to the specific needs of the user without the constraint on mobility. This system helps in guiding the user with regard to the food they consume, the precautionary measures to be taken in case of any ailments, and when they travel to a new location etc. This system also proves to be supportive in situations when the user is in a traumatic condition suffering alone. The main advantage of the system is that it will keep updating the details to the user on a regular basis.

M. Saravanan, S. Shanthi, S. Shalini

Robust Learning of Mixture Models and Its Application on Trial Pruning for EEG Signal Analysis

This paper presents a novel method based on deterministic annealing to circumvent the problem of the sensitivity to atypical observations associated with the maximum likelihood (ML) estimator via conventional EM algorithm for mixture models. In order to learn the mixture models in a robust way, the parameters of mixture model are estimated by trimmed likelihood estimator (TLE), and the learning process is controlled by temperature based on the principle of maximum entropy. Moreover, we apply the proposed method to the single-trial electroencephalography (EEG) classification task. The motivation of this work is to eliminate the negative effects of artifacts in EEG data, which usually exist in real-life environments, and the experimental results demonstrate that the proposed method can successfully detect the outliers and therefore achieve more reliable result.

Boyu Wang, Feng Wan, Peng Un Mak, Pui In Mak, Mang I Vai

An Integrated Approach to Multi-criteria-Based Health Care Facility Location Planning

Optimal location of health care facilities is critical to the success of health care services. Given its importance, this is an active research topic in health informatics, operational research and GIS. This paper presents an integrated approach to health care facility planning whereby the methods from three research topics are combined. The integrated approach is applied in order to solve preventive health care facility location planning problems. In this approach, a new health accessibility estimation method is developed in order to capture the current characteristics of preventive health care services. Based on this, the preventive health care facility location planning problem is formalized as a multi-criteria facility location model. A new algorithm is proposed in order to solve the model. Experiments on synthetic datasets and on the Alberta breast cancer screening program data are conducted and the results support our analysis.

Wei Gu, Baijie Wang, Xin Wang

Medicinal Property Knowledge Extraction from Herbal Documents for Supporting Question Answering System

The aim of this paper is to automatically extract the medicinal properties of an object, especially an herb, from technical documents as knowledge sources for health-care problem solving through the question-answering system, especially What-Question, for disease treatment. The extracted medicinal property knowledge is based on multiple simple sentence or EDUs (Elementary Discourse Units). There are three problems of extracting the medicinal property knowledge: the herbal object identification problem, the medicinal property identification problem for each object and the medicinal property boundary determination problem. We propose using NLP (Natural Language Processing) with statistical based approach to identify the medicinal property and also with machine learning technique as Naïve Bayes with verb features for solving the boundary problem. The result shows successfully the medicinal property extraction of the precision and recall of 86% and 77%, respectively, along with 87% correctness of the boundary determination.

Chaveevan Pechsiri, Sumran Painuall, Uraiwan Janviriyasopak

First PAKDD Doctoral Symposium on Data Mining (DSDM 2011)

Frontmatter

Age Estimation Using Bayesian Process

Age problems have attracted many researchers’ attentions in recent years since they have many potential applications in human-computer interaction and other areas. Among all the age problems, automatic age estimation is one interesting problem and many methods have been proposed to solve this problem. In this paper, we use two Bayesian process regression algorithms, Gaussian process and

t

process, for age estimation. Different from previous regression methods on age estimation, which need to specify the form of regression functions or determine many parameters in regression functions in inefficient ways such as cross validation, in our methods, the form of regression function is implicitly defined by kernel function and almost all the parameters of our methods can be learnt from data automatically using efficient gradient methods. Moreover, our methods are very simple and easy to implement. Since Gaussian process is easy to be affected by outlier data points,

t

process can be viewed as a robust version of Gaussian process to solve this problem. Experiments on one public aging database FG-NET show our method is effective and comparable with the state-of-the-art methods on age estimation.

Yu Zhang

Significant Node Identification in Social Networks

Given a social network, identifying significant nodes from the network is highly desirable in many applications. In different networks formed by diverse kinds of social connections, the definitions of what are significant nodes differ with circumstances. In the literature, most previous works generally focus on expertise finding in specific social networks. In this paper, we aim to propose a general node ranking model that can be adopted to satisfy a variety of service demands. We devise an unsupervised learning method that produces the ranking list of top-k significant nodes. The characteristic of this method is that it can generate different ranking lists when diverse sets of features are considered. To demonstrate the real application of the proposed method, we design the system DblpNET that is an author ranking system based on the co-author network of DBLP computer science bibliography. We discuss further extensions and evaluate DblpNET empirically on the public DBLP dataset. The evaluation results show that the proposed method can effectively apply to real-world applications.

Chi-Yao Tseng, Ming-Syan Chen

Improving Bagging Performance through Multi-algorithm Ensembles

Bagging establishes a committee of classifiers first and then aggregates their outcomes through majority voting. Bagging has attracted considerable research interest and been applied in various application domains. Its advantages include an increased capability of handling small data sets, less sensitivity to noise or outliers, and a parallel structure for efficient implementations. However, it has been found to be less accurate than some other ensemble methods. In this paper, we propose an approach that improves bagging through the employment of multiple classification algorithms in ensembles. Our approach preserves the parallel structure of bagging and improves the accuracy of bagging. As a result, it unlocks the power and expands the user base of bagging.

Kuo-Wei Hsu, Jaideep Srivastava

Mining Tourist Preferences with Twice-Learning

Data mining techniques have been recognized as powerful tools for predictive modeling tourist decision-making process. However, two practical yet important problems have not been resolved by the data miners in empirical tourism research. Firstly, comprehensibility-the role of the data mining should not only generate accurate predictions, but also provide insights why certain prediction is made. But most widely used data mining methods that can generalize well are black-box in nature and can provide little information on the tourist decision-making facts. Secondly, the lack of training samples-it is usually rather difficult to collect enough training samples through surveying the tourist on site, especially for surveying the tourist’s decision-making facts. Many data mining methods may not achieve satisfactory performance if learned on small data set. In this paper, we show that these two problems can be addressed simultaneously using a twice-learning framework on the travel preference data. The results indicate that by addressing these two problems properly, we can predict tourist preferences accurately as well as extracting meaningful insights which would be useful for tourism marketing.

Chen Zhang, Jie Zhang

Towards Cost-Sensitive Learning for Real-World Applications

Many research work in cost-sensitive learning focused on binary class problems and assumed that the costs are precise. But real-world applications often have multiple classes and the costs cannot be obtained precisely. It is important to address these issues for cost-sensitive learning to be more useful for real-world applications. This paper gives a short introduction to cost-sensitive learning and then summaries some of our previous work related to the above two issues: (1) The analysis of why traditional Rescaling method fails to solve multi-class problems and our method Rescale

new

. (2) The problem of learning with cost intervals and our CISVM method. (3) The problem of learning with cost distributions and our CODIS method.

Xu-Ying Liu, Zhi-Hua Zhou

Backmatter

Weitere Informationen

Premium Partner

    Bildnachweise