2016 | Book

Trends and Applications in Knowledge Discovery and Data Mining

PAKDD 2016 Workshops, BDM, MLSDA, PACC, WDMBF, Auckland, New Zealand, April 19, 2016, Revised Selected Papers

About this book

This book constitutes the thoroughly refereed post-workshop proceedings of the PAKDD 2016 Workshops, held in conjunction with PAKDD, the 20th Pacific-Asia Conference on Knowledge Discovery and Data Mining, in Auckland, New Zealand, in April 2016.

The 23 revised papers presented were carefully reviewed and selected from 38 submissions. The workshops affiliated with PAKDD 2016 include: Biologically Inspired Data Mining Techniques, BDM; Machine Learning for Sensory Data Analysis, MLSDA; Predictive Analytics for Critical Care, PACC; as well as Data Mining in Business and Finance, WDMBF.

Table of contents

Frontmatter
Erratum to: Normalized Cross-Match: Pattern Discovery Algorithm from Biofeedback Signals
Xueyuan Gong, Simon Fong, Yain-Whar Si, Robert P. Biuk-Aghai, Raymond K. Wong, Athanasios V. Vasilakos

Biologically Inspired Data Mining Techniques (BDM)

Frontmatter
Towards a New Evolutionary Subsampling Technique for Heuristic Optimisation of Load Disaggregators
Abstract
In this paper we present some preliminary work towards the development of a new evolutionary subsampling technique for solving the non-intrusive load monitoring (NILM) problem. The NILM problem concerns using predictive algorithms to analyse whole-house energy usage measurements, so that individual appliance energy usages can be disaggregated. The motivation is to educate home owners about their energy usage. However, by their very nature, the datasets used in this research are massively imbalanced in their target value distributions. Consequently standard machine learning techniques, which often rely on optimising for root mean squared error (RMSE), typically fail. We therefore propose the target-weighted RMSE (TW-RMSE) metric as an alternative fitness function for optimising load disaggregators, and show in a simple initial study in which random search is utilised that TW-RMSE is a metric that can be optimised, and therefore has the potential to be included in a larger evolutionary subsampling-based solution to this problem.
Michael Mayo, Sara Omranian
Neural Choice by Elimination via Highway Networks
Abstract
We introduce Neural Choice by Elimination, a new framework that integrates deep neural networks into probabilistic sequential choice models for learning to rank. Given a set of items to choose from, the elimination strategy starts with the whole item set and iteratively eliminates the least worthy item in the remaining subset. We prove that choice by elimination is equivalent to marginalizing out the random Gompertz latent utilities. Coupled with the choice model is the recently introduced Neural Highway Networks for approximating arbitrarily complex rank functions. We evaluate the proposed framework on a large-scale public dataset with over 425K items, drawn from the Yahoo! learning to rank challenge. It is demonstrated that the proposed method is competitive against state-of-the-art learning to rank methods.
Truyen Tran, Dinh Phung, Svetha Venkatesh
Attribute Selection and Classification of Prostate Cancer Gene Expression Data Using Artificial Neural Networks
Abstract
Artificial Intelligence (AI) approaches for medical diagnosis and prediction of cancer are important and ever growing areas of research. Artificial Neural Networks (ANNs) are one such approach that has been successfully applied in these areas. Various types of clinical datasets have been used in intelligent decision making systems for medical diagnosis, especially of cancer, for over three decades. However, gene expression datasets are complex, with large numbers of attributes, which makes classification and prediction more difficult for AI approaches. The Prostate Cancer dataset is one such dataset, with 12600 attributes and only 102 samples. In this paper, we propose an extended ANN based approach for classification and prediction of prostate cancer using gene expression data. Firstly, we use four attribute selection approaches, namely Sequential Floating Forward Selection (SFFS), RELIEFF, Sequential Backward Feature Selection (SFBS) and Significant Attribute Evaluation (SAE), to identify the most influential attributes among the 12600. We use ANNs and Naive Bayes for classification with the complete set of attributes as well as with various sets obtained from the attribute selection methods. Experimental results show that ANNs outperformed Naive Bayes, achieving a classification accuracy of 98.2 % compared to 62.74 % with the full set of attributes. Further, with 21 selected attributes obtained with SFFS, ANNs achieved better accuracy (100 %) for classification compared to Naive Bayes. For prediction using ANNs, SFFS was able to achieve the best results, with 92.31 % accuracy, by correctly predicting 24 out of 26 samples provided for independent sample testing. Moreover, some of the genes selected by SFFS are identified as having a direct reference to cancer and tumour. Our results indicate that a combination of standard feature selection methods in conjunction with ANNs provides the most impressive results.
Sreenivas Sremath Tirumala, A. Narayanan
An Improved Self-Structuring Neural Network
Abstract
Creating a neural network based classification model is traditionally accomplished using the trial and error technique. However, the trial and error structuring method normally suffers from several difficulties, including overtraining. In this article, a new algorithm that simplifies structuring neural network classification models is proposed. It aims at creating a large structure to derive classifiers from the training dataset that have generally good predictive accuracy on domain applications. The proposed algorithm tunes crucial NN model thresholds during the training phase in order to cope with the dynamic behavior of the learning process. This may reduce the chance of overfitting the training dataset or early convergence of the model. Several experiments using our algorithm, as well as other classification algorithms, have been conducted against a number of datasets from the University of California Irvine (UCI) repository. The experiments are performed to assess the pros and cons of our proposed NN method. The derived results show that our algorithm outperformed the compared classification algorithms with respect to several performance measures.
Rami M. Mohammad, Fadi Thabtah, Lee McCluskey
Imbalanced ELM Based on Normal Density Estimation for Binary-Class Classification
Abstract
The imbalanced Extreme Learning Machine based on kernel density estimation (imELM-kde) is a recent classification algorithm for handling imbalanced binary-class classification. By adjusting the real outputs of the training data with the intersection point of two probability density functions (p.d.f.s) corresponding to the predictive outputs of the majority and minority classes, imELM-kde updates the ELM that is trained on the original training data and thus improves the performance of the ELM-based imbalanced classifier. In this paper, we analyze the shortcomings of imELM-kde and then propose an improved version of it. The Parzen window method used in imELM-kde leads to multiple intersection points between the p.d.f.s of the majority and minority classes. In addition, it is unreasonable to update the real outputs with the intersection point, because the p.d.f.s are estimated from the predictive outputs. Thus, in order to address the shortcomings of imELM-kde, an imbalanced ELM based on normal density estimation (imELM-nde) is proposed in this paper. In imELM-nde, the p.d.f.s of the predictive outputs corresponding to the majority and minority classes are computed with normal density estimation, and the intersection point is used to update the predictive outputs instead of the real outputs. This makes the training of a probability density estimation-based imbalanced ELM simpler and more feasible. The comparative results show that our proposed imELM-nde performs better than unweighted ELM and imELM-kde on the imbalanced binary-class classification problem.
Yulin He, Rana Aamir Raza Ashfaq, Joshua Zhexue Huang, Xizhao Wang
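The intersection point mentioned in the imELM-nde abstract above can be illustrated with a small sketch. It assumes two univariate normal densities fitted to the predictive outputs of each class and solves for the points where they are equal. The function names and the rule of preferring the root that lies between the two means are illustrative assumptions, not taken from the paper.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_intersections(mu1, s1, mu2, s2):
    """Return the sorted x values where the two normal densities are equal.

    Equating the log-densities gives a quadratic a*x^2 + b*x + c = 0.
    """
    a = 1.0 / (2.0 * s2 ** 2) - 1.0 / (2.0 * s1 ** 2)
    b = mu1 / s1 ** 2 - mu2 / s2 ** 2
    c = mu2 ** 2 / (2.0 * s2 ** 2) - mu1 ** 2 / (2.0 * s1 ** 2) + math.log(s2 / s1)
    if abs(a) < 1e-12:
        # Equal variances: the quadratic degenerates to a linear equation.
        if abs(b) < 1e-12:
            return []  # identical densities, no isolated intersection
        return [-c / b]
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return []  # guard against floating-point round-off
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2.0 * a), (-b + root) / (2.0 * a)])


def threshold_between_means(mu1, s1, mu2, s2):
    """Hypothetical helper: pick the intersection lying between the means, if any."""
    lo, hi = min(mu1, mu2), max(mu1, mu2)
    for x in normal_intersections(mu1, s1, mu2, s2):
        if lo <= x <= hi:
            return x
    return None


if __name__ == "__main__":
    # Made-up class statistics, for illustration only.
    print(threshold_between_means(0.0, 1.0, 3.0, 2.0))
```

With equal variances there is a single intersection at the midpoint of the means; with unequal variances two normal densities cross twice, so some selection rule such as the one assumed here is needed.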
Multiple Seeds Based Evolutionary Algorithm for Mining Boolean Association Rules
Abstract
Most association rule mining algorithms use a single seed for initializing a population, without paying attention to the effectiveness of the initial population in evolutionary learning. Recently, researchers have shown that the initial population has significant effects on producing good solutions over several generations of a genetic algorithm. Single seed based genetic algorithms raise two significant challenges for real world applications: (1) the solutions of a genetic algorithm vary, since different seeds generate different initial populations, and (2) it is hard to define an effective seed for a specific application. To avoid these problems, in this paper we propose a new multiple seeds based genetic algorithm (MSGA), which generates multiple seeds from different domains of a solution space to discover high quality rules from a large data set. This approach introduces an m-domain model and an m-seeds selection process, through which the whole solution space is subdivided into m domains of the same size and a seed is selected from each domain. By using these seeds, this method generates an effective initial population to perform evolutionary learning of the fitness value of each rule. As a result, this method obtains strong searching efficiency at the beginning of the evolution and achieves fast convergence along with the evolution. MSGA is tested with different mutation and crossover operators for mining interesting Boolean association rules from different real world data sets, and its results are compared with those of different single seed based genetic algorithms.
Mir Md. Jahangir Kabir, Shuxiang Xu, Byeong Ho Kang, Zongyuan Zhao
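A minimal sketch of the m-domain idea described in the MSGA abstract above. It assumes the solution space can be indexed by the integers 0 to space_size - 1 and that one seed is drawn uniformly at random inside each domain. The abstract does not say how a seed is chosen within a domain, so that rule, the function name select_seeds, and its parameters are all hypothetical.

```python
import random


def select_seeds(space_size, m, rng):
    """Split range(space_size) into m near-equal domains and pick one seed per domain.

    rng is a random.Random instance so that runs can be reproduced.
    """
    if m <= 0 or m > space_size:
        raise ValueError("m must satisfy 1 <= m <= space_size")
    seeds = []
    for k in range(m):
        lo = k * space_size // m          # first index of domain k
        hi = (k + 1) * space_size // m    # one past the last index of domain k
        seeds.append(rng.randrange(lo, hi))  # assumption: uniform pick inside the domain
    return seeds


if __name__ == "__main__":
    # Illustrative values only: a space of 1000 candidate solutions and m = 4.
    print(select_seeds(1000, 4, random.Random(0)))
```

Each returned seed could then initialise part of the population, so that the initial individuals are spread over the whole space rather than clustered around one starting point.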

Machine Learning for Sensory Data Analysis (MLSDA)

Frontmatter
Predicting Phone Usage Behaviors with Sensory Data Using a Hierarchical Generative Model
Abstract
Using a sizable set of sensory data and related usage records on Android devices, we are able to give a reasonable prediction of three important aspects of phone usage: messages, phone calls and cellular data. We solve the problem via an estimation of a user’s daily routine, on which we can train a hierarchical generative model of phone usage in all time slots of a day. The model generates phone usage behaviors in terms of three kinds of data: the state of user-phone interaction, the occurrence times of an activity and the duration of the activity in each occurrence. We apply the model to a dataset with 107 frequent users, and find that the prediction error of the generative model is the smallest when compared with several baseline methods. In addition, CDF curves illustrate the availability of the generative model for most users through the distribution of prediction error over all test cases. We also explore the effects of time slots in a day, as well as the size of the training and test sets. The results suggest several interesting directions for further research.
Chuankai An, Dan Rockmore
Comparative Evaluation of Action Recognition Methods via Riemannian Manifolds, Fisher Vectors and GMMs: Ideal and Challenging Conditions
Abstract
We present a comparative evaluation of various techniques for action recognition while keeping as many variables as possible controlled. We employ two categories of Riemannian manifolds: symmetric positive definite matrices and linear subspaces. For both categories we use their corresponding nearest neighbour classifiers, kernels, and recent kernelised sparse representations. We compare against traditional action recognition techniques based on Gaussian mixture models and Fisher vectors (FVs). We evaluate these action recognition techniques under ideal conditions, as well as their sensitivity in more challenging conditions (variations in scale and translation). Despite recent advancements for handling manifolds, manifold based techniques obtain the lowest performance and their kernel representations are more unstable in the presence of challenging conditions. The FV approach obtains the highest accuracy under ideal conditions. Moreover, FV best deals with moderate scale and translation changes.
Johanna Carvajal, Arnold Wiliem, Chris McCool, Brian Lovell, Conrad Sanderson
Rigidly Self-Expressive Sparse Subspace Clustering
Abstract
Sparse subspace clustering is a well-known algorithm that is widely used in many research fields, and considerable effort has been devoted to improving it. In this paper, we propose a novel approach to obtain the coefficient matrix. Compared with traditional sparse subspace clustering (SSC) approaches, the key advantage of our approach is that it provides a new perspective on the self-expressive property. We call it the rigidly self-expressive (RSE) property. This new formulation captures the rigidly self-expressive property of the data points in the same subspace, and provides a new formulation for sparse subspace clustering. Extensions to traditional SSC could also be combined with this new formulation. We present a first-order algorithm to solve the nonconvex optimization, and further prove that it converges to a KKT point of the nonconvex problem under certain standard assumptions. Extensive experiments on the Extended Yale B dataset, the USPS digital images dataset, and the Columbia Object Image Library show that, for images with up to 30 % missing pixels, the clustering quality achieved by our approach outperforms the original SSC.
Linbo Qiao, Bofeng Zhang, Yipin Sun, Jinshu Su
Joint Recognition and Segmentation of Actions via Probabilistic Integration of Spatio-Temporal Fisher Vectors
Abstract
We propose a hierarchical approach to multi-action recognition that performs joint classification and segmentation. A given video (containing several consecutive actions) is processed via a sequence of overlapping temporal windows. Each frame in a temporal window is represented through selective low-level spatio-temporal features which efficiently capture relevant local dynamics. Features from each window are represented as a Fisher vector, which captures first and second order statistics. Instead of directly classifying each Fisher vector, it is converted into a vector of class probabilities. The final classification decision for each frame is then obtained by integrating the class probabilities at the frame level, which exploits the overlapping of the temporal windows. Experiments were performed on two datasets: s-KTH (a stitched version of the KTH dataset to simulate multi-actions), and the challenging CMU-MMAC dataset. On s-KTH, the proposed approach achieves an accuracy of 85.0 %, significantly outperforming two recent approaches based on GMMs and HMMs which obtained 78.3 % and 71.2 %, respectively. On CMU-MMAC, the proposed approach achieves an accuracy of 40.9 %, outperforming the GMM and HMM approaches which obtained 33.7 % and 38.4 %, respectively. Furthermore, the proposed system is on average 40 times faster than the GMM based approach.
Johanna Carvajal, Chris McCool, Brian Lovell, Conrad Sanderson
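The frame-level integration step described in the abstract above can be sketched as follows. The sketch assumes that each overlapping window already has a vector of class probabilities and that a frame's score for a class is the average over all windows covering that frame. The abstract does not state the exact combination rule, so averaging, the function name, and the parameters are illustrative assumptions.

```python
def integrate_window_probabilities(window_probs, window_len, stride, n_frames):
    """Assign a class label to each frame from overlapping window probabilities.

    window_probs[w] is the class-probability vector of window w, which is
    assumed to cover frames [w * stride, w * stride + window_len).
    """
    n_classes = len(window_probs[0])
    sums = [[0.0] * n_classes for _ in range(n_frames)]
    counts = [0] * n_frames
    for w, probs in enumerate(window_probs):
        start = w * stride
        for t in range(start, min(start + window_len, n_frames)):
            for c, p in enumerate(probs):
                sums[t][c] += p
            counts[t] += 1
    labels = []
    for t in range(n_frames):
        if counts[t] == 0:
            labels.append(None)  # frame not covered by any window
            continue
        averaged = [s / counts[t] for s in sums[t]]
        labels.append(max(range(n_classes), key=averaged.__getitem__))
    return labels


if __name__ == "__main__":
    # Made-up probabilities for three windows of length 4 with stride 2.
    probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
    print(integrate_window_probabilities(probs, window_len=4, stride=2, n_frames=8))
```

Because neighbouring windows share frames, a frame near an action boundary is decided by several windows at once, which is what allows classification and segmentation to be performed jointly.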
Learning Multi-faceted Activities from Heterogeneous Data with the Product Space Hierarchical Dirichlet Processes
Abstract
The hierarchical Dirichlet process (HDP) was originally designed for, and evaluated on, a single data channel. In this paper we enhanced its ability to model heterogeneous data using a richer structure for the base measure, namely a product space. The enhanced model, called Product Space HDP (PS-HDP), can (1) simultaneously model heterogeneous data from multiple sources in a Bayesian nonparametric framework and (2) discover multilevel latent structures from data, resulting in different types of topics/latent structures that can be explained jointly. We experimented with the MDC dataset, a large real-world dataset collected from mobile phones. Our goal was to discover identity–location–time (a.k.a. who-where-when) patterns at different levels (globally for all groups and locally for each group). We provided an analysis of the activities and patterns learned by our model, which were visualized and compared and contrasted with the ground truth to demonstrate the merit of the proposed framework. We further quantitatively evaluated and reported its performance using standard metrics including F1-score, NMI, RI, and purity. We also compared the performance of the PS-HDP model with that of popular existing clustering methods (including K-Means, NNMF, GMM, DP-Means, and AP). Lastly, we demonstrate the ability of the model to learn activities with missing data, a common problem encountered in pervasive and ubiquitous computing applications.
Thanh-Binh Nguyen, Vu Nguyen, Svetha Venkatesh, Dinh Phung
Phishing Detection on Twitter Streams
Abstract
With the prevalence of cutting-edge technology, the social media network is gaining popularity and is becoming a worldwide phenomenon. Twitter is one of the most widely used social media sites, with over 500 million users all around the world. Along with its rapidly growing number of users, it has also attracted unwanted users such as scammers, spammers and phishers. Research has already been conducted to prevent such issues using network or contextual features with supervised learning. However, these methods are not robust to changes, such as temporal changes or changes in phishing trends. Current techniques also use additional network information. However, these techniques cannot be used before spammers form a particular number of user relationships. We propose an unsupervised technique that detects phishing in Twitter using a 2-phase unsupervised learning algorithm called PDT (Phishing Detector for Twitter). From the experiments we show that our technique has high accuracy ranging between 0.88 and 0.99.
Se Yeong Jeong, Yun Sing Koh, Gillian Dobbie
Image Segmentation with Superpixel Based Covariance Descriptor
Abstract
This paper investigates the problem of image segmentation using superpixels. We propose two approaches to enhance the discriminative ability of the superpixels’ covariance descriptors. In the first one, we employ the Log-Euclidean distance as the metric on the covariance manifolds, and then use the RBF kernel to measure the similarities between covariance descriptors. The second method focuses on extracting the subspace structure of the set of covariance descriptors by extending a low rank representation algorithm onto the covariance manifolds. Experiments are carried out with the Berkeley Segmentation Dataset, and, compared with state-of-the-art segmentation algorithms, both methods are competitive.
Xianbin Gu, Martin Purvis
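The first approach in the abstract above combines two standard ingredients, which the following sketch illustrates with NumPy: the Log-Euclidean distance between symmetric positive definite (SPD) matrices, d(A, B) = ||log(A) - log(B)||_F, and an RBF kernel exp(-gamma * d^2) built on it. The function names and the gamma value are illustrative assumptions, and how the paper computes the covariance descriptors per superpixel is not shown.

```python
import numpy as np


def spd_log(m):
    """Matrix logarithm of a symmetric positive definite matrix via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(m)
    # V * diag(log(w)) * V^T; broadcasting scales column j of V by log(w_j).
    return (eigvecs * np.log(eigvals)) @ eigvecs.T


def log_euclidean_distance(a, b):
    """Frobenius norm of the difference of the matrix logarithms."""
    return float(np.linalg.norm(spd_log(a) - spd_log(b), ord="fro"))


def rbf_kernel(a, b, gamma=1.0):
    """RBF similarity between two SPD matrices under the Log-Euclidean distance."""
    d = log_euclidean_distance(a, b)
    return float(np.exp(-gamma * d * d))


if __name__ == "__main__":
    # Two small made-up covariance descriptors, for illustration only.
    cov_a = np.array([[2.0, 0.3], [0.3, 1.0]])
    cov_b = np.array([[1.5, 0.1], [0.1, 1.2]])
    print(log_euclidean_distance(cov_a, cov_b), rbf_kernel(cov_a, cov_b, gamma=0.5))
```

Working in the matrix-log domain maps SPD matrices to a flat space of symmetric matrices, so an ordinary Euclidean-style kernel can be applied there.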

Predictive Analytics for Critical Care (PACC)

Frontmatter
Normalized Cross-Match: Pattern Discovery Algorithm from Biofeedback Signals
Abstract
Biofeedback signals are important elements in critical care applications, such as monitoring the ECG data of a patient, discovering patterns in large amounts of ECG data, detecting outliers in ECG data, etc. Because the signal data update continuously and sampling rates may differ, time-series data streams are harder to handle than traditional historical time-series data. For the pattern discovery problem on time-series streams, Toyoda proposed the CrossMatch (CM) approach to discover the patterns between two time-series data streams (sequences), which requires only O(n) time per data update, where n is the length of one sequence. CM, however, does not support normalization, which is required for some kinds of sequences (e.g. EEG data, ECG data). Therefore, we propose a normalized-CrossMatch approach (NCM) that extends CM to enforce normalization while maintaining the same performance capabilities.
Xueyuan Gong, Simon Fong, Yain-Whar Si, Robert P. Biuk-Aghai, Raymond K. Wong, Athanasios V. Vasilakos
Event Prediction in Healthcare Analytics: Beyond Prediction Accuracy
Abstract
In recent years, the United States healthcare industry has been under unprecedented pressure to improve outcomes and reduce cost. Many healthcare organizations are leveraging healthcare analytics, especially predictive analytics, in moving towards these goals and bringing better value to patients. While many existing event prediction models provide helpful predictions in terms of accuracy, their use is typically limited to prioritizing individual patients for care management at fixed time points. In this paper we explore Enhanced Modeling approaches around two important aspects: (1) model interpretability; (2) flexible prediction window. Better interpretability of the model will guide us towards more effective intervention design. A flexible prediction window can provide a higher resolution picture of patients’ risks of adverse events over time, and thereby enable timely interventions. We illustrate interpretation and insights from our Bayesian Hierarchical Model for readmission prediction, and demonstrate a flexible prediction window with a Random Survival Forests model for prediction of future emergency department visits.
Lina Fu, Faming Li, Jing Zhou, Xuejin Wen, Jinhui Yao, Michael Shepherd
Clinical Decision Support for Stroke Using Multi–view Learning Based Models for NIHSS Scores
Abstract
Cerebral stroke is a leading cause of physical disability and death in the world. The severity of a stroke is assessed by a neurological examination using a scale known as the NIH stroke scale (NIHSS). As a measure of stroke severity, the NIHSS score is widely adopted and has been found to also be useful in outcome prediction, rehabilitation planning and treatment planning. In many applications, such as in patient triage in under–resourced primary health care centres and in automated clinical decision support tools, it would be valuable to obtain the severity of stroke with minimal human intervention using simple parameters like age, past conditions and blood investigations. In this paper we propose a new model for predicting NIHSS scores which, to our knowledge, is the first statistical model for stroke severity. Our multi–view learning approach can handle data from heterogeneous sources with mixed data distributions (binary, categorical and numerical) and is robust against missing values – strengths that many other modeling techniques lack. In our experiments we achieve better predictive accuracy than other commonly used methods.
Vaibhav Rajan, Sakyajit Bhattacharya, Ranjan Shetty, Amith Sitaram, G. Vivek

Data Mining in Business and Finance (WDMBF)

Frontmatter
A Music Recommendation System Based on Acoustic Features and User Personalities
Abstract
Music recommendation attracts great attention from music providers seeking to improve their services as the volume of new music grows quickly. It is a great challenge for users to find songs that interest them in such large collections. In previous studies, common strategies can be categorized into content-based music recommendation and collaborative music filtering. Content-based recommendation systems predict users’ preferences in terms of the music content. Collaborative filtering systems predict users’ ratings based on the preferences of the target user’s friends. In this study, we propose a hybrid approach to provide personalized music recommendations. This is achieved by extracting audio features of songs and integrating these features with user personalities for context-aware recommendation using state-of-the-art support vector machines (SVM). Our experiments show the effectiveness of the proposed approach for personalized music recommendation.
Rui Cheng, Boyang Tang
A Social Spam Detection Framework via Semi-supervised Learning
Abstract
With the increasing popularity of social networking websites such as Twitter, Facebook, Sina Weibo and MySpace, spammers on them are getting more and more rampant. Social spammers create masses of compromised or fake accounts to deceive users and lead them to access malicious websites which contain illegal, pornographic or dangerous information. As we all know, most of the studies on social spam detection are based on supervised machine learning, which requires plenty of annotated datasets. Unfortunately, labeling a large number of datasets manually is a complex, error-prone and tedious task which may cost a lot of human effort and time. In this paper, we propose a novel semi-supervised classification framework for social spam detection, which combines co-training with k-medoids. First we utilize the k-medoids clustering algorithm to acquire some informative and representative samples for labelling as our initial seed set. Then we take advantage of the content features and behavior features of users for our co-training classification framework. In order to illustrate the effectiveness of k-medoids, we compare its performance with a random selection strategy. Finally, we evaluate the effectiveness of our proposed detection framework compared with several classical supervised algorithms.
Xianchao Zhang, Haijun Bai, Wenxin Liang
A Hierarchical Beta Process Approach for Financial Time Series Trend Prediction
Abstract
An automatic stock market categorization system would be invaluable to investors and financial experts, providing them with the opportunity to predict a stock’s price changes with respect to other stocks. In recent years, clustering all companies in the stock markets based on their similarities in the shape of the stock market has become increasingly popular. However, existing approaches may not be practical because stock price data are high-dimensional and changes in the stock price usually occur with a shift, which makes the categorization more complex. In this paper, a hierarchical beta process (HBP) based approach is proposed for stock market trend prediction. Preliminary results show that the approach is promising and outperforms other popular approaches.
Mojgan Ghanavati, Raymond K. Wong, Fang Chen, Yang Wang, Joe Lee
Efficient Iris Image Segmentation for ATM Based Approach Through Fuzzy Entropy and Graph Cut
Abstract
In order to realize accurate personal identification at ATMs, an efficient iris image segmentation approach based on fuzzy 4-partition entropy and graph cut is presented, which not only suppresses noise in the segmentation results but also shortens the running time. In this paper, an iterative calculation scheme is presented for reducing redundant computations in the fuzzy 4-partition entropy evaluation. The presented algorithm then uses the probabilities of 4 fuzzy events to define the costs of 4 label assignments (iris, pupil, background and eyelash) for each region in the graph cut. The final segmentation result is computed using graph cut, which produces a smooth segmentation result and suppresses noise. The experimental results demonstrate that the presented iterative calculation scheme can greatly reduce the running time. Quantitative evaluations over 20 classic iris images also show that our algorithm outperforms existing iris image segmentation approaches.
Shibai Yin, Yibin Wang, Tao Wang
Matching Product Offers of E-Shops
Abstract
E-commerce is a continuously growing and competitive market. There are several motivations for e-shoppers, sellers and manufacturers to require an automated approach for matching product offers from various online sources referring to the same or a similar real-world product. Currently, there are several approaches for the assignment of identical and similar product offers. These existing approaches are not sufficient for performing a precise comparison as they only return a similarity value for two compared products but do not give any information for further calculations and analyses. The contribution of this paper is a novel approach and an algorithm for matching identical and very similar product offers based on the pairwise comparison of the product names. For this purpose the approach uses different similarity values which are based on an existing string similarity measure. The approach is independent from a specific product domain or data source.
Andrea Horch, Holger Kett, Anette Weisbecker
Keystroke Biometric Recognition on Chinese Long Text Input
Abstract
Keystroke biometrics are useful in distinguishing legitimate users from perpetrators in online activities. Most previous keystroke studies focus on short text; however, short text keystroke analysis can only be used in limited scenarios, such as user name and password input, and provides only one-time authentication. In this paper, we concentrate on how to detect whether the current user is the legitimate one during the whole activity, such as writing an e-mail or chatting online. We developed a Java applet to collect raw data, and then extracted features and constructed 4 classifiers. In the experiment, we required 30 users to choose a topic randomly and then type a text of about 400 Chinese characters on it. This experiment was repeated 9 times on different days under the same typing environment. The accuracy of the different methods ranges from 94.07 % to 98.15 %, the FAR reaches 0.74 % and the FRR 1.15 %. In summary, Chinese free long text keystroke biometric recognition can be used to authenticate users during the whole online activity with satisfactory precision.
Xiaodong Li, Jiafen Liu
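The FAR and FRR figures quoted in the abstract above are standard biometric error rates: the false acceptance rate is the fraction of impostor attempts that are accepted, and the false rejection rate is the fraction of genuine attempts that are rejected. A minimal sketch of that arithmetic follows, with a hypothetical function name and made-up decisions that are not the paper's data.

```python
def far_frr(genuine_decisions, impostor_decisions):
    """Compute (FAR, FRR) from accept/reject decisions.

    Each list holds booleans, where True means the classifier accepted the attempt.
    """
    if not genuine_decisions or not impostor_decisions:
        raise ValueError("both decision lists must be non-empty")
    false_rejects = sum(1 for accepted in genuine_decisions if not accepted)
    false_accepts = sum(1 for accepted in impostor_decisions if accepted)
    far = false_accepts / len(impostor_decisions)
    frr = false_rejects / len(genuine_decisions)
    return far, frr


if __name__ == "__main__":
    # Illustrative decisions only: 10 genuine attempts and 20 impostor attempts.
    genuine = [True] * 9 + [False]
    impostor = [False] * 19 + [True]
    far, frr = far_frr(genuine, impostor)
    print(f"FAR = {far:.2%}, FRR = {frr:.2%}")
```

The two rates trade off against each other as the acceptance threshold of a classifier moves, which is why both are usually reported alongside accuracy.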
Recommendation Algorithm Design in a Land Exchange Platform
Abstract
In China the majority of farmlands are small pieces, which should be circulated and aggregated to a larger scale to pave the way for modern farms. A platform needs to be built to connect small landowners with potential new farmers or investors. This paper proposes an efficient recommendation algorithm that takes both the spatial attributes and other properties of farmland pieces into consideration and produces the best selection for the intended potential new farmers or investors with customized objective functions.
Xubin Luo, Jiang Duan
Backmatter
Metadata
Title
Trends and Applications in Knowledge Discovery and Data Mining
Edited by
Huiping Cao
Jinyan Li
Ruili Wang
Copyright year
2016
Electronic ISBN
978-3-319-42996-0
Print ISBN
978-3-319-42995-3
DOI
https://doi.org/10.1007/978-3-319-42996-0