2017 | Book

Advanced Data Mining and Applications

13th International Conference, ADMA 2017, Singapore, November 5–6, 2017, Proceedings

Edited by: Gao Cong, Wen-Chih Peng, Wei Emma Zhang, Chengliang Li, Dr. Aixin Sun

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science

About this book

This book constitutes the refereed proceedings of the 13th International Conference on Advanced Data Mining and Applications, ADMA 2017, held in Singapore in November 2017.

The 20 full and 38 short papers presented in this volume were carefully reviewed and selected from 118 submissions. The papers were organized in topical sections named: database and distributed machine learning; recommender system; social network and social media; machine learning; classification and clustering methods; behavior modeling and user profiling; bioinformatics and medical data analysis; spatio-temporal data; natural language processing and text mining; data mining applications; applications; and demos.

Table of Contents

Frontmatter

Database and Distributed Machine Learning

Frontmatter
Querying and Mining Strings Made Easy

With the advent of large string datasets in several scientific and business applications, there is a growing need to perform ad-hoc analysis on strings. Currently, strings are stored, managed, and queried using procedural code. This limits users to the operations supported by existing procedural applications and requires manual query planning with limited tuning opportunities. This paper presents StarQL, a generic and declarative query language for strings. StarQL is based on a native string data model that allows StarQL to support a large variety of string operations and provide semantic-based query optimization. String analytic queries are too intricate to be solved on one machine. Therefore, we propose a scalable and efficient data structure that allows StarQL implementations to handle large sets of strings and utilize large computing infrastructures. Our evaluation shows that StarQL is able to express workloads of application-specific tools, such as BLAST and KAT in bioinformatics, and to mine Wikipedia text for interesting patterns using declarative queries. Furthermore, the StarQL query optimizer shows an order of magnitude reduction in query execution time.

Majed Sahli, Essam Mansour, Panos Kalnis
Distributed Training Large-Scale Deep Architectures

Scale of data and scale of computation infrastructures together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this paper, we focus on employing the system approach to speed up large-scale training. Taking both the algorithmic and system aspects into consideration, we develop a procedure for setting mini-batch size and choosing computation algorithms. We also derive lemmas for determining the quantity of key components such as the number of GPUs and parameter servers. Experiments and examples show that these guidelines help effectively speed up large-scale deep learning training.

Shang-Xuan Zou, Chun-Yen Chen, Jui-Lin Wu, Chun-Nan Chou, Chia-Chin Tsao, Kuan-Chieh Tung, Ting-Wei Lin, Cheng-Lung Sung, Edward Y. Chang
Fault Detection and Localization in Distributed Systems Using Recurrent Convolutional Neural Networks

Early detection of faults is essential to maintaining the reliability of a distributed system. While there are many solutions for detecting faults, handling the high dimensionality and uncertainty of system observations to make accurate detections remains a challenge. In this paper, we address this challenge with a two-dimensional convolutional neural network in the form of a denoising autoencoder with recurrent neural networks that performs simultaneous fault detection and diagnosis based on real-time system metrics from a given distributed system (e.g., CPU usage, memory consumption). The model provides a unified way to automatically learn useful features and make adaptive inferences regarding the onset of faults without hand-crafted feature extraction and human diagnostic expertise. In addition, we develop a Bayesian change-point detection approach for fault localization, in order to support the fault recovery process. We conducted extensive experiments in a real distributed environment over Amazon EC2, and the results demonstrate that our proposal outperforms a variety of state-of-the-art machine learning algorithms that are used for fault detection and diagnosis in distributed systems.

Guangyang Qi, Lina Yao, Anton V. Uzunov
Discovering Group Skylines with Constraints by Early Candidate Pruning

The skyline query has been an important issue in the database community. Many applications nowadays, such as fantasy sports, request the skyline after grouping tuples, so the group skyline problem has become a research focus. Most previous algorithms intended to quickly sift through the numerous combinations but failed to address the problem of constraints. In practice, nearly all groupings are specified with constraints, which demands solutions for the constrained group skyline. In this paper, we propose an algorithm called CGSky to efficiently solve the problem. CGSky utilizes a pre-processing method to exclude unnecessary tuples and generates candidate groups incrementally. A pruning mechanism is devised in the algorithm to exclude non-qualifying candidates from the skyline computation. Our experimental results show that CGSky improves on previous algorithms by an order of magnitude on average. They also show that CGSky has good scale-up capability on different data distributions.

Ming-Yen Lin, Yueh-Lin Lin, Sue-Chen Hsueh
Comparing MapReduce-Based k-NN Similarity Joins on Hadoop for High-Dimensional Data

Similarity joins represent a useful operator for data mining, data analysis and data exploration applications. With the exponential growth of data to be analyzed, distributed approaches like MapReduce are required. So far, the state-of-the-art similarity join approaches based on MapReduce mainly focused on the processing of vector data with less than one hundred dimensions. In this paper, we revisit and investigate the performance of different MapReduce-based approximate k-NN similarity join approaches on Apache Hadoop for large volumes of high-dimensional vector data.

Přemysl Čech, Jakub Maroušek, Jakub Lokoč, Yasin N. Silva, Jeremy Starks
A Higher-Fidelity Frugal Quantile Estimator

The estimation of quantiles is pertinent when mining data streams. However, the complexity of quantile estimation is much higher than the corresponding estimation of the mean and variance, and this increased complexity becomes more relevant as the size of the data increases. Clearly, in the context of “infinite” data streams, a computational and space complexity that is linear in the size of the data is definitely not affordable. In order to alleviate the problem complexity, a very limited number of recent studies have devised incremental quantile estimators [7, 12]. Estimators within this class resort to updating the quantile estimates based on the most recent observation(s), and this yields updating schemes with a very small computational footprint – a constant-time (i.e., O(1)) complexity. In this article, we pursue this research direction and present an estimator that we refer to as a Higher-Fidelity Frugal [7] quantile estimator. Firstly, it guarantees a substantial advancement of the family of Frugal estimators introduced in [7]. The highlight of the present scheme is that it works in the discretized space, and it is thus a pioneering algorithm within the theory of discretized algorithms. (The fact that discretized Learning Automata schemes are superior to their continuous counterparts has been clearly demonstrated in the literature. This is the first paper, to our knowledge, that proves the advantages of discretization within the domain of quantile estimation.) Comprehensive simulation results show that our estimator outperforms the original Frugal algorithm in terms of accuracy.
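
For reference, a minimal Python sketch of the original Frugal-1U estimator from [7], which the Higher-Fidelity scheme improves upon (this is the baseline algorithm, not the paper's discretized variant):

```python
import random

def frugal_1u(stream, q, m0=0.0):
    """Frugal-1U streaming estimate of the q-quantile: keep a single
    value m and nudge it by one unit per observation, giving O(1)
    memory and O(1) time per item."""
    m = m0
    for s in stream:
        r = random.random()
        if s > m and r < q:          # move up with probability q
            m += 1
        elif s < m and r < 1 - q:    # move down with probability 1 - q
            m -= 1
    return m

# Example: estimate the median of a noisy stream.
stream = (random.gauss(100, 15) for _ in range(100_000))
print(frugal_1u(stream, q=0.5))  # drifts towards ~100
```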

Anis Yazidi, Hugo Lewi Hammer, B. John Oommen

Recommender System

Frontmatter
Fair Recommendations Through Diversity Promotion

We address the problem of overspecialization in streaming platform recommender systems. The personalization of web pages by delivering content to users is a challenging task in data mining, and it has been shown that besides optimizing relevance accuracy, such systems should also rely on other factors such as diversity or novelty. In this paper we focus on modeling users’ boundary area of interest by selecting the most diverse items they liked in the past. We apply diversification while building the top-N list of recommendations, selecting the items we recommend from an area where we consider a user will find items different from what she or he liked in the past. We evaluate our approach in an offline analysis on two datasets, showing that our approach brings diversity and is competitive against a state-of-the-art implicit method.
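
As an illustration of the diversification idea, here is a sketch of one standard way to pick the most diverse items a user liked: greedy max-min (farthest-point) selection over item feature vectors. The paper's exact boundary-area model is not reproduced here.

```python
import numpy as np

def most_diverse_liked(item_vecs, k):
    """Greedily pick k mutually diverse items: at each step take the
    item farthest (max-min distance) from everything chosen so far."""
    chosen = [0]                              # seed with an arbitrary liked item
    while len(chosen) < k:
        dists = np.min(
            [np.linalg.norm(item_vecs - item_vecs[c], axis=1) for c in chosen],
            axis=0)
        dists[chosen] = -1.0                  # never re-pick a chosen item
        chosen.append(int(np.argmax(dists)))
    return chosen

liked = np.random.rand(50, 8)                 # 50 liked items, 8 latent features
print(most_diverse_liked(liked, k=5))
```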

Pierre-René Lhérisson, Fabrice Muhlenbach, Pierre Maret
A Hierarchical Bayesian Factorization Model for Implicit and Explicit Feedback Data

Matrix factorization (MF) is one of the most efficient methods for performing collaborative filtering. An MF-based method represents users and items by latent feature vectors that are obtained by decomposing the user-item rating matrix. However, MF-based methods suffer from the cold-start problem: if no rating data are available for an item, the model cannot find a latent feature vector for that item, and thus cannot make recommendations for it. In this paper, we present a hierarchical Bayesian model that can infer the latent feature vectors of items directly from implicit feedback (e.g., clicks, views, purchases) when they cannot be obtained from the rating data. We infer the full posterior distributions of these parameters using a Gibbs sampling method. We show that the proposed method is robust to overfitting even if the model is very complex or the data are very sparse. Our experiments on real-world datasets demonstrate that our proposed method significantly outperforms competing methods on rating prediction tasks, especially for very sparse datasets.

ThaiBinh Nguyen, Atsuhiro Takasu
Empirical Analysis of Factors Influencing Twitter Hashtag Recommendation on Detected Communities

Due to the limited length of tweets, hashtags are often used by users in their tweets. Thus, hashtag recommendation is highly desirable to help Twitter users find useful hashtags when they type in tweets. However, many factors may affect the effectiveness of hashtag recommendation, including social relationships, textual information and user profiling based on hashtag preference. In this paper, we aim to analyse the effect of these factors on hashtag recommendation over detected communities in Twitter. In detail, we seek answers to two questions: What is the most significant factor in recommending hashtags in the context of detected communities? How do different community detection algorithms and the size of the communities affect the performance of hashtag recommendation? To answer these questions, we detect communities using two algorithms: Breadth First Search (BFS) and Clique Percolation Method (CPM). On the randomly detected communities, we investigate the quality and the behaviour of the recommended hashtags people consumed. From the extensive experimental results, we draw the following conclusions. First, the social factor, along with the textual factor, is the most significant factor for hashtag recommendation. Second, the quality of hashtag recommendation in communities detected using CPM clearly outperforms that using BFS. Third, incorporating user profiling increases the quality of the recommended hashtags.

Areej Alsini, Amitava Datta, Jianxin Li, Du Huynh
Group Recommender Model Based on Preference Interaction

As the application of recommender systems increases, research on group recommenders has been receiving more attention. In the course of group activities, the unknown preferences of users are often affected by other members of the group. However, existing group recommender systems do not take this effect into account. In this paper, we propose a novel recommender model that incorporates the preference interaction within the group into the rating prediction process. The model is divided into two parts, self-prediction and preference-interaction, and the preference-interaction part is systematically analyzed and illustrated. For every user in the group, we use group activity history information and a recommender post-rating feedback mechanism to generate personalized interaction parameters, which improves the group recommendation accuracy. Finally, the model is combined with the collaborative filtering algorithm and compared with the algorithm without the model on the MovieLens dataset. The experimental results show that the proposed model noticeably improves the accuracy of group recommendation results.

Wei Zheng, Bohan Li, Yanan Wang, Hongzhi Yin, Xue Li, Donghai Guan, Xiaolin Qin
Identification of Grey Sheep Users by Histogram Intersection in Recommender Systems

Collaborative filtering, as one of the most popular recommendation algorithms, has been well developed in the area of recommender systems. However, one of the classical challenges in collaborative filtering, the problem of “Grey Sheep” users, is still under investigation. “Grey Sheep” users are a group of users who may neither agree nor disagree with the majority of users, and they may make it difficult to produce accurate collaborative recommendations. In this paper, we discuss the drawbacks of the approach that identifies Grey Sheep users by reusing outlier detection techniques based on the distribution of user-user similarities. We propose to alleviate these drawbacks and improve the identification of Grey Sheep users by using histogram intersection to better produce the user-user similarities. Our experimental results based on the MovieLens 100K rating data demonstrate the ease and effectiveness of our proposed approach in comparison with existing approaches to identify Grey Sheep users.
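
To make the core idea concrete, a small sketch (under the assumption that users are compared via histograms of their rating values) of histogram intersection as a user-user similarity; Grey Sheep candidates would then be users whose similarity to most others is unusually low:

```python
import numpy as np

def rating_histogram(ratings, scale=5):
    """Normalized histogram of a user's ratings on a 1..scale scale."""
    h = np.bincount(ratings, minlength=scale + 1)[1:].astype(float)
    return h / h.sum()

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical rating distributions."""
    return float(np.minimum(h1, h2).sum())

u = rating_histogram(np.array([5, 4, 4, 3, 5]))
v = rating_histogram(np.array([1, 2, 1, 3, 2]))
print(histogram_intersection(u, v))  # low: very different rating habits
```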

Yong Zheng, Mayur Agnani, Mili Singh

Social Network and Social Media

Frontmatter
A Feature-Based Approach for the Redefined Link Prediction Problem in Signed Networks

Link prediction is an important research issue in social networks, which can be applied in many areas, such as trust-aware business applications and viral marketing campaigns. With the rise of signed networks, the link prediction problem becomes more complex and challenging as it introduces negative relations among users. However, instead of predicting the future relation for a pair of users, current research focuses on distinguishing whether a certain link is positive or negative, on the premise that the link exists. The situation in which two users have no relation (i.e., no-relation), which is actually the most common case in reality, is also not considered. In this paper, we redefine the link prediction problem in signed social networks by also considering “no-relation” as a possible future status of a node pair. To understand the underlying mechanism of link formation in signed networks, we propose a feature framework on the basis of a thorough exploration of potential features for the newly identified problem. We find that features derived from social theories can well distinguish these three social statuses. Grounded on the feature framework, we adopt a multiclass classification model to leverage all the features, and experiments show that our method outperforms the state-of-the-art methods.

Xiaoming Li, Hui Fang, Jie Zhang
From Mutual Friends to Overlapping Community Detection: A Non-negative Matrix Factorization Approach

Community detection provides a way to unravel complicated structures in complex networks. Overlapping community detection allows nodes to be associated with multiple communities. Matrix Factorization (MF) is one of the standard tools to solve overlapping community detection problems from a global view. Existing MF-based methods only exploit link information revealed by the adjacency matrix, but ignore other critical information. In fact, compared with the existence of a link, the number of mutual friends between two nodes can better reflect their similarity regarding community membership. In this paper, based on the concept of mutual friend, we incorporate Mutual Density as a new indicator to infer the similarity of community membership between two nodes in the MF framework for overlapping community detection. We conduct data observation on real-world networks with ground-truth communities to validate an intuition that mutual density between two nodes is correlated with their community membership cosine similarity. According to this observation, we propose a Mutual Density based Non-negative Matrix Factorization (MD-NMF) model by maximizing the likelihood that node pairs with larger mutual density are more similar in community memberships. Our model employs stochastic gradient descent with sampling as the learning algorithm. We conduct experiments on various real-world networks and compare our model with other baseline methods. The results show that our MD-NMF model outperforms the other state-of-the-art models on multiple metrics in these benchmark datasets.
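
The paper's exact Mutual Density definition is not given in the abstract; as a hypothetical stand-in, a Jaccard-style normalization of the mutual-friend count conveys the intuition that node pairs sharing many neighbors should have similar community memberships:

```python
def mutual_density(adj, u, v):
    """Hypothetical stand-in for the Mutual Density indicator: the
    number of mutual friends of u and v, normalized by the size of
    their combined neighborhoods."""
    nu, nv = adj[u], adj[v]
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0

adj = {                      # toy undirected network as neighbor sets
    'a': {'b', 'c', 'd'},
    'b': {'a', 'c', 'd'},
    'c': {'a', 'b'},
    'd': {'a', 'b'},
}
print(mutual_density(adj, 'a', 'b'))  # many mutual friends -> high value
```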

Xingyu Niu, Hongyi Zhang, Michael R. Lyu, Irwin King
Calling for Response: Automatically Distinguishing Situation-Aware Tweets During Crises

Recent years have witnessed the prevalence and use of social media, such as Twitter, during crises; it has been becoming a valuable information source that helps the authorities offer better responses to crisis and emergency situations. However, the sheer amount of tweets cannot be used directly. In such a context, distinguishing the most important and informative tweets is crucial to enhance emergency situation awareness. In this paper, we design a convolutional neural network based model to automatically detect crisis-related tweets. We explore Twitter-specific linguistic, sentimental and emotional analysis along with statistical topic modeling to identify a set of quality features. We then incorporate them into a convolutional neural network model to identify crisis-related tweets. Experiments on a real-world Twitter dataset demonstrate the effectiveness of our proposed model.

Xiaodong Ning, Lina Yao, Xianzhi Wang, Boualem Benatallah
Efficient Revenue Maximization for Viral Marketing in Social Networks

In social networks, the problem of revenue maximization aims at maximizing the overall revenue from the purchasing behaviors of users under influence propagation. Previous studies use a number of simulations on influence cascades to obtain the maximum revenue. However, these simulation-based methods are time-consuming and cannot be applied to large-scale networks. Instead, we propose calculation-based algorithms for revenue maximization, which gain the maximum revenue through fast approximate calculations within local acyclic graphs instead of slow simulations across the global network. Furthermore, a max-heap updating scheme is proposed to prune unnecessary calculations. These algorithms are designed for both the scenarios of unlimited and constrained commodity supply. Experiments on both synthetic and real-world datasets demonstrate the efficiency and effectiveness of our proposals; that is, our algorithms run orders of magnitude faster than the state-of-the-art baselines while the maximum revenue achieved is almost unaffected.
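
The max-heap pruning idea is in the spirit of CELF-style lazy greedy selection; a minimal sketch, assuming a user-supplied marginal_revenue(candidate, chosen_set) function (hypothetical name), is:

```python
import heapq

def lazy_greedy(candidates, marginal_revenue, k):
    """Lazy evaluation with a max-heap: re-evaluate a candidate only
    when it surfaces with a stale estimate, pruning most marginal-
    revenue calculations (heapq is a min-heap, so values are negated)."""
    heap = [(-marginal_revenue(c, set()), c, 0) for c in candidates]
    heapq.heapify(heap)
    chosen, round_no = set(), 0
    while len(chosen) < k and heap:
        neg_rev, c, stamp = heapq.heappop(heap)
        if stamp == round_no:      # estimate is fresh: commit to it
            chosen.add(c)
            round_no += 1
        else:                      # stale: recompute and push back
            heapq.heappush(heap, (-marginal_revenue(c, chosen), c, round_no))
    return chosen

# Toy usage with diminishing returns as more items are chosen.
base = {'x': 5.0, 'y': 3.0, 'z': 2.0}
mr = lambda c, S: base[c] / (1 + len(S))
print(lazy_greedy(base, mr, k=2))
```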

Yuan Su, Xi Zhang, Sihong Xie, Philip S. Yu, Binxing Fang
Generating Life Course Trajectory Sequences with Recurrent Neural Networks and Application to Early Detection of Social Disadvantage

Using long-running panel data from the Household, Income and Labour Dynamics in Australia (HILDA) survey, collected annually between 2001 and 2015, we aim to generate a sequence of events for individuals by processing real-life trajectories one step at a time and predicting what comes next. This is motivated by the need for understanding and predicting forthcoming patterns in disadvantage dynamics, which are represented by multiple life-course trajectories evolving over time. In this paper, given longitudinal trajectories created from HILDA survey waves, we develop a model with Long Short-Term Memory (LSTM) recurrent neural networks to generate complex trajectory sequences with long-range structure. Our method uses a multi-layered LSTM to map the input sequence to a vector of fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. The generated sequences use the social exclusion monitor (SEM) indicator to determine the level of social disadvantage for each individual over time. The sequences are encoded by predefined social exclusion factors, which are binary values indicating the occurrence of the corresponding factors. To model the correlations among social exclusion domains, we use Mixture Density Networks, which are parameterized by the outputs of the LSTM. Our main result is the high prediction accuracy on personal life-course trajectories created from real HILDA data. Moreover, the proposed model can synthesize and impute missing trajectories given partial observations from respondent individuals. More importantly, we examine the relative roles of different disadvantage dimensions in explaining changes in life trajectories in Australia, and find that the domains of employment, education, community and personal safety are highly correlated with decreased disadvantage measurements, while domains regarding material resources, health and social support are of direct relevance to increased social disadvantage, with varied contributions.

Lin Wu, Michele Haynes, Andrew Smith, Tong Chen, Xue Li
FRISK: A Multilingual Approach to Find twitteR InterestS via wiKipedia

Several studies have shown that the users of Twitter reveal their interests (i.e., what they like) while they share their opinions, preferences and personal stories. In this paper we describe Frisk, a multilingual unsupervised approach for the categorization of the interests of Twitter users. Frisk models the tweets of a user and the interests (e.g., politics, sports) as bags of articles and categories of Wikipedia respectively, and ranks the interests by relevance, measured as the graph distance between the articles and the categories. To the best of our knowledge, existing unsupervised approaches do not address multilingualism and describe the users’ interests through bags of words (e.g., phone, apps), without a precise categorization (e.g., technology). We evaluated Frisk on a dataset including 1,347 users and more than three million tweets written in four different languages (English, French, Italian and Spanish). The results indicate that Frisk shows quantitative promise, also compared to approaches based on text classification (SVM, Naive Bayes and Random Forest) and LDA.
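
A minimal sketch of the graph-distance idea: BFS hop count from an article to a candidate interest category in a toy slice of the Wikipedia article/category graph (node names are illustrative only; Frisk's exact ranking function is not reproduced):

```python
from collections import deque

def graph_distance(graph, source, target):
    """Shortest hop count from source to target; -1 if unreachable."""
    seen, frontier = {source}, deque([(source, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if node == target:
            return depth
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return -1

g = {
    'article:iPhone': ['cat:Smartphones'],
    'cat:Smartphones': ['cat:Technology'],
    'cat:Technology': [],
}
print(graph_distance(g, 'article:iPhone', 'cat:Technology'))  # 2
```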

Coriane Nana Jipmo, Gianluca Quercini, Nacéra Bennacer
A Solution to Tweet-Based User Identification Across Online Social Networks

User identification can help us build better user profiles and benefit many applications, and it has attracted many scholars’ attention. The existing works with good performance are mainly based on rich online data. However, due to privacy settings, it is costly or even infeasible to obtain such rich data. Besides, some profile attributes do not require exclusivity and are easily faked by users for different purposes, which makes the existing schemes quite fragile. Users often publicly publish their activities on different social networks, which provides a way to overcome the above problems. We aim to address user identification based only on users’ tweets. We first formulate the problem of user identification based on tweets and propose a tweet-based user identification model. Then a supervised machine learning based solution is presented. It consists of three key steps: first, we propose several algorithms to measure the spatial similarity, temporal similarity and content similarity of two tweets; second, we extract the spatial, temporal and content features to exploit information redundancies; finally, we employ machine learning methods for user identification. The experiments show that the proposed solution provides excellent performance, with F1 values reaching 89.79%, 86.78% and 86.24% on three ground-truth datasets, respectively. This work shows the possibility of user identification with easily accessible and not easily impersonated online data.
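
As a sketch of the three per-pair similarities (the paper's exact formulas are not given in the abstract; the decay functions below are illustrative assumptions):

```python
import math

def spatial_sim(p1, p2):
    """Haversine distance (km) between (lat, lon) points, mapped to (0, 1]."""
    (la1, lo1), (la2, lo2) = [tuple(map(math.radians, p)) for p in (p1, p2)]
    a = (math.sin((la2 - la1) / 2) ** 2
         + math.cos(la1) * math.cos(la2) * math.sin((lo2 - lo1) / 2) ** 2)
    km = 2 * 6371 * math.asin(math.sqrt(a))
    return 1 / (1 + km)

def temporal_sim(t1, t2, scale=3600):
    """Decays with the gap (in seconds) between posting times."""
    return 1 / (1 + abs(t1 - t2) / scale)

def content_sim(tokens1, tokens2):
    """Jaccard overlap of the two tweets' token sets."""
    s1, s2 = set(tokens1), set(tokens2)
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

print(spatial_sim((1.3521, 103.8198), (1.3644, 103.9915)))   # nearby tweets
print(temporal_sim(1000, 4600))
print(content_sim('great food here'.split(), 'great food today'.split()))
```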

Yongjun Li, Zhen Zhang, You Peng

Machine Learning

Frontmatter
Supervised Feature Selection Algorithm Based on Low-Rank and Manifold Learning

Manifold learning can effectively find the essential dimension of nonlinear high-dimensional data, but as an unsupervised learning method it cannot use the class label information of the data. This paper explores a novel supervised feature selection algorithm based on low-rank and manifold learning. Specifically, we obtain the coefficient matrix according to the relationship between the data and the class labels. Then we combine sparse learning and manifold learning to conduct feature selection. Finally, we use the low-rank representation to further adjust the result of feature selection. Experimental results show that our new method obtains the best results on four public datasets when compared with six existing methods.

Yue Fang, Jilian Zhang, Shichao Zhang, Cong Lei, Xiaoyi Hu
Mixed Membership Sparse Gaussian Conditional Random Fields

Building statistical models to explain the association between responses (output) and predictors (input) is critical in many real applications. In reality, responses may not be independent. A promising direction is to predict related responses together (e.g. Multi-task LASSO). However, not all responses have the same degree of relatedness. Sparse Gaussian conditional random field (SGCRF) was developed to learn the degree of relatedness automatically from the samples without any prior knowledge. In real cases, features (both predictors and responses) are not arbitrary, but are dominated by a (smaller) set of related latent factors, e.g. clusters. SGCRF does not capture these latent relations in the model. Being able to model these relations could result in more accurate association between responses and predictors. In this paper, we propose a novel (mixed membership) hierarchical Bayesian model, namely M²GCRF, to capture this phenomenon (in terms of clusters). We develop a variational Expectation-Maximization algorithm to infer the latent relations and association matrices. We show that M²GCRF clearly outperforms existing methods for both synthetic and real datasets, and the association matrices identified by M²GCRF are more accurate.

Jie Yang, Henry C. M. Leung, S. M. Yiu, Francis Y. L. Chin
Effects of Dynamic Subspacing in Random Forest

Due to its simplicity and good performance, Random Forest attracts much interest from the research community. The splitting attribute at each node of a decision tree for Random Forest is determined from a predefined number of randomly selected attributes (a subset of the entire attribute set). The size of this attribute subset (subspace) is one of the most important factors, exerting a multitude of influences on Random Forest. In this paper, we propose a new technique that dynamically determines the size of subspaces based on the relative size of the current data segment to the entire data set. In order to assess the effects of the proposed technique, we conduct experiments involving five widely used data sets from the UCI Machine Learning Repository. The experimental results indicate the capability of the proposed technique to improve the ensemble accuracy of Random Forest.
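
A hypothetical illustration of dynamic subspacing (the paper's actual rule is not reproduced here): interpolate between the classical log2-based default and the full attribute set according to the relative size of the current data segment:

```python
import math

def dynamic_subspace_size(segment_size, total_size, num_attributes):
    """Toy dynamic subspacing rule: the smaller the node's data segment
    relative to the full data set, the larger the attribute subspace."""
    base = int(math.log2(num_attributes)) + 1    # Breiman-style default
    frac = 1.0 - segment_size / total_size       # deeper node -> larger share
    return min(num_attributes, base + int(frac * (num_attributes - base)))

print(dynamic_subspace_size(1000, 1000, 64))  # root node -> 7
print(dynamic_subspace_size(50, 1000, 64))    # small segment -> 61
```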

Md Nasim Adnan, Md Zahidul Islam
Diversity and Locality in Multi-Component, Multi-Layer Predictive Systems: A Mutual Information Based Approach

This paper discusses the effect of locality and diversity among the base models of a Multi-Component Multi-Layer Predictive System (MCMLPS). A new ensemble method is introduced in which data instances are assigned to local regions using a conditional mutual information measure based on the similarity of their features, and the outputs of the base models are weighted by this similarity metric. The proposed architecture has been tested on a number of data sets and its performance was compared to four benchmark algorithms. Moreover, the effect of changing three parameters of the proposed architecture has been tested and compared.

Bassma Al-Jubouri, Bogdan Gabrys
Hybrid Subspace Mixture Models for Prediction and Anomaly Detection in High Dimensions

Robust learning of mixture models in high dimensions remains an open challenge, especially in the current big data era. This paper investigates twelve variants of hybrid mixture models that combine G-means clustering, Gaussian, and Student t-distribution mixture models for high-dimensional predictive modeling and anomaly detection. High-dimensional data is first reduced to a lower-dimensional subspace using whitened principal component analysis. For real-time data processing in batch mode, a technique based on the Gram-Schmidt orthogonalization process is proposed and demonstrated to update the reduced dimensions so that they remain relevant to the task objectives. In addition, a model-adaptation technique is proposed and demonstrated for big data incremental learning by statistically matching the mixture components’ mean and variance vectors; the adapted parameters are computed as a weighted average that takes into account the sample sizes of the new and older statistics, with a parameter to scale down the influence of older statistics in each iterative computation. The hybrid models’ performance is evaluated using simulation and empirical studies. Results show that simple hybrid models without the Expectation-Maximization training step can achieve high performance in high dimensions comparable to the more sophisticated models. For unsupervised anomaly detection, the hybrid models achieve detection rates of at least 90% with injected anomalies ranging from 1% to 60%, using the KDD Cup 1999 network intrusion dataset.
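
A minimal numpy sketch of the whitened PCA reduction step (just the dimensionality reduction; the mixture modeling and Gram-Schmidt updating are not shown):

```python
import numpy as np

def whitened_pca(X, k):
    """Project X (n x d) onto k whitened principal components, so the
    reduced data has (approximately) identity covariance."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T / (s[:k] / np.sqrt(len(X) - 1))   # scale each direction
    return Xc @ W

X = np.random.rand(500, 100)                       # high-dimensional data
Z = whitened_pca(X, k=5)
print(np.round(np.cov(Z, rowvar=False), 2))        # approx identity matrix
```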

Jenn-Bing Ong, Wee-Keong Ng

Classification and Clustering Methods

Frontmatter
StruClus: Scalable Structural Graph Set Clustering with Representative Sampling

We present a structural clustering algorithm for large-scale datasets of small labeled graphs, utilizing a frequent subgraph sampling strategy. A set of representatives provides an intuitive description of each cluster, supports the clustering process, and helps to interpret the clustering results. The projection-based nature of the clustering approach allows us to bypass the dimensionality and feature extraction problems that arise when graph datasets are reduced to pairwise distances or feature vectors. While achieving high-quality and (human-)interpretable clusterings, the runtime of the algorithm grows only linearly with the number of graphs. Furthermore, the approach is easy to parallelize and therefore suitable for very large datasets. Our extensive experimental evaluation on synthetic and real-world datasets demonstrates the superiority of our approach over existing structural and subspace clustering algorithms, from both a runtime and a quality point of view.

Till Schäfer, Petra Mutzel
Employing Hierarchical Clustering and Reinforcement Learning for Attribute-Based Zero-Shot Classification

Zero-shot classification (ZSC) is a hot topic in computer vision. Because the training labels are completely different from the testing labels, ZSC cannot be handled by classical classifiers. The attribute-based classifier is a dominant solution for ZSC: it employs attribute annotations to bridge training labels and testing labels, making ZSC feasible. Classical attribute-based classifiers treat different attributes equally; however, the attributes contribute to classification unequally. In this paper, a novel attribute-based classifier for ZSC named HCRL is proposed. HCRL utilizes hierarchical clustering to obtain a hierarchy from the attribute annotations. The attribute annotations are then decomposed into hierarchical rules that each contain only a few attributes. The discriminative abilities of the rules reflect the significance of the attributes for classification, but there are no training samples for evaluating the rules, so the discriminative abilities are determined by reinforcement learning during testing, and the most discriminative rules are picked out for classification. Experiments conducted on two popular datasets for ZSC show the competitiveness of HCRL.

Bin Liu, Li Yao, Junfeng Wu, Xiaosheng Feng
Environmental Sound Recognition Using Masked Conditional Neural Networks

Neural network based architectures used for sound recognition are usually adapted from other application domains, which may not harness sound related properties. The ConditionaL Neural Network (CLNN) is designed to consider the relational properties across frames in a temporal signal, and its extension the Masked ConditionaL Neural Network (MCLNN) embeds a filterbank behavior within the network, which enforces the network to learn in frequency bands rather than bins. Additionally, it automates the exploration of different feature combinations analogous to handcrafting the optimum combination of features for a recognition task. We applied the MCLNN to the environmental sounds of the ESC-10 dataset. The MCLNN achieved competitive accuracies compared to state-of-the-art convolutional neural networks and hand-crafted attempts.

Fady Medhat, David Chesmore, John Robinson
Analyzing Performance of Classification Techniques in Detecting Epileptic Seizure

Epileptic seizure detection is a challenging research topic. The objective of this research is to analyze the performance of various classification techniques in detecting epileptic seizures within a shorter time. In this paper, we apply four different types of classifiers, two black-box (SVM and k-NN) and two non-black-box (decision tree and ensemble), to two epileptic patients’ seizure data sets. Our findings show that non-black-box classifiers, specifically ensemble classifiers, do better than the other classifiers. The experimental results indicate that the ensemble classifier can assist seizure detection at a shorter epoch length (i.e., 0.5 s) with a high accuracy rate. Significantly, in comparison to the other classifiers, the ensemble classifier provides high accuracy and a lower false detection rate.

Mohammad Khubeb Siddiqui, Md Zahidul Islam, Muhammad Ashad Kabir
A Framework for Clustering and Dynamic Maintenance of XML Documents

Web data clustering has been widely studied in the data mining communities. However, dynamic maintenance of the web data clusters is still a challenging task. In this paper, we propose a novel framework called XClusterMaint which serves for both clustering and maintenance of the XML documents. For clustering, we take both structure and content into account and propose an efficient solution for grouping the documents based on the combination of structure and content similarity. For maintenance, we propose an incremental approach for maintaining the existing clusters dynamically when we receive new incoming XML documents. Since the dynamic maintenance of the clusters is computationally expensive, we also propose an improved approach which uses a lazy maintenance scheme to improve the performance of the clusters maintenance. The experimental results on real datasets verify the efficiency of the proposed clustering and maintenance model.

Ahmed Al-Shammari, Chengfei Liu, Mehdi Naseriparsa, Bao Quoc Vo, Tarique Anwar, Rui Zhou
Language-Independent Twitter Classification Using Character-Based Convolutional Networks

Most research on Twitter classification is focused on tweets in English, but Twitter supports over 40 languages and about 50% of tweets are non-English. To fully use Twitter content, it is important to develop classifiers that can classify multilingual tweets or tweets of mixed languages (for example, tweets mainly in Chinese but containing English words). The translation-based model is a classical approach to achieving multilingual or cross-lingual text classification. Recently, character-based neural models have been shown to be effective for text classification, but they are designed for a limited set of European languages and require identification of languages to build an alphabet for encoding and quantizing characters. In this paper, we propose UniCNN (Unicode character Convolutional Networks), a fully language-independent character-based CNN model for the classification of tweets in multiple languages and mixed languages that does not require language identification. Specifically, we propose to encode the sequence of characters in a tweet into a sequence of numerical UTF-8 codes, and then train a character-based CNN classifier. In addition, a character-based embedding layer is included before the convolutional layer for learning distributed character representations. We conducted experiments on Twitter datasets for multilingual sentiment classification in six languages and for mixed-language informativeness classification in over 40 languages. Our experiments showed that UniCNN mostly performed better than state-of-the-art neural models and traditional feature-based models, while not requiring the extra burden of any translation or tokenization.
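
The encoding step is simple enough to sketch directly; assuming "UTF-8 codes" means the raw byte values (one embedding row per byte value, plus padding):

```python
def utf8_codes(tweet, max_len=140):
    """Encode a tweet of any language as its UTF-8 byte sequence,
    truncated/padded to a fixed length; no language identification
    or per-language alphabet is needed."""
    codes = list(tweet.encode('utf-8'))[:max_len]
    return codes + [0] * (max_len - len(codes))   # 0 = padding token

# Each value in 0..255 indexes a learnable character-embedding row
# that feeds the convolutional layers.
print(utf8_codes('Hello 世界!', max_len=16))
```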

Shiwei Zhang, Xiuzhen Zhang, Jeffrey Chan

Behavior Modeling and User Profiling

Frontmatter
Modeling Check-In Behavior with Geographical Neighborhood Influence of Venues

With many users adopting location-based social networks (LBSNs) to share their daily activities, LBSNs have become a gold mine for researchers studying human check-in behavior. Modeling such behavior can benefit many useful applications such as urban planning and location-aware recommender systems. Unlike previous studies [4, 6, 12, 17] that focus on the effect of distance on users’ check-ins at venues, we consider two venue-specific effects of geographical neighborhood influence, namely spatial homophily and neighborhood competition. The former refers to the fact that venues share more common features with their spatial neighbors, while the latter captures the rivalry between a venue and its nearby neighbors for gaining visits from users. In this paper, through an extensive empirical study, we show that these two geographical effects, together with social homophily, play significant roles in understanding users’ check-in behaviors. From this observation, we then propose to model users’ check-in behavior by incorporating these effects into a matrix factorization-based framework. To evaluate our proposed models, we conduct a check-in prediction task and show that our models outperform the baselines. Furthermore, we discover that the neighborhood competition effect has more impact on users’ check-in behavior than spatial homophily. To the best of our knowledge, this is the first study that quantitatively examines these two effects of geographical neighborhood influence on users’ check-in behavior.

Thanh-Nam Doan, Ee-Peng Lim
An Empirical Study on Collective Online Behaviors of Extremist Supporters

Online social media platforms such as Twitter have been found to be misused by extremist groups, including Islamic State of Iraq and Syria (ISIS), who attract and recruit social media users. To prevent their influence from expanding in the online social media platforms, it is required to understand the online behaviors of these extremist group users and their followers, for predicting and identifying potential security threats. We present an empirical study about ISIS followers’ online behaviors on Twitter, proposing to classify their tweets in terms of political and subjectivity polarities. We first develop a supervised classification model for the polarity classification, based on natural language processing and clustering methods. We then develop a statistical analysis of term-polarity correlations, which leads us to successfully observe ISIS followers’ online behaviors, which are in line with the reports of experts.

Jung-jae Kim, Yong Liu, Wee Yong Lim, Vrizlynn L. L. Thing
Your Moves, Your Device: Establishing Behavior Profiles Using Tensors

Smartphones have become a person’s constant companion. As the strictly personal devices they are, they gradually enable the replacement of well-established activities such as payments, two-factor authentication or personal assistants. In addition, Internet of Things (IoT) gadgets extend their capabilities even further. Devices such as body-worn fitness trackers allow users to keep track of daily activities by periodically synchronizing data with the smartphone and ultimately with the vendor’s computational centers in the cloud. These fitness trackers are equipped with an array of sensors to measure the movements of the device, to derive information such as step counts, and to make assessments about sleep quality. We capture the raw sensor data from wrist-worn activity trackers to model a biometric behavior profile of the wearer, and establish and present techniques to determine whether the original person who trained the model, or another individual, is currently wearing the bracelet. Our contribution is based on CANDECOMP/PARAFAC (CP) tensor decomposition, whose computational complexity permits either execution on light computational devices at low precision settings, or migration to stronger CPUs or to the cloud for high to very high granularity. This precision parameter allows the security layer to be adaptable, in order to be compliant with the requirements set by the use cases. We show that our approach identifies users with high confidence.
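
For illustration, a compact alternating-least-squares CP decomposition of a 3-way sensor tensor (a standard formulation; the paper's precision-tunable variant is not reproduced):

```python
import numpy as np

def cp_als(X, rank, iters=50, seed=0):
    """Rank-R CANDECOMP/PARAFAC of a 3-way tensor X via alternating
    least squares: X[i,j,k] ~= sum_r A[i,r] * B[j,r] * C[k,r]."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A, B, C = (rng.random((n, rank)) for n in (I, J, K))
    for _ in range(iters):
        A = np.einsum('ijk,jr,kr->ir', X, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum('ijk,ir,kr->jr', X, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum('ijk,ir,jr->kr', X, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

X = np.random.rand(4, 6, 20)          # toy (user x sensor x time) tensor
A, B, C = cp_als(X, rank=3)
approx = np.einsum('ir,jr,kr->ijk', A, B, C)
print(np.linalg.norm(X - approx) / np.linalg.norm(X))  # relative error
```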

Eric Falk, Jérémy Charlier, Radu State
An Approach for Identifying Author Profiles of Blogs

Author profile identification has been an important research problem in the areas of web mining, network public opinion monitoring and social network analysis. The aim is to identify characteristics or traits of authors of textual information such as blogs, microblogs or reviews on social network platforms or commercial platforms. Author profile identification can be employed in many applications, including cyberspace forensics, electronic commerce and information security. In this paper, we propose a hybrid framework to solve the author profile identification problem. In this framework, we design a distributed integrated representation approach for blogs based on Doc2vec and term frequency-inverse document frequency, and apply a convolutional neural network to predict the age, gender and education status of blog authors. The benefits of our technique are that it predicts three different traits of authors in a uniform way, it is an unsupervised method which can learn representation vectors of blog posts from unlabeled data, and it does not need any syntactic or semantic parsing of sentences. Experimental results on blogs show that our approach achieves promising performance.

Chunxia Zhang, Yu Guo, Jiayu Wu, Shuliang Wang, Zhendong Niu, Wen Cheng
Generating Topics of Interests for Research Communities

With an ever-increasing number of publication venues and research topics, it is becoming difficult for users to find topics of interest for conferences or research areas. Although many popular topic modeling techniques exist, conferences still list their topics of interest using a manual approach. Topics that are generated by existing topic modeling algorithms are good for text categorization, but they are not ideal for displaying to users because they are often unreadable and redundant. In this paper, we propose a novel technique to generate topics of interest using association mining and natural language processing. We show that the topics of interest generated by our technique are much more similar to manually written topics of interest than those of existing topic modeling algorithms. Our results show that the proposed method generates meaningful, interpretable topics, and leads to 13.9% higher precision than existing techniques.

Nagendra Kumar, Rahul Utkoor, Bharath K. R. Appareddy, Manish Singh
An Evolutionary Approach for Learning Conditional Preference Networks from Inconsistent Examples

Conditional Preference Networks (CP-nets) have been proposed for modeling and reasoning about combinatorial decision domains. However, the study of CP-net learning has not advanced sufficiently for their widespread use in complex, real-world applications where the problem is large-scale and the data is not clean. In many real-world applications, due to either the randomness of users’ behaviors or observation errors, the dataset at hand can be inconsistent, i.e., there exists at least one outcome preferred over itself in the dataset. In this work, we present an evolutionary method for solving the CP-net learning problem from inconsistent examples. Here, we do not learn the CP-nets directly. Instead, we frame the learning problem as an optimization problem and use the power of evolutionary algorithms to find the optimal CP-net. The experiments indicate that the proposed approach is able to find a good-quality CP-net and outperforms the current state-of-the-art algorithms in terms of both sample agreement and graph similarity.

Mohammad Haqqani, Xiaodong Li

Bioinformatics and Medical Data Analysis

Frontmatter
Predicting Clinical Outcomes of Alzheimer’s Disease from Complex Brain Networks

Brain network modelling has been shown to be effective for studying brain connectivity in Alzheimer’s disease (AD). Although the topological features of AD-affected brain networks have been widely investigated, combining hierarchical network features for predicting AD has received little attention. In this study, we propose a spectral convolutional neural network (SCNN) framework to learn combinations of hierarchical network features for reliable AD prediction. Due to the complex high-dimensional structure of brain networks, conventional convolutional neural networks (CNN) are not able to learn the complete geometrical information of brain networks. To address this limitation, our SCNN is spectrally designed to learn a complete set of network topological features. Specifically, we construct structural brain networks using magnetic resonance images (MRI) from 288 AD patients, 272 mild cognitive impairment (MCI) patients and 272 normal controls (NC). We then deploy SCNN to classify AD from MCI and NC. Experimental results show that SCNN achieves an accuracy of 91.07% in AD/NC classification, 87.72% in AD/MCI classification and 85.45% in MCI/NC classification. In addition, we show that SCNN is able to predict clinical scores associated with AD with high precision.

Xingjuan Li, Yu Li, Xue Li
Doctoral Advisor or Medical Condition: Towards Entity-Specific Rankings of Knowledge Base Properties

In knowledge bases such as Wikidata, it is possible to assert a large set of properties for entities, ranging from generic ones such as name and place of birth to highly profession-specific or background-specific ones such as doctoral advisor or medical condition. Determining a preference or ranking in this large set is a challenge in tasks such as prioritisation of edits or natural-language generation. Most previous approaches to ranking knowledge base properties are purely data-driven, that is, as we show, mistake frequency for interestingness. In this work, we have developed a human-annotated dataset of 350 preference judgments among pairs of knowledge base properties for fixed entities. From this set, we isolate a subset of pairs for which humans show a high level of agreement (87.5% on average). We show, however, that baseline and state-of-the-art techniques achieve only 61.3% precision in predicting human preferences for this subset. We then develop a technique based on a combination of general frequency, applicability to similar entities and semantic similarity that achieves 74% precision. The preference dataset is available at https://www.kaggle.com/srazniewski/wikidatapropertyranking.

Simon Razniewski, Vevake Balaraman, Werner Nutt
Multiclass Lung Cancer Diagnosis by Gene Expression Programming and Microarray Datasets

There are various types of lung cancer, which can be differentiated by cell size as well as growth pattern, and they are all treated differently. Classification of the various types of lung cancer assists in determining the appropriate treatments to decrease fatality rates. In this paper, we broaden the analysis of lung cancer by using gene expression data, binary decomposition strategies and the Gene Expression Programming (GEP) technique, aiming at better classification performance. Classification performance was assessed and compared between our GEP models and three representative machine learning techniques, SVM, NNW and C4.5, on real microarray lung tumor datasets. Reliability was evaluated by cross-dataset validation. The evaluation results demonstrate that our technique can achieve better classification performance in terms of accuracy, standard deviation and the area under the receiver operating characteristic curve. The proposed technique provides a helpful tool for lung cancer classification.

Hasseeb Azzawi, Jingyu Hou, Russul Alanni, Yong Xiang, Rana Abdu-Aljabar, Ali Azzawi
Drug-Drug Interaction Extraction via Recurrent Neural Network with Multiple Attention Layers

Drug-drug interaction (DDI) is vital information when physicians and pharmacists intend to co-administer two or more drugs, and several DDI databases have been constructed to avoid mistaken combined drug use. In recent years, automatically extracting DDIs from biomedical text has drawn researchers’ attention. However, existing work utilizes either complex feature engineering or NLP tools, both of which are insufficient for sentence comprehension. Inspired by deep learning approaches in natural language processing, we propose a recurrent neural network model with multiple attention layers for DDI classification. We evaluate our model on the 2013 SemEval DDIExtraction dataset. The experiments show that our model classifies most of the drug pairs into correct DDI categories, outperforming the existing NLP and deep learning methods.

Zibo Yi, Shasha Li, Jie Yu, Yusong Tan, Qingbo Wu, Hong Yuan, Ting Wang

Spatio-Temporal Data

Frontmatter
People-Centric Mobile Crowdsensing Platform for Urban Design

With the inevitability of urbanization, it is of critical importance to understand how effectively urban spaces are planned and designed to build comfortable and lively smart cities. Our approach is to develop a people-centric mobile crowdsensing platform that provides insights for urban designers, leveraging the proliferation of mobile phones and recent advancements in mobile sensing and data analytics technologies. More specifically, we have designed and developed a smartphone-based platform to collect both user-generated data and data from multiple sensors, contributed by various demographic groups, especially the aged, to understand how they perceive and utilize public spaces. The data collection is also conducted in a privacy-aware manner. Based on the collected data, we then develop advanced and dedicated analytics tools to derive insights about users’ opinions towards public spaces in their neighborhood, utilization of public spaces, mobility patterns of the different demographic groups, etc. These insights will be utilized to enhance the urban design of future smart towns.

Shili Xiang, Lu Li, Si Min Lo, Xiaoli Li
Long-Term User Location Prediction Using Deep Learning and Periodic Pattern Mining

In recent years, with the advances in mobile communication, the growing popularity of fourth-generation mobile networks and the enhancement of location positioning techniques, mobile devices have generated extensive spatial trajectory data, which represent the mobility of moving objects. New services have emerged to serve mobile users based on their predicted locations. Most of the existing studies on location prediction have focused on predicting the next location of a user, which is regarded as short-term next-location prediction. More advanced location-based services could be enabled if long-term location prediction were achieved, but the existing methods, constrained to next-location prediction, are not applicable to the long-term prediction scenario. In this paper, we propose a novel prediction framework named LSTM-PPM that utilises deep learning and periodic pattern mining for long-term prediction of user locations. Our framework borrows ideas from natural language models and uses a multi-step recursive strategy to perform long-term prediction. Furthermore, the periodic pattern mining technique is utilized to reduce the accumulated loss of the multi-step strategy. Through empirical evaluation on a real-life trajectory dataset, our proposed approach is shown to provide effective performance in long-term location prediction. To the best of our knowledge, this is the first work addressing the research topic of long-term user location prediction.
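
A toy sketch of the multi-step recursive strategy with a periodic-pattern override (the paper uses an LSTM as the next-step model; the frequency-based model and pattern below are illustrative stand-ins):

```python
def predict_long_term(history, next_step_model, periodic_pattern, horizon):
    """Feed each prediction back as input for the next step; where a
    mined periodic pattern marks the user as regular, let it override
    the model to limit the accumulated error."""
    seq, out = list(history), []
    for _ in range(horizon):
        slot = len(seq) % len(periodic_pattern)      # e.g. hour-of-day slot
        loc = periodic_pattern[slot] or next_step_model(seq)
        seq.append(loc)
        out.append(loc)
    return out

most_frequent = lambda seq: max(set(seq), key=seq.count)   # stand-in model
pattern = ['home', None, None, 'office', None, None]       # None = no regular slot
print(predict_long_term(['home', 'office', 'gym'], most_frequent, pattern, 6))
```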

Mun Hou Wong, Vincent S. Tseng, Jerry C. C. Tseng, Sun-Wei Liu, Cheng-Hung Tsai
An Intelligent Weighted Fuzzy Time Series Model Based on a Sine-Cosine Adaptive Human Learning Optimization Algorithm and Its Application to Financial Markets Forecasting

Financial forecasting is an extremely challenging task given the complex, nonlinear nature of financial market systems. To overcome this challenge, we present an intelligent weighted fuzzy time series model for financial forecasting, which uses a sine-cosine adaptive human learning optimization (SCHLO) algorithm to search for the optimal parameters for forecasting. New weighted operators that consider frequency-based chronological order and stock volume are analyzed, and SCHLO is integrated to determine the effective intervals and weighting factors. Furthermore, a novel short-term trend repair operation is developed to complement the final forecasting process. Finally, the proposed model is applied to four major world trading markets: the Dow Jones Index (DJI), the German Stock Index (DAX), the Japanese Stock Index (NIKKEI), and the Taiwan Stock Index (TAIEX). Experimental results show that our model is consistently more accurate than the state-of-the-art baseline methods. The easy implementation and effective forecasting performance suggest that our proposed model has favorable prospects for market applications.
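
A minimal sketch of the weighted fuzzy time series core (fixed equal-width intervals and frequency weights only; the paper additionally optimizes intervals and weights with SCHLO and factors in volume and trend repair):

```python
import numpy as np

def weighted_fts_forecast(series, n_intervals=7):
    """One-step forecast: fuzzify values into equal intervals, collect
    the fuzzy logical relationships of the last state, and defuzzify
    as a frequency-weighted average of successor-interval midpoints."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_intervals or 1.0
    state = lambda x: min(int((x - lo) / width), n_intervals - 1)
    mids = np.array([lo + (i + 0.5) * width for i in range(n_intervals)])
    states = [state(x) for x in series]
    succ = [states[t + 1] for t in range(len(states) - 1)
            if states[t] == states[-1]]
    if not succ:
        return float(mids[states[-1]])
    weights = np.bincount(succ, minlength=n_intervals).astype(float)
    return float(weights @ mids / weights.sum())

prices = [105, 107, 104, 108, 112, 110, 111, 109, 113, 112]
print(weighted_fts_forecast(prices))
```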

Ruixin Yang, Mingyang Xu, Junyi He, Stephen Ranshous, Nagiza F. Samatova
Mobile Robot Scheduling with Multiple Trips and Time Windows

We consider a vehicle routing problem with multiple trips and time windows (VRPMTTW) in which a mobile robot transports materials from a central warehouse to multiple demand locations. The robot needs to strictly satisfy the time windows at the demand locations, and it can run multiple trips. How to effectively schedule the robot is a key problem in the operations of Smart Nations and intelligent automated manufacturing. In the literature, three-index mixed integer programming models have been developed. However, these three-index models are difficult to solve in reasonable time for real problems due to the computational complexity of integer programming. We propose an innovative two-index mixed integer programming model. The numerical results show that our model can quickly obtain optimal solutions for cases where the existing literature has not yet found the optimal solution. To the best of our knowledge, it is the first two-index model for this type of problem.

Shudong Liu, Huayu Wu, Shili Xiang, Xiaoli Li

Natural Language Processing and Text Mining

Frontmatter
Feature Analysis for Duplicate Detection in Programming QA Communities

In community question answering (CQA), duplicate questions are questions that were previously created and answered but occur again. These questions produce noise in CQA websites and impede users from finding answers efficiently. Programming CQA (PCQA), a branch of CQA that holds questions related to programming, also suffers from this problem. Existing works on duplicate detection in PCQA websites framed the task as a supervised learning task on question pairs and relied on a number of features extracted from the question pairs. However, they extracted only textual features and did not consider the source code in the questions, which is linguistically very different from natural language. Our work focuses on developing novel features for PCQA duplicate detection. We leverage continuous word vectors from the deep learning literature, probabilistic models in information retrieval, and association pairs mined from duplicate questions using machine translation. We provide extensive empirical analysis of the performance of these features and their various combinations using a range of learning models. Our work could be helpful for both research and practical applications that require extracting features from texts that are not entirely natural language.
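
As an illustration of the feature-extraction setup (only simple lexical features are shown; the paper's word-vector, IR and association-pair features are not reproduced):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pair_features(q1, q2, vectorizer):
    """A few illustrative features for a candidate duplicate pair:
    TF-IDF cosine plus simple token-overlap statistics."""
    v = vectorizer.transform([q1, q2])
    tfidf_cos = float(cosine_similarity(v[0], v[1])[0, 0])
    w1, w2 = set(q1.lower().split()), set(q2.lower().split())
    overlap = len(w1 & w2) / max(len(w1 | w2), 1)
    return [tfidf_cos, overlap, abs(len(w1) - len(w2))]

corpus = ["how to sort a list in python",
          "sorting a python list",
          "connect to mysql from java"]
vec = TfidfVectorizer().fit(corpus)
print(pair_features(corpus[0], corpus[1], vec))  # near-duplicate pair
print(pair_features(corpus[0], corpus[2], vec))  # unrelated pair
```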

Wei Emma Zhang, Quan Z. Sheng, Yanjun Shu, Vanh Khuyen Nguyen
A Joint Human/Machine Process for Coding Events and Conflict Drivers

Constructing datasets to analyse the progression of conflicts has been a longstanding objective of peace and conflict studies research. In essence, the problem is to reliably extract relevant text snippets and code (annotate) them using an ontology that is meaningful to social scientists. Such an ontology usually characterizes types of violent events (killing, bombing, etc.) and/or the underlying drivers of conflict, themselves hierarchically structured, for example security, governance and economics, subdivided into conflict-specific indicators. Numerous coding approaches have been proposed in the social science literature, ranging from fully automated "machine" coding to human coding. Machine coding is highly error prone, especially for labelling complex drivers, and suffers from extraction of duplicated events; human coding is expensive and suffers from inconsistency between annotators; thus hybrid approaches are required. In this paper, we analyse experimentally how human input can most effectively be used in a hybrid system to complement machine coding. Using two newly created real-world datasets, we show that machine learning methods improve on rule-based automated coding for filtering large volumes of input, while human verification of relevant/irrelevant text leads to improved performance of machine learning for predicting multiple labels in the ontology.

Bradford Heap, Alfred Krzywicki, Susanne Schmeidl, Wayne Wobcke, Michael Bain
Quality Prediction of Newly Proposed Questions in CQA by Leveraging Weakly Supervised Learning

Community Question Answering (CQA) websites provide a platform for users to ask questions and share their knowledge. Good questions in CQA websites can improve user experience and attract more users. To the best of our knowledge, only a few studies have examined question quality, especially the quality of newly proposed questions. In this work, we consider a good question to be one that is popular and answerable in CQA websites. Community features of questions are extracted automatically and utilized to acquire a massive number of good questions. The text features and asker features of these good questions are then used to train our weakly supervised model, based on a Convolutional Neural Network, to recognize good newly proposed questions. We conduct extensive experiments on the publicly available dataset from StackExchange, and our best result achieves an F1-score of 91.5%, outperforming the baselines.
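A minimal sketch of the weak-supervision step: community signals (votes, answers) that exist only for old questions are turned into noisy good/bad labels, and a text classifier is trained so that brand-new questions, which have no community signals yet, can be scored from text alone. The thresholds and the linear model are placeholders for the paper's features and CNN.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

history = [("How do I profile memory usage in Python?", 25, 4),
           ("help me plz", -3, 0),
           ("What is the time complexity of quicksort?", 40, 6),
           ("urgent!!! code not working", -5, 0)]

texts = [t for t, _, _ in history]
# Weak labels: a question is "good" if it was upvoted and answered.
labels = [1 if votes > 0 and answers > 0 else 0 for _, votes, answers in history]

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(texts), labels)

new_q = ["How can I measure latency of a REST call?"]
print(clf.predict_proba(vec.transform(new_q))[0, 1])   # popularity/answerability score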

Yuanhao Zheng, Bifan Wei, Jun Liu, Meng Wang, Weitong Chen, Bei Wu, Yihe Chen
Improving Chinese Sentiment Analysis via Segmentation-Based Representation Using Parallel CNN

Automatically analyzing the sentiment implied in texts relies on well-designed models utilizing linguistic features, so such models are mostly language-dependent and designed for English texts. Chinese has the largest number of users in the world, and a tremendous amount of Chinese text is generated daily on social media and elsewhere, yet it has seldom been studied. Moreover, a general observation, valid in many languages, is that different segments of a piece of text, e.g. clauses, may have different sentiment polarities. Existing deep learning models neglect this imbalanced sentiment distribution and take only the entire piece of text as input. This paper proposes a novel sentiment-analysis model capable of performing the sentiment analysis task in Chinese. First, the model segments a text into smaller units according to the punctuation to obtain a preliminary text representation; this step is called segmentation-based representation. Meanwhile, its parallel-CNN (convolutional neural network) framework uses all segments simultaneously. This model, which we call SBR-PCNN, concatenates the representations of the segments to obtain the final representation of the text, which not only contains the semantic and syntactic features but also retains the essential sequential information. The proposed method has been evaluated on two Chinese sentiment classification datasets and compared with a broad range of baselines. Experimental results show that the proposed approach achieves state-of-the-art results on the two benchmark datasets and may improve the performance of Chinese sentiment analysis.
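A minimal PyTorch sketch of the segmentation-based idea: split a text on punctuation, run the same convolutional encoder over every segment, and concatenate the per-segment representations. The vocabulary handling, layer sizes, and classifier head are illustrative placeholders, not the exact SBR-PCNN configuration.

import re
import torch
import torch.nn as nn

class SegmentCNN(nn.Module):
    def __init__(self, vocab_size, dim=16, n_filters=8, n_segments=3, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.conv = nn.Conv1d(dim, n_filters, kernel_size=2, padding=1)
        self.fc = nn.Linear(n_filters * n_segments, n_classes)
        self.n_segments = n_segments

    def forward(self, segments):                 # segments: list of LongTensors
        feats = []
        for seg in segments[:self.n_segments]:   # one CNN branch per segment
            h = self.emb(seg).transpose(0, 1).unsqueeze(0)       # (1, dim, len)
            feats.append(torch.relu(self.conv(h)).max(dim=2).values)
        while len(feats) < self.n_segments:      # pad missing segments with zeros
            feats.append(torch.zeros_like(feats[0]))
        return self.fc(torch.cat(feats, dim=1))  # concatenated representation

def segment_ids(text, vocab):
    # Segment on punctuation, then map characters to integer ids.
    parts = [p.strip() for p in re.split("[,。，.!?；;]", text) if p.strip()]
    return [torch.tensor([vocab.setdefault(c, len(vocab)) for c in p]) for p in parts]

vocab = {}
model = SegmentCNN(vocab_size=100)
logits = model(segment_ids("服务很好，菜品一般。", vocab))
print(logits.shape)   # torch.Size([1, 2])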

Yazhou Hao, Qinghua Zheng, Yangyang Lan, Yufei Li, Meng Wang, Sen Wang, Chen Li
Entity Recognition by Distant Supervision with Soft List Constraint

Supervised named entity recognition systems often suffer from inadequate training data when dealing with domain-specific corpora, e.g., documents in medicine and healthcare. For these domains, obtaining some seed words or phrases is not very difficult, and positive instances obtained through distant supervision based on the seeds can be used to learn recognition models. However, with the limited size of the training samples and no negative ones, the classification results may not be satisfactory. In this paper, we leverage the conjunction-and-comma writing style as a list constraint to enlarge the set of training instances. Different from earlier studies, we formulate two kinds of constraints, namely a soft list constraint and a mention constraint, as regularizers. We then incorporate the constraints into a unified discriminative learning framework and propose a joint optimization algorithm. The experimental results show that our model is superior to state-of-the-art baselines on a large collection of documents about drugs.
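A minimal sketch of the writing-style intuition behind the list constraint: items that appear in a comma/conjunction list alongside known seed entities are likely entities of the same type, so they can become additional (noisy) training instances. The regex pattern and seeds are toy examples, and the paper formulates this as a soft regularizer rather than the hard rule shown here.

import re

seeds = {"aspirin", "ibuprofen"}
sentence = "Patients received aspirin, ibuprofen, naproxen and paracetamol daily."

# Toy pattern isolating the list region; a real system would use parsing.
m = re.search(r"received (.+?) daily", sentence)
candidates = [c.strip() for c in re.split(r",|\band\b", m.group(1)) if c.strip()]

# If the list contains at least one seed, its other members become
# additional (noisy) positive instances of the same entity type.
expanded = set(candidates) if seeds & set(candidates) else set()
print(sorted(expanded - seeds))   # ['naproxen', 'paracetamol']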

Hongkui Tu, Zongyang Ma, Aixin Sun, Zhiqiang Xu, Xiaodong Wang
Structured Sentiment Analysis

Extracting the latent structure of aspects and their sentiment polarities is important, as it helps customers understand people's preferences for a certain product and shows the reasons why they prefer it. However, insufficient work has been done to effectively reveal the structured sentiment of aspects in short texts, owing to their shortness and sparsity. In this paper, we propose a structured sentiment analysis (SSA) approach to understand the sentiments and opinions expressed by people in short texts. The proposed SSA approach has three advantages: (1) it automatically extracts a hierarchical tree of a product's hot aspects from short texts; (2) it hierarchically analyses people's opinions on those aspects; and (3) it generates a summary and evidence for the results. We evaluate our approach on popular products. The experimental results show that the proposed approach can effectively extract a sentiment tree from short texts.

Abdulqader Almars, Xue Li, Xin Zhao, Ibrahim A. Ibrahim, Weiwei Yuan, Bohan Li

Data Mining Applications

Frontmatter
Improving Real-Time Bidding Using a Constrained Markov Decision Process

Online advertising is increasingly switching to real-time bidding on advertisement inventory, in which the ad slots are sold through real-time auctions upon users visiting websites or using mobile apps. To compete with unknown bidders in such a highly stochastic environment, each bidder is required to estimate the value of each impression and to set a competitive bid price. Previous bidding algorithms have done so without considering the constraint of budget limits, which we address in this paper. We model the bidding process in a reinforcement learning framework based on a Constrained Markov Decision Process. Our model uses the predicted click-through rate as the state, the bid price as the action, and ad clicks as the reward. We propose a bidding function that outperforms state-of-the-art bidding functions in terms of the number of clicks when the budget limit is low. We further simulate different bidding functions competing in the same environment and report the performance of the bidding strategies when required to adapt to a dynamic environment.
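A minimal sketch of the ingredients named above (state = predicted CTR, action = bid price, budget as the constraint), with a simple budget-pacing bidder standing in for the paper's CMDP policy; all prices and pCTRs are invented.

def bid(pctr, base_price, budget_left, budget_total):
    # Pace the bid down as the budget is consumed, so spend respects the limit.
    pacing = budget_left / budget_total          # in [0, 1]
    return base_price * pctr * pacing

budget_total, budget_left, clicks = 100.0, 100.0, 0
auctions = [(0.02, 1.2), (0.05, 2.5), (0.01, 0.4)]   # (pCTR, market price)
for pctr, market in auctions:
    b = bid(pctr, base_price=80.0, budget_left=budget_left, budget_total=budget_total)
    if b >= market and budget_left >= market:        # second-price auction win
        budget_left -= market
        clicks += 1 if pctr > 0.03 else 0            # stand-in for a stochastic click
print(budget_left, clicks)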

Manxing Du, Redouane Sassioui, Georgios Varisteas, Radu State, Mats Brorsson, Omar Cherkaoui
PowerLSTM: Power Demand Forecasting Using Long Short-Term Memory Neural Network

Power demand forecasting is a critical task for achieving efficiency and reliability in the smart grid in terms of demand response and resource allocation. This paper proposes PowerLSTM, a power demand forecasting model based on a Long Short-Term Memory (LSTM) neural network. We calculate feature significance and compact our model by keeping the features with the most important weights. In a preliminary study on a public dataset, PowerLSTM reduces the forecasting error by 21.80% and 28.57% compared to two recent works based on Gradient Boosting Tree (GBT) and Support Vector Regression (SVR), respectively. Our study also reveals that a metering/forecasting granularity of once every 30 minutes yields higher accuracy than other practical granularity options.
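A minimal PyTorch sketch of an LSTM demand forecaster in the spirit of PowerLSTM: a sliding window of past half-hourly loads predicts the next reading. The window size, layer sizes, and synthetic series are placeholders, not the paper's configuration.

import torch
import torch.nn as nn

torch.manual_seed(0)
series = torch.sin(torch.linspace(0, 12, 200)) + 1.5   # fake half-hourly load
window = 8
X = torch.stack([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]

class PowerLSTMSketch(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                     # x: (batch, window)
        out, _ = self.lstm(x.unsqueeze(-1))   # feed the window as a sequence
        return self.head(out[:, -1]).squeeze(-1)

model = PowerLSTMSketch()
opt = torch.optim.Adam(model.parameters(), lr=0.01)
for _ in range(200):                          # short training loop
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()
print(float(loss))                            # training MSE after fitting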

Yao Cheng, Chang Xu, Daisuke Mashima, Vrizlynn L. L. Thing, Yongdong Wu
Identifying Unreliable Sensors Without a Knowledge of the Ground Truth in Deceptive Environments

This paper deals with the extremely fascinating area of "fusing" the outputs of sensors without any knowledge of the ground truth. In an earlier paper, the present authors pioneered a solution by mapping the problem onto the fascinating paradox of trying to identify stochastic liars without any additional information about the truth. Even though that work was significant, it was constrained by the model of a world in which "the truth prevails over lying". Couched in the terminology of Learning Automata (LA), this corresponds to the Environment being "Stochastically Informative" (since the Environment is treated as an entity in its own right, we choose to capitalize it rather than refer to it as an "environment", i.e., as an abstract concept). However, as explained in the paper, solving the problem under the condition that the Environment is "Stochastically Deceptive", as opposed to informative, is far from trivial. In this paper, we provide a solution to the problem where the Environment is deceptive, i.e., when we are living in a world where "lying prevails over the truth"; we are not aware of any other solution to this problem within this setting, and so we believe that our solution is both pioneering and novel.
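A minimal sketch of why the informative case is tractable: scoring sensors by mutual agreement exposes a minority liar. By construction, this majority heuristic fails in the deceptive setting the paper addresses (where liars dominate), which is what makes that setting non-trivial. The sensor reliabilities below are simulated.

import numpy as np

np.random.seed(1)
truth = np.random.randint(0, 2, size=500)        # hidden ground truth, never used for scoring
readings = np.array([
    np.where(np.random.rand(500) < p, truth, 1 - truth)
    for p in (0.9, 0.85, 0.9, 0.3)               # sensor 3 lies most of the time
])

n = len(readings)
# Score each sensor by its mean agreement with all the others.
agreement = np.array([
    np.mean([np.mean(readings[i] == readings[j]) for j in range(n) if j != i])
    for i in range(n)
])
print(agreement.round(2), "suspect:", int(agreement.argmin()))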

Anis Yazidi, B. John Oommen, Morten Goodwin
Color-Sketch Simulator: A Guide for Color-Based Visual Known-Item Search

In order to evaluate the effectiveness of a color-sketch retrieval system for a given multimedia database, tedious evaluations involving real users are required, as users are at the center of query sketch formulation. However, without any prior knowledge about the bottlenecks of the underlying sketch-based retrieval model, the evaluations may focus on the wrong settings and thus miss the desired effect. Furthermore, users usually have no clues or recommendations for drawing color sketches effectively. In this paper, we aim at a preliminary analysis to identify potential bottlenecks of a flexible color-sketch retrieval model. We present a formal framework based on position-color feature signatures, enabling comprehensive simulations of users drawing a color sketch.

Jakub Lokoč, Anh Nguyen Phuong, Marta Vomlelová, Chong-Wah Ngo

Applications

Frontmatter
Making Use of External Company Data to Improve the Classification of Bank Transactions

This project explores to what extent external semantic resources on companies can be used to improve the accuracy of a real bank transaction classification system. The goal is to identify which implementations are best suited to exploit the additional company data retrieved from the Brønnøysund Registry and the Google Places API, and to accurately measure the effects they have. The classification system builds on a Bag-of-Words representation and uses Logistic Regression as the classification algorithm. This study suggests that enriching bank transactions with external company data substantially improves the accuracy of the classification system. Compared to the baseline, which has an accuracy of 89.22%, the Brønnøysund Registry and the Google Places API yield increases of 2.79pp and 2.01pp, respectively. In combination, they generate an increase of 3.75pp.
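A minimal sketch of the enrichment idea: append external company attributes to the transaction text before the Bag-of-Words and Logistic Regression step. The industry lookup table here is made up and merely stands in for attributes retrieved from the Brønnøysund Registry or the Google Places API.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

industry = {"rema": "grocery retail", "statoil": "fuel station"}   # hypothetical lookup

def enrich(text):
    # Append any matching external company attributes to the raw transaction text.
    extra = [v for k, v in industry.items() if k in text.lower()]
    return text + " " + " ".join(extra)

transactions = ["REMA 1000 OSLO 123", "STATOIL SERVICE 456",
                "REMA EXPRESS 789", "STATOIL AS 321"]
labels = ["groceries", "transport", "groceries", "transport"]

vec = CountVectorizer()
X = vec.fit_transform(enrich(t) for t in transactions)
clf = LogisticRegression().fit(X, labels)
print(clf.predict(vec.transform([enrich("REMA KIOSK 999")])))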

Erlend Vollset, Eirik Folkestad, Marius Rise Gallala, Jon Atle Gulla
Mining Load Profile Patterns for Australian Electricity Consumers

The transformation from centralized, fossil-based electricity generation to distributed, renewable energy sources is an inevitable trend in the energy industry. One of the prime challenges in this transformation is load/battery management, especially at the residential level. In solving this task, it is critical to have a good strategy for analyzing and grouping residential electricity consumption patterns, so that further optimization strategies can be devised for different groups of consumers. Based on real data from an Australian electricity retailer, we propose a clustering process to determine typical customer load profiles, which can serve as a standard framework for dealing with real-world unsupervised problems. In addition, statistical techniques are integrated into our data preprocessing and analysis: the cumulative sum (CUSUM) chart, a graphical method to clearly visualize and detect changes in time-series data, and mode values, which are used to replace missing values in the dataset. Furthermore, the practical Elbow method is employed in our framework to determine an appropriate number of clusters for k-centers algorithms. We then apply multiple state-of-the-art clustering methods for time-series data and benchmark their respective performance. We found that k-centers clustering techniques produce better results than exemplar-based methods, and that choosing an appropriate number of clusters for k-means improves the performance of the clustering model; for example, k-means++ with $$k=2$$ significantly outperformed the other methods in our experiments.
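A minimal sketch of the pipeline described above (mode imputation of missing readings, elbow inspection, k-means++ clustering) on toy four-point daily profiles; the mode is best suited to discrete meter readings, and all values here are illustrative.

import numpy as np
from scipy import stats
from sklearn.cluster import KMeans

profiles = np.array([[1.0, 2.0, 5.0, 1.0],
                     [1.1, np.nan, 5.2, 0.9],
                     [4.0, 4.2, 1.0, 4.1],
                     [3.9, 4.0, np.nan, 4.0]])

# Mode imputation per time slot (column).
for col in range(profiles.shape[1]):
    vals = profiles[:, col]
    mode = stats.mode(vals[~np.isnan(vals)], keepdims=False).mode
    vals[np.isnan(vals)] = mode

# Elbow method: inspect inertia as k grows and pick the bend.
inertias = {k: KMeans(n_clusters=k, init="k-means++", n_init=10,
                      random_state=0).fit(profiles).inertia_
            for k in (1, 2, 3)}
print(inertias)   # here k=2 separates the two profile shapes
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(profiles))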

Vanh Khuyen Nguyen, Wei Emma Zhang, Quan Z. Sheng, Jason Merefield
STA: A Spatio-Temporal Thematic Analytics Framework for Urban Ground Sensing

Urban planning has always involved getting feedback from various stakeholders and members of the public to inform plans and the evaluation of proposals. A lot of rich information comes in textual form, which traditionally has to be read manually. With advancements in machine learning capabilities, there is potential to aid planners in synthesizing insights from large amounts of textual feedback data more efficiently. In this paper, we develop a general urban-centric feedback analysis framework that encompasses the spatio-temporal thematics of ground sensing. Three essential methods are proposed: geotagging, topic modeling, and trend analysis, and a prototype has been implemented. The results of experiments indicate that the proposed framework can not only accurately extract precise geospatial information, but also efficiently analyze semantic themes through probabilistic topic modeling with Latent Dirichlet Allocation. Importantly, the spatial and temporal trends of the detected topics demonstrate the effectiveness of our proposed algorithm, which can benefit domain experts in their routine work and reveal many interesting insights on ground sensing matters.
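A minimal sketch of the topic-modeling leg of such a framework: fit LDA over feedback texts and attach each document's dominant topic to its geotag, so the spatial distribution of themes can be examined. The feedback texts and coordinates below are invented.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

feedback = ["bus stop too crowded in the morning",
            "need more buses on this route",
            "park benches are broken",
            "playground in the park needs repair"]
coords = [(1.30, 103.85), (1.31, 103.86), (1.28, 103.80), (1.28, 103.81)]

vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(feedback)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
dominant = lda.transform(dtm).argmax(axis=1)    # dominant topic per document
for (lat, lon), topic in zip(coords, dominant):
    print(lat, lon, "topic", topic)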

Guizi Chen, Liang Yu, Wee Siong Ng, Huayu Wu, Usha Nanthani Kunasegaran
Privacy and Utility Preservation for Location Data Using Stay Region Analysis

Location data is very useful for providing location-based services to users, but the problem in releasing this type of data is that sensitive and private information about users may be leaked. In [11] it is stated that even four spatio-temporal points are enough to uniquely identify 95% of individuals. There are different approaches to privacy preservation of spatio-temporal data, but in most of them the utility is severely curtailed in order to ensure risk-free release of the data. Our method introduces a few innovations to retain more utility without compromising privacy. First, for each user we extract stay regions, which are places where the user spends a significant amount of time. Then we extract trajectories, or trips, between these stay regions. The data of very long trajectories is thereby converted into trips, where each trip has a start time, an end time, and start and end lat-lons. We use these four dimensions in a round-robin manner to k-anonymize each trip, and we propose two measures for estimating risk and utility. A nice feature of our method is the visualization of k-anonymized trips, which can give much better information about mass mobility within a city or area.
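A minimal sketch of the round-robin generalization idea on invented trips: each trip is reduced to (start time, end time, start lat-lon, end lat-lon), and the four dimensions are coarsened in turn until every generalized trip is shared by at least k records. The rounding-based bins and the hard k-anonymity test are illustrative simplifications of the paper's method.

from collections import Counter

trips = [(8.1, 8.9, (1.301, 103.84), (1.352, 103.87)),
         (8.3, 9.1, (1.305, 103.84), (1.351, 103.87)),
         (17.0, 17.8, (1.352, 103.87), (1.302, 103.84)),
         (17.2, 18.1, (1.351, 103.88), (1.304, 103.84))]

def generalize(trip, level):
    # level controls how many of the four dimensions are coarsened (round robin).
    start_t, end_t, (slat, slon), (elat, elon) = trip
    coarse = [round(start_t), round(end_t),
              (round(slat, 1), round(slon, 1)), (round(elat, 1), round(elon, 1))]
    fine = [start_t, end_t, (slat, slon), (elat, elon)]
    return tuple(coarse[i] if i < level else fine[i] for i in range(4))

def anonymize(trips, k):
    for level in range(5):   # coarsen one more dimension per pass
        gen = [generalize(t, level) for t in trips]
        counts = Counter(gen)
        if all(counts[g] >= k for g in gen):
            return gen, level
    return gen, 4            # fallback: fully coarsened

gen, level = anonymize(trips, k=2)
print(level, gen)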

Manoranjan Dash, Sin G. Teo
Location-Aware Human Activity Recognition

In this paper, we present one of the winning solutions of an international human activity recognition challenge organized by DrivenData in conjunction with the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. The objective of the challenge was to predict activities of daily living and posture or ambulation based on wrist-worn accelerometer, RGB-D camera, and passive environmental sensor data collected from a smart home in the UK. Most state-of-the-art research focuses on one type of data, e.g., wearable sensor data, for making predictions and overlooks the usefulness of user locations for this purpose. In our work, we propose a novel approach that leverages heterogeneous data types as well as user locations for building predictive models. Note that while we do not have actual location information, we build machine learning models to predict location and use the predictions in user activity recognition. Compared to the state of the art, our proposed approach achieves a 38% improvement, with a Brier score of 0.1346; this means that roughly 9 out of 10 predictions matched the human-labeled descriptions.
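A minimal sketch of the two-stage idea: predict the user's room from sensor features first, then feed the predicted location into the activity model as an extra feature. The features, rooms, activities, and models below are invented placeholders, not the winning solution's pipeline.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_sensor = rng.normal(size=(200, 5))      # stand-in accelerometer/PIR summaries
# Synthetic room labels that depend on the features (kitchen/lounge/bedroom).
rooms = (X_sensor[:, 0] > 0).astype(int) + (X_sensor[:, 1] > 0).astype(int)
activity = (rooms == 0).astype(int)       # e.g. "cooking" happens mostly in one room

# Stage 1: predict location from sensor features.
loc_model = RandomForestClassifier(random_state=0).fit(X_sensor, rooms)
# Stage 2: append the predicted room as a feature for activity recognition.
X_aug = np.column_stack([X_sensor, loc_model.predict(X_sensor)])
act_model = RandomForestClassifier(random_state=0).fit(X_aug, activity)
print(act_model.score(X_aug, activity))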

Tam T. Nguyen, Daniel Fernandez, Quy T. K. Nguyen, Ebrahim Bagheri

Demos

Frontmatter
SWYSWYK: A New Sharing Paradigm for the Personal Cloud

Pushed by recent legislation and smart disclosure initiatives, the Personal Cloud is emerging and holds the promise of giving individuals back control over their data. However, this shift leaves privacy and security issues in users' hands, a role that few people can properly endorse. This demonstration illustrates a new sharing paradigm, called SWYSWYK (Share What You See with Who You Know), dedicated to the Personal Cloud context. It allows each user to physically visualize the net effects of sharing rules and automatically provides tangible guarantees about the enforcement of the defined sharing policies. The usage and internals of SWYSWYK are demonstrated on a running prototype combining a commercial Personal Cloud platform, Cozy, and a secure hardware reference monitor, PlugDB.

Paul Tran-Van, Nicolas Anciaux, Philippe Pucheral
Tools and Infrastructure for Supporting Enterprise Knowledge Graphs

We demonstrate EKG, a collection of tools and back-end infrastructure for creating custom, domain specific knowledge graphs. The toolkit is geared toward enterprises and government organizations where domain specific knowledge graphs are often not available. During the demo, audience members will be able to ingest their own documents and instantiate their own knowledge graphs and update them in real time. We will also present a demo app built using the toolkit consisting of more than 30 million entities and 192 million edges in order to demonstrate the kind of applications that could be built using the proposed toolkit. The app can be used to answer questions like who are the relevant persons named Steve in context of apple computers?, or who are the most important persons related to Barack Obama in context of healthcare reforms act? The functionalities of the toolkit are also exposed through REST APIs making it easier for developers to use the capabilities in their own applications.

Sumit Bhatia, Nidhi Rajshree, Anshu Jain, Nitish Aggarwal
An Interactive Web-Based Toolset for Knowledge Discovery from Short Text Log Data

Many companies maintain human-written logs to capture data on events such as workplace incidents and equipment failures. However, the sheer volume and unstructured nature of this data prevent it from being utilised for knowledge acquisition. Our web-based prototype software system provides a cohesive computational methodology for analysing and visualising log data that requires minimal human involvement. It features an interface to support customisable, modularised log data processing and knowledge discovery. This enables owners of event-based datasets containing short textual descriptions, such as occupational health & safety officers and machine operators, to identify latent knowledge not previously acquirable without significant time and effort. The software system comprises five distinct stages, corresponding to standard data mining milestones: exploratory analysis, data warehousing, association rule mining, entity clustering, and predictive analysis. To the best of our knowledge, it is the first dedicated system to computationally analyse short text log data and provides a powerful interface that visualises the analytical results and supports human interaction.

Michael Stewart, Wei Liu, Rachell Cardell-Oliver, Mark Griffin
Carbon: Forecasting Civil Unrest Events by Monitoring News and Social Media

Societal security has been receiving unprecedented attention over the past decade because of the ubiquity of online public data sources. Much research effort has been devoted to detecting relevant societal issues; forecasting them is more challenging but greatly beneficial to the entire society. In this paper, we present a forecasting system named Carbon to predict civil unrest events, e.g., protests and strikes. Two predictive models are implemented and scheduled to make predictions periodically. One model forecasts through the analysis of historical civil unrest events reported by news portals, while the other functions by detecting and integrating early clues from social media content. With our web UI and visualisation, users can easily explore the predicted events and their spatiotemporal distribution. The demonstration will exemplify that Carbon can greatly benefit society: the general public can be alerted in advance to avoid potential dangers, and the authorities can take proactive actions to alleviate tensions and reduce possible damage to society.

Wei Kang, Jie Chen, Jiuyong Li, Jixue Liu, Lin Liu, Grant Osborne, Nick Lothian, Brenton Cooper, Terry Moschou, Grant Neale
A System for Querying and Analyzing Urban Regions

We develop a new interactive visualization system to support the visualization of different aspects of a region, as well as the querying and analysis of similar land uses. The main contributions of this work include an urban region query framework and an analysis of using social media and traditional census data to identify similar areas. Two example applications of our system are as follows: (1) it can be employed by city planners to analyze and observe similar land uses, assisting them in setting development objectives; (2) it can be used by business owners to identify similar urban regions as potential areas for expansion.

Wee Boon Koh, Xiaolei Li
Detect Tracking Behavior Among Trajectory Data

Due to continuing improvements in location acquisition technology, a large population of GPS-equipped moving objects is tracked in a server. In emergency applications, users may want to detect whether a target is being tracked by another object. We formulate this tracking behavior as continuous distance queries in trajectory databases, and develop index structures to improve the query performance. Using real trajectories, we demonstrate answering continuous distance queries in a database system and animating the moving objects fulfilling the distance condition in the user interface. The result benefits the mining of interesting behavior among trajectory data and the answering of distance join queries.
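A minimal brute-force sketch of the continuous distance query semantics on two sampled trajectories: report the time intervals during which the objects stay within a distance threshold of each other. Real systems answer this with index structures rather than this linear scan, and the trajectories below are invented.

import math

# Trajectories as (timestamp, x, y) samples.
target = [(0, 0.0, 0.0), (1, 1.0, 0.0), (2, 2.0, 0.0), (3, 3.0, 0.0)]
suspect = [(0, 5.0, 0.0), (1, 1.2, 0.1), (2, 2.1, 0.1), (3, 3.1, 0.0)]

def close_intervals(a, b, threshold):
    # Flag each timestamp where the two objects are within the threshold.
    flags = [math.dist(p[1:], q[1:]) <= threshold for p, q in zip(a, b)]
    intervals, start = [], None
    for t, flag in enumerate(flags):        # merge consecutive flags into intervals
        if flag and start is None:
            start = t
        if not flag and start is not None:
            intervals.append((start, t - 1))
            start = None
    if start is not None:
        intervals.append((start, len(flags) - 1))
    return intervals

print(close_intervals(target, suspect, threshold=0.5))   # [(1, 3)] -> possible tracking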

Jianqiu Xu, Jiangang Zhou
Backmatter
Metadata
Title
Advanced Data Mining and Applications
Edited by
Gao Cong
Wen-Chih Peng
Wei Emma Zhang
Chengliang Li
Dr. Aixin Sun
Copyright Year
2017
Electronic ISBN
978-3-319-69179-4
Print ISBN
978-3-319-69178-7
DOI
https://doi.org/10.1007/978-3-319-69179-4
