Skip to main content
Top

2017 | Book

Advances in Knowledge Discovery and Data Mining

21st Pacific-Asia Conference, PAKDD 2017, Jeju, South Korea, May 23-26, 2017, Proceedings, Part I

Editors: Jinho Kim, Kyuseok Shim, Longbing Cao, Jae-Gil Lee, Xuemin Lin, Yang-Sae Moon

Publisher: Springer International Publishing

Book Series : Lecture Notes in Computer Science

insite
SEARCH

About this book

This two-volume set, LNAI 10234 and 10235, constitutes the thoroughly refereed proceedings of the 21st Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2017, held in Jeju, South Korea, in May 2017.

The 129 full papers were carefully reviewed and selected from 458 submissions. They are organized in topical sections named: classification and deep learning; social network and graph mining; privacy-preserving mining and security/risk applications; spatio-temporal and sequential data mining; clustering and anomaly detection; recommender system; feature selection; text and opinion mining; clustering and matrix factorization; dynamic, stream data mining; novel models and algorithms; behavioral data mining; graph clustering and community detection; dimensionality reduction.

Table of Contents

Frontmatter

Classification and Deep Learning

Frontmatter
Convolutional Bi-directional LSTM for Detecting Inappropriate Query Suggestions in Web Search

A web search query is considered inappropriate if it may cause anger, annoyance to certain users or exhibits lack of respect, rudeness, discourteousness towards certain individuals/communities or may be capable of inflicting harm to oneself or others. A search engine should regulate its query completion suggestions by detecting and filtering such queries as it may hurt the user sentiments or may lead to legal issues thereby tarnishing the brand image. Hence, automatic detection and pruning of such inappropriate queries from completions and related search suggestions is an important problem for most commercial search engines. The problem is rendered difficult due to unique challenges posed by search queries such as lack of sufficient context, natural language ambiguity and presence of spelling mistakes and variations.In this paper, we propose a novel deep learning based technique for automatically identifying inappropriate query suggestions. We propose a novel deep learning architecture called “Convolutional Bi-Directional LSTM (C-BiLSTM)” which combines the strengths of both Convolution Neural Networks (CNN) and Bi-directional LSTMs (BLSTM). Given a query, C-BiLSTM uses a convolutional layer for extracting feature representations for each query word which is then fed as input to the BLSTM layer which captures the various sequential patterns in the entire query and outputs a richer representation encoding them. The query representation thus learnt passes through a deep fully connected network which predicts the target class. C-BiLSTM doesn’t rely on hand-crafted features, is trained end-end as a single model, and effectively captures both local features as well as their global semantics. Evaluating C-BiLSTM on real-world search queries from a commercial search engine reveals that it significantly outperforms both pattern based and other hand-crafted feature based baselines. Moreover, C-BiLSTM also performs better than individual CNN, LSTM and BLSTM models trained for the same task.

Harish Yenala, Manoj Chinnakotla, Jay Goyal
A Fast and Easy Regression Technique for k-NN Classification Without Using Negative Pairs

This paper proposes an inexpensive way to learn an effective dissimilarity function to be used for k-nearest neighbor (k-NN) classification. Unlike Mahalanobis metric learning methods that map both query (unlabeled) objects and labeled objects to new coordinates by a single transformation, our method learns a transformation of labeled objects to new points in the feature space whereas query objects are kept in their original coordinates. This method has several advantages over existing distance metric learning methods: (i) In experiments with large document and image datasets, it achieves k-NN classification accuracy better than or at least comparable to the state-of-the-art metric learning methods. (ii) The transformation can be learned efficiently by solving a standard ridge regression problem. For document and image datasets, training is often more than two orders of magnitude faster than the fastest metric learning methods tested. This speed-up is also due to the fact that the proposed method eliminates the optimization over “negative” object pairs, i.e., objects whose class labels are different. (iii) The formulation has a theoretical justification in terms of reducing hubness in data.

Yutaro Shigeto, Masashi Shimbo, Yuji Matsumoto
Deep Network Regularization via Bayesian Inference of Synaptic Connectivity

Deep neural networks (DNNs) often require good regularizers to generalize well. Currently, state-of-the-art DNN regularization techniques consist in randomly dropping units and/or connections on each iteration of the training algorithm. Dropout and DropConnect are characteristic examples of such regularizers, that are widely popular among practitioners. However, a drawback of such approaches consists in the fact that their postulated probability of random unit/connection omission is a constant that must be heuristically selected based on the obtained performance in some validation set. To alleviate this burden, in this paper we regard the DNN regularization problem from a Bayesian inference perspective: We impose a sparsity-inducing prior over the network synaptic weights, where the sparsity is induced by a set of Bernoulli-distributed binary variables with Beta (hyper-)priors over their prior parameters. This way, we eventually allow for marginalizing over the DNN synaptic connectivity for output generation, thus giving rise to an effective, heuristics-free, network regularization scheme. We perform Bayesian inference for the resulting hierarchical model by means of an efficient Black-Box Variational inference scheme. We exhibit the advantages of our method over existing approaches by conducting an extensive experimental evaluation using benchmark datasets.

Harris Partaourides, Sotirios P. Chatzis
Adaptive One-Class Support Vector Machine for Damage Detection in Structural Health Monitoring

Machine learning algorithms have been employed extensively in the area of structural health monitoring to compare new measurements with baselines to detect any structural change. One-class support vector machine (OCSVM) with Gaussian kernel function is a promising machine learning method which can learn only from one class data and then classify any new query samples. However, generalization performance of OCSVM is profoundly influenced by its Gaussian model parameter $$\sigma $$. This paper proposes a new algorithm named Appropriate Distance to the Enclosing Surface (ADES) for tuning the Gaussian model parameter. The semantic idea of this algorithm is based on inspecting the spatial locations of the edge and interior samples, and their distances to the enclosing surface of OCSVM. The algorithm selects the optimal value of $$\sigma $$ which generates a hyperplane that is maximally distant from the interior samples but close to the edge samples. The sets of interior and edge samples are identified using a hard margin linear support vector machine. The algorithm was successfully validated using sensing data collected from the Sydney Harbour Bridge, in addition to five public datasets. The designed ADES algorithm is an appropriate choice to identify the optimal value of $$\sigma $$ for OCSVM especially in high dimensional datasets.

Ali Anaissi, Nguyen Lu Dang Khoa, Samir Mustapha, Mehrisadat Makki Alamdari, Ali Braytee, Yang Wang, Fang Chen
A Classification Model for Diverse and Noisy Labelers

With the popularity of the Internet and crowdsourcing, it becomes easier to obtain labeled data for specific problems. Therefore, learning from data labeled by multiple annotators has become a common scenario these days. Since annotators have different expertise, labels acquired from them might not be perfectly accurate. This paper derives an optimization framework to solve this task through estimating the expertise of each annotator and the labeling difficulty for each instance. In addition, we introduce similarity metric to enable the propagation of annotations between instances.

Hao-En Sung, Cheng-Kuan Chen, Han Xiao, Shou-De Lin
Automatic Discovery of Common and Idiosyncratic Latent Effects in Multilevel Regression

We present a flexible non-parametric generative model for multilevel regression that strikes an automatic balance between identifying common effects across groups while respecting their idiosyncrasies. The model is built using techniques that are now considered standard in the statistical parameter estimation literature, namely, Hierarchical Dirichlet processes (HDP) and Hierarchical Generalized Linear Models (HGLM), and therefore, we name it “Infinite Mixtures of Hierarchical Generalized Linear Models” (iHGLM). We demonstrate how the use of a HDP prior in local, groupwise GLM modeling of response-covariate densities allows iHGLM to capture latent similarities and differences within and across groups. We demonstrate iHGLM’s superior accuracy in comparison to well known competing methods like Generalized Linear Mixed Model (GLMM), Regression Tree, Least Square Regression, Bayesian Linear Regression, Ordinary Dirichlet Process Regression, and several other regression models on several synthetic and real world datasets.

Sk Minhazul Islam, Arunvava Banerjee
A Deep Neural Network for Pairwise Classification: Enabling Feature Conjunctions and Ensuring Symmetry

Pairwise classification is a computational problem to determine whether a given ordered pair of objects satisfies a binary relation R which is specified implicitly by a set of training data used for ‘learning’ R. It is an important component for entity resolution, network link prediction, protein-protein interaction prediction, and so on. Although deep neural networks (DNNs) outperform other methods in many tasks and have thus attracted the attention of machine learning researchers, there have been few studies of applying a DNN to pairwise classification. Important properties of pairwise classification include using feature conjunctions across examples. Also, it is known that making the classifier invariant to the data order is an important property in applications with a symmetric relation R, including those applications mentioned above. We first show that a simple DNN with fully connected layers cannot satisfy these properties and then present a pairwise DNN satisfying these properties. As an example of pairwise classification, we use the author matching problem, which is the problem of determining whether two author names in different bibliographic data sources refer to the same person. We show that the method using our model outperforms methods using a support vector machine and simple DNNs.

Kyohei Atarashi, Satoshi Oyama, Masahito Kurihara, Kazune Furudo
Feature Ranking of Large, Robust, and Weighted Clustering Result

A clustering result needs to be interpreted and evaluated for knowledge discovery. When clustered data represents a sample from a population with known sample-to-population alignment weights, both the clustering and the evaluation techniques need to take this into account. The purpose of this article is to advance the automatic knowledge discovery from a robust clustering result on the population level. For this purpose, we derive a novel ranking method by generalizing the computation of the Kruskal-Wallis H test statistic from sample to population level with two different approaches. Application of these enlargements to both the input variables used in clustering and to metadata provides automatic determination of variable ranking that can be used to explain and distinguish the groups of population. The ranking method is illustrated with an open data and then, applied to advance the educational knowledge discovery from large-scale international student assessment data, whose robust clustering into disjoint groups on three different levels of abstraction was performed in [19].

Mirka Saarela, Joonas Hämäläinen, Tommi Kärkkäinen
Purchase Signatures of Retail Customers

In the retail context, there is an increasing need for understanding individual customer behavior in order to personalize marketing actions. We propose the novel concept of customer signature, that identifies a set of important products that the customer refills regularly. Both the set of products and the refilling time periods give new insights on the customer behavior. Our approach is inspired by methods from the domain of sequence segmentation, thus benefiting from efficient exact and approximate algorithms. Experiments on a real massive retail dataset show the interest of the signatures for understanding individual customers.

Clement Gautrais, René Quiniou, Peggy Cellier, Thomas Guyet, Alexandre Termier
Volatility Adaptive Classifier System

A data stream’s concept may evolve over time, which is known as the concept drift. Concept drifts affect the prediction accuracy of the learning model and are required to be handled to maintain the model quality. In most cases, there is a trade-off between maintaining prediction quality and learning efficiency. We present a novel framework known as the Volatility-Adaptive Classifier System (VACS) to balance the trade-off. The system contains an adaptive classifier and a non-adaptive classifier. The former can maintain a higher prediction quality but requires additional computational overhead, and the latter requires less computational overhead but its prediction quality may be susceptible to concept drifts. VACS automatically applies the adaptive classifier when the concept drifts are frequent, and switches to the non-adaptive classifier when drifts are infrequent. As a result, VACS can maintain a relatively low computational cost while still maintaining a high enough overall prediction quality. To the best of our knowledge, this is the first data stream mining framework that applies different learners to reduce the learning overheads.

Ruolin Jia, Yun Sing Koh, Gillian Dobbie
Distributed Representations for Words on Tables

We consider a problem of word embedding for tables, and we obtain distributed representations for words found in tables. We propose a table word-embedding method, which considers both horizontal and vertical relations between cells to estimate appropriate word embedding for words in tables. We propose objective functions that make use of horizontal and vertical relations, both individually and jointly.

Minoru Yoshida, Kazuyuki Matsumoto, Kenji Kita
Link Prediction for Isolated Nodes in Heterogeneous Network by Topic-Based Co-clustering

This paper presents a new probabilistic generative model (PGM) that predicts links for isolated nodes in a heterogeneous network using textual data. In conventional PGMs, a link between two nodes is predicted on the basis of the nodes’ other existing links. This method makes it difficult to predict links for isolated nodes, which happens when new items are recommended. In this study, we first naturally expand the relational topic model (RTM) to a heterogeneous network (Hetero-RTM). However, this simple extension degrades performance in a link prediction for existing nodes. We present a new model called the Grouped Hetero-RTM that has both latent topics and latent clusterings. Through intensive experiments that simulate real recommendation problems, the Grouped Hetero-RTM outperforms baseline methods at predicting links for isolated nodes. This model, furthermore, performs as effectively as the stochastic block model in the link prediction for existing nodes. We also find that the Grouped Hetero-RTM is effective for various textual data such as item reviews and movie descriptions.

Katsufumi Tomobe, Masafumi Oyamada, Shinji Nakadai
Predicting Destinations from Partial Trajectories Using Recurrent Neural Network

Predicting a user’s destinations from his or her partial movement trajectories is still a challenging problem. To this end, we employ recurrent neural networks (RNNs), which can consider long-term dependencies and avoid a data sparsity problem. This is because the RNNs store statistical weights for long-term transitions in location sequences unlike conventional Markov process-based methods that count the number of short-term transitions. However, how to apply the RNNs to the destination prediction is not straight-forward, and thus we propose an efficient and accurate method for this problem. Specifically, our method represents trajectories as discretized features in a grid space and feeds sequences of them to the RNN model, which estimates the transition probabilities in the next timestep. Using these one-step transition probabilities, the visiting probabilities for the destination candidates are efficiently estimated by simulating the movements of objects based on stochastic sampling with an RNN encoder-decoder framework. We evaluate the proposed method on two different real datasets, i.e., taxi and personal trajectories. The results demonstrate that our method can predict destinations more accurately than state-of-the-art methods.

Yuki Endo, Kyosuke Nishida, Hiroyuki Toda, Hiroshi Sawada
Preventing Inadvertent Information Disclosures via Automatic Security Policies

Enterprises constantly share and exchange digital documents with sensitive information both within the organization and with external partners/customers. With the increase in digital data sharing, data breaches have also increased significantly resulting in sensitive information being accessed by unintended recipients. To protect documents against such unauthorized access, the documents are assigned a security policy which is a set of users and information about their access permissions on the document. With the surge in the volume of digital documents, manual assignment of security policies is infeasible and error prone calling for an automatic policy assignment. In this paper, we propose an algorithm that analyzes the sensitive information and historic access permissions to identify content-access correspondence via a novel multi-label classifier formulation. The classifier thus modeled is capable of recommending policies/access permissions for any new document. Comparisons with existing approaches in this space shows superior performance with the proposed framework across several evaluation criteria.

Tanya Goyal, Sanket Mehta, Balaji Vasan Srinivasan
Personalized Deep Learning for Tag Recommendation

Social media services deploy tag recommendation systems to facilitate the process of tagging objects which depends on the information of both the user’s preferences and the tagged object. However, most image tag recommender systems do not consider the additional information provided by the uploaded image but rely only on textual information, or make use of simple low-level image features. In this paper, we propose a personalized deep learning approach for the image tag recommendation that considers the user’s preferences, as well as visual information. We employ Convolutional Neural Networks (CNNs), which already provide excellent performance for image classification and recognition, to obtain visual features from images in a supervised way. We provide empirical evidence that features selected in this fashion improve the capability of tag recommender systems, compared to the current state of the art that is using hand-crafted visual features, or is solely based on the tagging history information. The proposed method yields up to at least two percent accuracy improvement in two real world datasets, namely NUS-WIDE and Flickr-PTR.

Hanh T. H. Nguyen, Martin Wistuba, Josif Grabocka, Lucas Rego Drumond, Lars Schmidt-Thieme
Information-Theoretic Non-redundant Subspace Clustering

A comprehensive understanding of complex data requires multiple different views. Subspace clustering methods open up multiple interesting views since they support data objects to be assigned to different clusters in different subspaces. Conventional subspace clustering methods yield many redundant clusters or control redundancy by difficult to set parameters. In this paper, we employ concepts from information theory to naturally trade-off the two major properties of a subspace cluster: The quality of a cluster and its redundancy with respect to the other clusters. Our novel algorithm NORD (for NOn-ReDundant) efficiently discovers the truly relevant clusters in complex data sets without requiring any kind of threshold on their redundancy. NORD also exploits the concept of microclusters to support the detection of arbitrarily-shaped clusters. Our comprehensive experimental evaluation shows the effectiveness and efficiency of NORD on both synthetic and real-world data sets and provides a meaningful visualization of both the quality and the degree of the redundancy of the clustering result on first glance.

Nina Hubig, Claudia Plant
Cost Matters: A New Example-Dependent Cost-Sensitive Logistic Regression Model

Connectivity and automation are evermore part of today’s cars. To provide automation, many gauges are integrated in cars to collect physical readings. In the automobile industry, the gathered multiple datasets can be used to predict whether a car repair is needed soon. This information gives drivers and retailers helpful information to take action early. However, prediction in real use cases shows new challenges: misclassified instances have not equal but different costs. For example, incurred costs for not predicting a necessarily needed tire change are usually higher than predicting a tire change even though the car could still drive thousands of kilometers. To tackle this problem, we introduce a new example-dependent cost sensitive prediction model extending the well-established idea of logistic regression. Our model allows different costs of misclassified instances and obtains prediction results leading to overall less cost. Our method consistently outperforms the state-of-the-art in example-dependent cost-sensitive logistic regression on various datasets. Applying our methods to vehicle data from a large European car manufacturer, we show cost savings of about 10%.

Nikou Günnemann, Jürgen Pfeffer

Social Network and Graph Mining

Frontmatter
Beyond Assortativity: Proclivity Index for Attributed Networks (ProNe)

If Alice is majoring in Computer Science, can we guess the major of her friend Bob? Even harder, can we determine Bob’s age or sexual orientation? Attributed graphs are ubiquitous, occurring in a wide variety of domains; yet there is limited literature on the study of the interplay between the attributes associated to nodes and edges connecting them. Our work bridges this gap by addressing the following questions: Given the network structure, (i) which attributes and (ii) which pairs of attributes show correlation? Prior work has focused on the first part, under the name of assortativity (closely related to homophily). In this paper, we propose ProNe, the first measure to handle pairs of attributes (e.g., major and age). The proposed ProNe is (a) thorough, handling both homophily and heterophily (b) general, quantifying correlation of a single attribute or a pair of attributes (c) consistent, yielding a zero score in the absence of any structural correlation. Furthermore, ProNe can be computed fast in time linear in the network size and is highly useful, with applications in data imputation, marketing, personalization and privacy protection.

Reihaneh Rabbany, Dhivya Eswaran, Artur W. Dubrawski, Christos Faloutsos
Hierarchical Mixed Neural Network for Joint Representation Learning of Social-Attribute Network

Most existing network representation learning (NRL) methods are designed for homogeneous network, which only consider topological properties of networks. However, in real-world networks, text or categorical attributes are usually associated with nodes, providing another description for networks in a different perspective.In this paper, we present a joint learning approach which learns the representations of nodes and attributes in the same low-dimensional vector space simultaneously. Particularly, we show that more discriminative node representations can be acquired by leveraging attribute features. The experiments conducted on three social-attribute network datasets demonstrate that our model outperforms several state-of-the-art baselines significantly for node classification task and network visualization task.

Weizheng Chen, Jinpeng Wang, Zhuoxuan Jiang, Yan Zhang, Xiaoming Li
Cost-Effective Viral Marketing in the Latency Aware Independent Cascade Model

A time-constrained viral marketing campaign allows a business to promote a product or event to social network users within a certain time duration. To perform a time-constrained campaign, existing works select the duration of the campaign, and then a set of k seeds that maximize the spread (expected number of users to which the product or event is promoted) for the selected duration. In practice, however, there are many alternative durations, which determine the monetary cost of the campaign and lead to seeds with substantially different spread. In this work, we aim to select the duration of the campaign and a set of k seeds, so that the campaign has the maximum spread-to-cost ratio (i.e., cost-effectiveness). We formulate this task as an optimization problem, under the LAIC information diffusion model. The problem is challenging to solve efficiently, particularly when there are many alternative durations. Thus, we develop an approximation algorithm that employs dynamic programming to compute the spread of seeds for several possible durations simultaneously. We also introduce a new optimization technique that is able to provide an additional performance speed-up by pruning durations that cannot lead to a solution. Experiments on real and synthetic data show the effectiveness and efficiency of our algorithm.

Robert Gwadera, Grigorios Loukides
DSBPR: Dual Similarity Bayesian Personalized Ranking

Modern social recommendation has been steadily receiving more attention, which utilizes social relations among users to improve the efficiency of recommendation. However, most social recommendation methods only consider simple similarity information of users as social regularization and ignore the improvement of predictors of people’s opinions. Meanwhile, due to the simple characteristics of data in various applications, previous works mostly leverage pointwise methods based on absolute rating assumption to solve the problem. In this paper, we propose a novel Dual Similarity Bayesian Personalized Ranking model to incorporate the similarity information of users and items into our preference predictor function. Having improved the preference predictor, we employ Bayesian Personalized Ranking model as training procedure which is a pairwise method. Empirical results on three public datasets show that our proposed model is an efficient algorithm compared with the state-of-the-art methods.

Longfei Shi, Bin Wu, Jing Zheng, Chuan Shi, Mengxin Li
Usage Based Tag Enhancement of Images

Appropriate tagging of images is at the heart of efficient recommendation and retrieval and is used for indexing image content. Existing technologies in image tagging either focus on what the image contains based on a visual analysis or utilize the tags from the textual content accompanying the images as the image tags. While the former is insufficient to get a complete understanding of how the image is perceived and used in various context, the latter results in a lot of irrelevant tags particularly when the accompanying text is large. To address this issue, we propose an algorithm based on graph-based random walk that extracts only image-relevant tags from the accompanying text. We perform detailed evaluation of our scheme by checking its viability using human annotators as well as by comparing with state-of-the art algorithms. Experimental results show that the proposed algorithm outperforms base-line algorithms with respect to different metrics.

Balaji Vasan Srinivasan, Noman Ahmed Sheikh, Roshan Kumar, Saurabh Verma, Niloy Ganguly
Edge Role Discovery via Higher-Order Structures

Previous work in network analysis has focused on modeling the roles of nodes in graphs. In this paper, we introduce edge role discovery and propose a framework for learning and extracting edge roles from large graphs. We also propose a general class of higher-order role models that leverage network motifs. This leads us to develop a novel edge feature learning approach for role discovery that begins with higher-order network motifs and automatically learns deeper edge features. All techniques are parallelized and shown to scale well. They are also efficient with a time complexity of $$\mathcal {O}(|E|)$$. The experiments demonstrate the effectiveness of our model for a variety of ML tasks such as improving classification and dynamic network analysis.

Nesreen K. Ahmed, Ryan A. Rossi, Theodore L. Willke, Rong Zhou
Efficient Bi-level Variable Selection and Application to Estimation of Multiple Covariance Matrices

Variable selection plays an important role in analyzing high dimensional data. When the data possesses certain group structures in which individual variables are also meaningful scientifically, we are naturally interested in selecting important groups as well as important variables. We introduce a new regularization by combining the $$\ell _{p,0}$$-norm and $$\ell _0$$-norm for bi-level variable selection. Using an appropriate DC (Difference of Convex functions) approximation, the resulting problem can be solved by DC Algorithm. As an application, we implement the proposed algorithm for estimating multiple covariance matrices sharing some common structures such as the locations or weights of non-zero elements. The experimental results on both simulated and real datasets demonstrate the efficiency of our algorithm.

Duy Nhat Phan, Hoai An Le Thi, Dinh Tao Pham
Entity Set Expansion with Meta Path in Knowledge Graph

Entity set expansion (ESE) is the problem that expands a small set of seed entities into a more complete set, entities of which have common traits. As a popular data mining task, ESE has been widely used in many applications, such as dictionary construction and query suggestion. Contemporary ESE mainly utilizes text and Web information. That is, the intrinsic relation among entities is inferred from their occurrences in text or Web. With the surge of knowledge graph in recent years, it is possible to extend entities according to their occurrences in knowledge graph. In this paper, we consider the knowledge graph as a heterogeneous information network (HIN) that contains different types of objects and links, and propose a novel method, called MP_ESE, to extend entities in the HIN. The MP_ESE employs meta paths, a relation sequence connecting entities, in HIN to capture the implicit common traits of seed entities, and an automatic meta path generation method, called SMPG, is provided to exploit the potential relations among entities. With these generated and weighted meta paths, the MP_ESE can effectively extend entities. Experiments on real datasets validate the effectiveness of MP_ESE.

Yuyan Zheng, Chuan Shi, Xiaohuan Cao, Xiaoli Li, Bin Wu
Using Network Flows to Identify Users Sharing Extremist Content on Social Media

Social media has been leveraged by many groups to share their ideas, ideology, and other messages. Some of these posts promote extremist ideology. In this paper, we propose an approach for identifying users who engage in extremist discussions online. Our approach uses detailed feature selection to identify relevant posts and then uses a novel weighted network that models the information flow between the publishers of the relevant posts. An empirical evaluation of a post collection crawled from a web forum containing racially driven discussions and a tweet stream discussing the ISIS extremist group show that our proposed method for relevant post identification is significantly better than the state of the art and using a network flow graph for user identification leads to very accurate user identification.

Yifang Wei, Lisa Singh
MC3: A Multi-class Consensus Classification Framework

In this paper, we propose MC3, an ensemble framework for multi-class classification. MC3 is built on “consensus learning”, a novel learning paradigm where each individual base classifier keeps on improving its classification by exploiting the outcomes obtained from other classifiers until a consensus is reached. Based on this idea, we propose two algorithms, MC3-R and MC3-S that make different trade-offs between quality and runtime. We conduct rigorous experiments comparing MC3-R and MC3-S with 12 baseline classifiers on 13 different datasets. Our algorithms perform as well or better than the best baseline classifier, achieving on average, a 5.56% performance improvement. Moreover, unlike existing baseline algorithms, our algorithms also improve the performance of individual base classifiers up to 10%. (The code is available at https://github.com/MC3-code.)

Tanmoy Chakraborty, Des Chandhok, V. S. Subrahmanian
Monte Carlo Based Incremental PageRank on Evolving Graphs

Computing PageRank for enormous and frequently evolving real-world network consumes sizable resource and comes with large computational overhead. To address this problem, IMCPR, an incremental PageRank algorithm based on Monte Carlo method is proposed in this paper. IMCPR computes PageRank scores via updating previous results accumulatively according to the changed part of network, instead of recomputing from scratch. IMCPR effectively improves the performance and brings no additional storage overhead. Theoretical analysis shows that the time complexity of IMCPR to update PageRank scores for a network with m changed nodes and n changed edges is O((m+n/c)/c), where c is reset probability. It takes O(1) works to update PageRank scores as inserting/removing a node or edge. The time complexity of IMCPR is better than other existing state-of-art algorithms for most real-world graphs. We evaluate IMCPR with real-world networks from different backgrounds upon Hama, a distributed platform. Experiments demonstrate that IMCPR obtains PageRank scores with equal (or even higher) accuracy as the baseline Monte Carlo based PageRank algorithm and reduces the amount of computation significantly compared to other existing incremental algorithm.

Qun Liao, ShuangShuang Jiang, Min Yu, Yulu Yang, Tao Li
Joint Weighted Nonnegative Matrix Factorization for Mining Attributed Graphs

Graph clustering has been extensively studied in the past decades, which can serve many real world applications, such as community detection, big network management and protein network analysis. However, the previous studies focus mainly on clustering with graph topology information. Recently, as the advance of social networks and Web 2.0, many graph datasets produced contain both the topology and node attribute information, which are known as attributed graphs. How to effectively utilize the two types of information for clustering thus becomes a hot research topic. In this paper, we propose a new attributed graph clustering method, JWNMF, which integrates topology structure and node attributes by a new collective nonnegative matrix factorization method. On the one hand, JWNMF employs a factorization for topology structure. On the other hand, it designs a weighted factorization for nodes’ attributes, where the weights are automatically determined to discriminate informative and uninformative attributes for clustering. Experimental results on seven real-world datasets show that our method significantly outperforms state-of-the-art attributed graph clustering methods.

Zhichao Huang, Yunming Ye, Xutao Li, Feng Liu, Huajie Chen
Predicting Happiness State Based on Emotion Representative Mining in Online Social Networks

Online social networks (OSNs) have become a major platform for people to obtain information and to interact with their friends. People tend to post their thoughts and activities online and share their emotions with friends, which provides a good opportunity to study the role of online social networks in happiness spreading and mutual influence among the users. In this paper, we propose a framework to study the influence of happiness in OSNs. We first quantify the happiness states of users by analyzing their daily posting texts, and then conduct the statistical analysis to show that users’ happiness states are influenced by their social network neighbors. Since the influence of each individual is unequal, we develop a regression model and a greedy algorithm to detect the high influence users known as emotion representatives. By using a small number of detected emotion representatives as features to train prediction models, we show that it achieves good performance in predicting the happiness states of the whole online social network users.

Xiao Zhang, Wenzhong Li, Hong Huang, Cam-Tu Nguyen, Xu Chen, Xiaoliang Wang, Sanglu Lu
Exploring Celebrities on Inferring User Geolocation in Twitter

Location information of social media users provides crucial context to monitor real-time events such as natural disasters, terrorism and epidemics. Since only a small amount of social media data are geotagged, inference techniques play a substantial role to predict user spatial locations by incorporating characteristics of their behavior. Based on utilized source of information, related works are divided into text-based (based on text posted by users), network-based (based on the friendship network), and some hybrid methods. In this paper, we propose a novel approach based on the notion of celebrities to infer the location of Twitter users. We categorize highly-mentioned users (celebrities) into local and global, and consequently utilize local celebrities as a major location indicator for inference. A label propagation algorithm is then utilized over a refined social network for geolocation inference. Finally, we propose a hybrid approach by merging a text-based method as a back-off strategy into our network-based approach. Empirical experiments using three standard Twitter benchmark datasets demonstrate the superior performance of our approach over the state-of-the-art methods.

Mohammad Ebrahimi, Elaheh ShafieiBavani, Raymond Wong, Fang Chen
Do Rumors Diffuse Differently from Non-rumors? A Systematically Empirical Analysis in Sina Weibo for Rumor Identification

With the prosperity of social media, online rumors become a severe social problem, which often lead to serious consequences, e.g., social panic and even chaos. Therefore, how to automatically identify rumors in social media has attracted much research attention. Most existing studies address this problem by extracting features from the contents of rumors and their reposts as well as the users involved. For these features, especially diffusion features, these works ignore systematic analysis and the exploration of difference between rumors and non-rumors, which exert targeted effect on rumor identification. In this paper, we systematically investigate this problem from a diffusion perspective using Sina Weibo data. We first extract a group of new features from the diffusion processes of messages and then make a few important observations on them. Based on these features, we develop classifiers to discriminate rumors and non-rumors. Experimental comparisons with the state-of-the-arts methods demonstrate the effectiveness of these features.

Yahui Liu, Xiaolong Jin, Huawei Shen, Xueqi Cheng
Investigating the Dynamics of Religious Conflicts by Mining Public Opinions on Social Media

The powerful emergence of religious faith and beliefs within political and social groups, now leading to discrimination and violence against other communities has become an important problem for the government and law enforcement agencies. In this paper, we address the challenges and gaps of offline surveys by mining the public opinions, sentiments and beliefs shared about various religions and communities. Due to the presence of descriptive posts, we conduct our experiments on Tumblr website- the second most popular microblogging service. Based on our survey among 3 different groups of 60 people, we define 11 dimensions of public opinion and beliefs that can identify the contrast of conflict in religious posts. We identify various linguistic features of Tumblr posts using topic modeling and linguistic inquiry and word count. We investigate the efficiency of dimensionality reduction techniques and semi-supervised classification methods for classifying the posts into various dimensions of conflicts. Our results reveal that linguistic features such as emotions, language variables, personality traits, social process, and informal language are the discriminatory features for identifying the dynamics of conflict in religious posts.

Swati Agarwal, Ashish Sureka
Mining High-Utility Itemsets with Both Positive and Negative Unit Profits from Uncertain Databases

Some important limitation of frequent itemset mining are that it assumes that each item cannot appear more than once in each transaction, and all items have the same importance (weight, cost, risk, unit profit or value). These assumptions often do not hold in real-world applications. For example, consider a database of customer transactions containing information about the purchase quantities of items in each transaction and the positive or negative unit profit of each item. Besides, uncertainty is commonly embedded in collected data in real-life applications. To address this issue, we propose an efficient algorithm named HUPNU (mining High-Utility itemsets with both Positive and Negative unit profits from Uncertain databases), the high qualified patterns can be discovered effectively for decision-making. Based on the designed vertical PU$$^{\pm }$$-list (Probability-Utility list with Positive-and-Negative profits) structure and several pruning strategies, HUPNU can directly discovers the potential high-utility itemsets without generating candidates.

Wensheng Gan, Jerry Chun-Wei Lin, Philippe Fournier-Viger, Han-Chieh Chao, Vincent S. Tseng
Sparse Stochastic Inference with Regularization

The massive amount of digital text information and delivering them in streaming manner pose challenges for traditional inference algorithms. Recently, advances in stochastic inference algorithms have made it feasible to learn topic models from very large-scale collections of documents. In this paper, we however point out that many existing approaches are prone to overfitting for extremely large/infinite datasets. The possibility of overfitting is particularly high in streaming environments. This finding suggests to use regularization for stochastic inference. We then propose a novel stochastic algorithm for learning latent Dirichlet allocation that uses regularization when updating global parameters and utilizes sparse Gibb sampling to do local inference. We study the performance of our algorithm on two massive data sets and demonstrate that it surpasses the existing algorithms in various aspects.

Tung Doan, Khoat Than
Exploring Check-in Data to Infer Social Ties in Location Based Social Networks

Social Networking Services (SNS), such as Facebook, Twitter, and Foursquare, allow users to perform check-in and share their location data. Given the check-in data records, we can extract the features (e.g., the spatial-temporal features) to infer the social ties. The challenge of this inference task is to differentiate between real friends and strangers by solely observing their mobility patterns. In this paper, we explore the meeting events or co-occurrences from users’ check-in data. We derive three key features from users’ meeting events and propose a framework called SCI framework (Social Connection Inference framework) which integrates all derived features to differentiate coincidences from real friends’ meetings. Extensive experiments on two location-based social network datasets show that the proposed SCI framework can outperform the state-of-the-art method.

Gunarto Sindoro Njoo, Min-Chia Kao, Kuo-Wei Hsu, Wen-Chih Peng
Scalable Twitter User Clustering Approach Boosted by Personalized PageRank

Twitter has been the focus of analysis in recent years due to various interesting and challenging problems, one of them being Clustering of its Users based on their interests. For graphs, there are many clustering approaches which look at either the structure or at its contents. However, when we consider real world data such as Twitter Data, structural approaches may produce many different user clusters with similar interests. Similarly, content-based clustering approaches on Twitter Data produce inferior results due limited length of Tweet and due to lots of garbled data. Hence, these approaches cannot be directly used for practical applications. In this paper, we have made an effort to cluster Twitter Users based on their interest, looking at both the structure of the graph generated using Twitter Data, as well as its contents. By combining these approaches, we improve our results compared to the existing techniques, thereby generating results befitting the practical applications.

Anup Naik, Hideyuki Maeda, Vibhor Kanojia, Sumio Fujita
An Enhanced Markov Clustering Algorithm Based on Physarum

Community mining is a vital problem for complex network analysis. Markov chains based algorithms are known as its easy-to-implement and have provided promising solutions for community mining. Existing Markov clustering algorithms have been optimized from the aspects of parallelization and penalty strategy. However, the dynamic process for enlarging the inhomogeneity attracts little attention. As the key mechanism of Markov chains based algorithms, such process affects the qualities of divisions and computational cost directly. This paper proposes a hybrid algorithm based on Physarum, a kind of slime. The new algorithm enhances the dynamic process of Markov clustering algorithm by embedding the Physarum-inspired feedback system. Specifically, flows between vertexes can enhance the corresponding transition probability in Markov clustering algorithms, and vice versa. Some networks with known and unknown community structures are used to estimate the performance of our proposed algorithms. Extensive experiments show that the proposed algorithm has higher NMI, Q values and lower computational cost than that of the typical algorithms.

Mingxin Liang, Chao Gao, Xianghua Li, Zili Zhang
Exploiting Geographical Location for Team Formation in Social Coding Sites

Social coding sites (SCSs) such as GitHub and BitBucket are collaborative platforms where developers from different background (e.g., culture, language, location, skills) form a team to contribute to a shared project collaboratively. One essential task of such collaborative development is how to form a optimal team where each member makes his/her greatest contribution, which may have a great effect on the efficiency of collaboration. To the best of knowledge, all existing related works model the team formation problem as minimizing the communication cost among developers or taking the workload of individuals into account, ignoring the impact of geographical location of each developer. In this paper, we aims to exploit the geographical proximity factor to improve the performance of team formation in social coding sites. Specifically, we incorporate the communication cost and geographical proximity into a unified objective function and propose a genetic algorithm to optimize it. Comprehensive experiments on a real-world dataset (e.g., GitHub) demonstrate the performance of the proposed model with the comparison of some state-of-the-art ones.

Yuqiang Han, Yao Wan, Liang Chen, Guandong Xu, Jian Wu
Weighted Simplicial Complex: A Novel Approach for Predicting Small Group Evolution

The study of small collaborations or teams is an important endeavor both in industry and academia. The social phenomena responsible for formation or evolution of such small groups is quite different from those for dyadic relations like friendship or large size guilds (or communities). In small groups when social actors collaborate for various tasks over time, the actors common across collaborations act as bridges which connect groups into a network of groups. Evolution of groups is affected by this network structure. Building appropriate models for this network is an important problem in the study of group evolution. This work focuses on the problem of group recurrence prediction. In order to overcome the shortcomings of two traditional group network modeling approaches: hypergraph and simplicial complex, we propose a hybrid approach: Weighted Simplicial Complex (WSC). We develop a Hasse diagram based framework to study WSCs and build several predictive models for group recurrence based on this approach. Our results demonstrate the effectiveness of our approach.

Ankit Sharma, Terrence J. Moore, Ananthram Swami, Jaideep Srivastava
A P-LSTM Neural Network for Sentiment Classification

Neural network models have been demonstrated to be capable of achieving remarkable performance in sentiment classification. Convolutional neural network (CNN) and recurrent neural network (RNN) are two mainstream architectures for such modelling task. In this work, a novel model based on long short-term memory recurrent neural network (LSTM) called P-LSTM is proposed for sentiment classification. In P-LSTM, three-words phrase embedding is used instead of single word embedding as is often done. Besides, P-LSTM introduces the phrase factor mechanism which combines the feature vectors of the phrase embedding layer and the LSTM hidden layer to extract more exact information from the text. The experimental results show that the P-LSTM achieves excellent performance on the sentiment classification tasks.

Chi Lu, Heyan Huang, Ping Jian, Dan Wang, Yi-Di Guo
Learning What Matters – Sampling Interesting Patterns

In the field of exploratory data mining, local structure in data can be described by patterns and discovered by mining algorithms. Although many solutions have been proposed to address the redundancy problems in pattern mining, most of them either provide succinct pattern sets or take the interests of the user into account—but not both. Consequently, the analyst has to invest substantial effort in identifying those patterns that are relevant to her specific interests and goals.To address this problem, we propose a novel approach that combines pattern sampling with interactive data mining. In particular, we introduce the LetSIP algorithm, which builds upon recent advances in (1) weighted sampling in SAT and (2) learning to rank in interactive pattern mining. Specifically, it exploits user feedback to directly learn the parameters of the sampling distribution that represents the user’s interests.We compare the performance of the proposed algorithm to the state-of-the-art in interactive pattern mining by emulating the interests of a user. The resulting system allows efficient and interleaved learning and sampling, thus user-specific anytime data exploration. Finally, LetSIP demonstrates favourable trade-offs concerning both quality–diversity and exploitation–exploration when compared to existing methods.

Vladimir Dzyuba, Matthijs van Leeuwen
PNE: Label Embedding Enhanced Network Embedding

Unsupervised NRL (Network Representation Learning) methods only consider the network structure information, which makes their learned node representations less discriminative. To utilize the label information of the partially labeled network, several semi-supervised NRL methods are proposed. The key idea of these methods is to merge the representation learning step and the classifier training step together. However, it is not flexible enough and their parameters are often hard to tune. In this paper, we provide a new point of view for semi-supervised NRL and present a novel model named Predictive Network Embedding (PNE). Briefly, we embed nodes and labels into the same latent space instead of training a classifier in the representation learning process. Thus the discriminability of node representations is enhanced by incorporating the label information. We conduct node classification task on four real world datasets. The experimental results demonstrate that our model significantly outperforms the state-of-the-art baselines.

Weizheng Chen, Xianling Mao, Xiangyu Li, Yan Zhang, Xiaoming Li
Improving Temporal Record Linkage Using Regression Classification

Temporal record linkage is the process of identifying groups of records that are collected over a period of time, such as in census or voter registration databases, where records in the same group represent the same real-world entity. Such databases often contain temporal information, such as the time when a record was created or when it was modified. Unlike traditional record linkage, which considers differences between records from the same entity as errors or variations, temporal record linkage aims to capture records from entities where the attribute values are known to change over time. In this paper we propose a novel approach that extends an existing temporal approach called decay model, to categorically calculate probabilities of change for each attribute. Our novel method uses a regression-based machine learning model to predict decays for sets of attributes. Each such set of attributes has a principle attribute and support attributes, where values of the support attributes can affect the decay of the principle attribute. Our experimental results on a real US voter database show that our proposed approach results in better linkage quality compared to the decay model approach.

Yichen Hu, Qing Wang, Dinusha Vatsalan, Peter Christen
Community Detection in Graph Streams by Pruning Zombie Nodes

Detecting communities in graph streams has attracted a large amount of attention recently. Although many algorithms have been developed from different perspectives, there is still a limitation to the existing methods, that is, most of them neglect the “zombie” nodes (or unimportant nodes) in the graph stream which may badly affect the community detection result. In this paper, we aim to deal with the zombie nodes in networks so as to enhance the robustness of the detected communities. The key here is to design a pruning strategy to remove unimportant nodes and preserve the important nodes. We propose to recognize the zombie nodes by a degree centrality calculated from the exponential time-decaying edge weights, which can be efficiently updated in the graph stream case. Based on only important and active nodes, community kernels can be constructed, from which robust community structures can be obtained. One advantage of the proposed pruning strategy is that it is able to eliminate the effect of the aforementioned “zombie” nodes, leading to robust communities. By designing an efficient way to update the degree centrality, the important and active nodes can be easily obtained at each timestamp, leading to the reduction of computational complexity. Experiments have been conducted to show the effectiveness of the proposed method.

Yue Ding, Ling Huang, Chang-Dong Wang, Dong Huang
Bilingual Lexicon Extraction from Comparable Corpora Based on Closed Concepts Mining

In this paper, we propose to complement the context vectors used in bilingual lexicon extraction from comparable corpora with concept vectors, that aim at capturing all the words related to the concepts associated with a given word. This allows one to rely on a representation that is less sparse, especially in specialized domains where the use of a general bilingual lexicon leaves many words untranslated. The concept vectors we are considering are based on closed concepts mining developed in Formal Concept Analysis (FCA). The obtained results on two different comparable corpora show that enriching context vectors with concept vectors leads to lexicons of higher quality, especially in specialized domains.

Mohamed Chebel, Chiraz Latiri, Eric Gaussier
Collective Geographical Embedding for Geolocating Social Network Users

Inferring the physical locations of social network users is one of the core tasks in many online services, such as targeted advertisement, recommending local events, and urban computing. In this paper, we introduce the Collective Geographical Embedding (CGE) algorithm to embed multiple information sources into a low dimensional space, such that the distance in the embedding space reflects the physical distance in the real world. To achieve this, we introduced an embedding method with a location affinity matrix as a constraint for heterogeneous user network. The experiments demonstrate that the proposed algorithm not only outperforms traditional user geolocation prediction algorithms by collectively extracting relations hidden in the heterogeneous user network, but also outperforms state-of-the-art embedding algorithms by appropriately casting geographical information of check-in.

Fengjiao Wang, Chun-Ta Lu, Yongzhi Qu, Philip S. Yu

Privacy-Preserving Mining and Security/Risk Applications

Frontmatter
Partitioning-Based Mechanisms Under Personalized Differential Privacy

Differential privacy has recently emerged in private statistical aggregate analysis as one of the strongest privacy guarantees. A limitation of the model is that it provides the same privacy protection for all individuals in the database. However, it is common that data owners may have different privacy preferences for their data. Consequently, a global differential privacy parameter may provide excessive privacy protection for some users, while insufficient for others. In this paper, we propose two partitioning-based mechanisms, privacy-aware and utility-based partitioning, to handle personalized differential privacy parameters for each individual in a dataset while maximizing utility of the differentially private computation. The privacy-aware partitioning is to minimize the privacy budget waste, while utility-based partitioning is to maximize the utility for a given aggregate analysis. We also develop a t-round partitioning to take full advantage of remaining privacy budgets. Extensive experiments using real datasets show the effectiveness of our partitioning mechanisms.

Haoran Li, Li Xiong, Zhanglong Ji, Xiaoqian Jiang
Efficient Cryptanalysis of Bloom Filters for Privacy-Preserving Record Linkage

Privacy-preserving record linkage (PPRL) is the process of identifying records that represent the same entity across databases held by different organizations without revealing any sensitive information about these entities. A popular technique used in PPRL is Bloom filter encoding, which has shown to be an efficient and effective way to encode sensitive information into bit vectors while still enabling approximate matching of attribute values. However, the encoded values in Bloom filters are vulnerable to cryptanalysis attacks. Under specific conditions, these attacks are successful in that some frequent sensitive attribute values can be re-identified. In this paper we propose and evaluate on real databases a novel efficient attack on Bloom filters. Our approach is based on the construction principle of Bloom filters of hashing elements of sets into bit positions. The attack is independent of the encoding function and its parameters used, it can correctly re-identify sensitive attribute values even when various recently proposed hardening techniques have been applied, and it runs in a few seconds instead of hours.

Peter Christen, Rainer Schnell, Dinusha Vatsalan, Thilina Ranbaduge
Energy-Based Localized Anomaly Detection in Video Surveillance

Automated detection of abnormal events in video surveillance is an important task in research and practical applications. This is, however, a challenging problem due to the growing collection of data without the knowledge of what to be defined as “abnormal”, and the expensive feature engineering procedure. In this paper we introduce a unified framework for anomaly detection in video based on the restricted Boltzmann machine ($$\text {RBM}$$), a recent powerful method for unsupervised learning and representation learning. Our proposed system works directly on the image pixels rather than hand-crafted features, it learns new representations for data in a completely unsupervised manner without the need for labels, and then reconstructs the data to recognize the locations of abnormal events based on the reconstruction errors. More importantly, our approach can be deployed in both offline and streaming settings, in which trained parameters of the model are fixed in offline setting whilst are updated incrementally with video data arriving in a stream. Experiments on three publicly benchmark video datasets show that our proposed method can detect and localize the abnormalities at pixel level with better accuracy than those of baselines, and achieve competitive performance compared with state-of-the-art approaches. Moreover, as RBM belongs to a wider class of deep generative models, our framework lays the groundwork towards a more powerful deep unsupervised abnormality detection framework.

Hung Vu, Tu Dinh Nguyen, Anthony Travers, Svetha Venkatesh, Dinh Phung
A Fast Fourier Transform-Coupled Machine Learning-Based Ensemble Model for Disease Risk Prediction Using a Real-Life Dataset

The use of intelligent technologies in clinical decision making have started playing a vital role in improving the quality of patients’ life and helping in reduce cost and workload involved in their daily healthcare. In this paper, a novel fast Fourier transform-coupled machine learning based ensemble model is adopted for advising patients concerning whether they need to take the body test today or not based on the analysis of their medical data during the past a few days. The weighted-vote based ensemble attempts to predict the patients condition one day in advance by analyzing medical measurements of patient for the past k days. A combination of three algorithms namely neural networks, support vector machine and Naive Bayes are utilized to make an ensemble framework. A time series telehealth data recorded from patients is used for experimentations, evaluation and validation. The Tunstall dataset were collected from May to October 2012, from industry collaborator Tunstall. The experimental evaluation shows that the proposed model yields satisfactory recommendation accuracy, offers a promising way for reducing the risk of incorrect recommendations and also saving the workload for patients to conduct body tests every day. The proposed method is, therefore, a promising tool for analysis of time series data and providing appropriate recommendations to patients suffering chronic diseases with improved prediction accuracy.

Raid Lafta, Ji Zhang, Xiaohui Tao, Yan Li, Wessam Abbas, Yonglong Luo, Fulong Chen, Vincent S. Tseng
Assessing Death Risk of Patients with Cardiovascular Disease from Long-Term Electrocardiogram Streams Summarization

Cardiovascular disease (CVD) is the leading cause of death around the world. Researches on assessing patients death risk from Electrocardiographic (ECG) data has attracted increasing attention recently. In this paper, we summarize long-term overwhelming ECG data using morphological concern of overall evolution. And then assessing patients death risk from high value density ECG summarization instead of raw data. Our method is totally unsupervised without the help of expert knowledge. Moreover, it can assist in clinical practice without any additional burden like buy new devices or add more caregivers. Comprehensive results show effectiveness of our method.

Shenda Hong, Meng Wu, Jinbo Zhang, Hongyan Li

Spatio-Temporal and Sequential Data Mining

Frontmatter
On the Robustness of Decision Tree Learning Under Label Noise

In most practical problems of classifier learning, the training data suffers from label noise. Most theoretical results on robustness to label noise involve either estimation of noise rates or non-convex optimization. Further, none of these results are applicable to standard decision tree learning algorithms. This paper presents some theoretical analysis to show that, under some assumptions, many popular decision tree learning algorithms are inherently robust to label noise. We also present some sample complexity results which provide some bounds on the sample size for the robustness to hold with a high probability. Through extensive simulations we illustrate this robustness.

Aritra Ghosh, Naresh Manwani, P. S. Sastry
Relevance-Based Evaluation Metrics for Multi-class Imbalanced Domains

The class imbalance problem is a key issue that has received much attention. This attention has been mostly focused on two-classes problems. Fewer solutions exist for the multi-classes imbalance problem. From an evaluation point of view, the class imbalance problem is challenging because a non-uniform importance is assigned to the classes. In this paper, we propose a relevance-based evaluation framework that incorporates user preferences by allowing the assignment of differentiated importance values to each class. The presented solution is able to overcome difficulties detected in existing measures and increases discrimination capability. The proposed framework requires the assignment of a relevance score to the problem classes. To deal with cases where the user is not able to specify each class relevance, we describe three mechanisms to incorporate the existing domain knowledge into the relevance framework. These mechanisms differ in the amount of information available and assumptions made regarding the domain. They also allow the use of our framework in common settings of multi-class imbalanced problems with different levels of information available.

Paula Branco, Luís Torgo, Rita P. Ribeiro
Location Prediction Through Activity Purpose: Integrating Temporal and Sequential Models

Based on the growing popularity of smart mobile devices, location-aware services become indispensable in human daily life. Location prediction makes these services more intelligent and attractive. However, due to the limited energy of mobile devices and privacy issues, the captured mobility data is typically sparse. This inherent challenge deteriorates significant principles in mobility modeling, i.e. temporal regularity and sequential dependency. To tackle these challenges, by utilizing temporal regularity and sequential dependency, we present a location prediction model with a two-stage fashion. Firstly, it extracts predictive features to effectively target the better performer from sequential and temporal models. Secondly, according to the inferred activity, it adopts non-parametric Kernel Density Estimation for posterior location prediction. Extensive experiments on two public check-in datasets demonstrate that the proposed model outperforms state-of-the-art baselines by 10.1% for activity prediction and 12.9% for location prediction.

Dongliang Liao, Yuan Zhong, Jing Li
Modeling Temporal Behavior of Awards Effect on Viewership of Movies

The “rich get richer” effect is well-known in recommendation system. Popular items are recommended more, then purchased more, resulting in becoming even more popular over time. For example, we observe in Netflix data that awarded movies are more popular than non-awarded movies. Unlike other work focusing on making fair/neutralized recommendation, in this paper, we target on modeling the effect of awards on the viewership of movies. The main challenge of building such a model is that the effect on popularity changes over time with different intensity from movie to movie. Our proposed approach explicitly models the award effects for each movie and enables the recommendation system to provide a better ranked list of recommended movies. The results of an extensive empirical validation on Netflix and MovieLens data demonstrate the effectiveness of our model.

Basmah Altaf, Faisal Kamiran, Xiangliang Zhang
A Physarum-Inspired Ant Colony Optimization for Community Mining

Community mining is a powerful tool for discovering the knowledge of networks and has a wide application. The modularity is one of very popular measurements for evaluating the efficiency of community divisions. However, the modularity maximization is a NP-complete problem. As an effective optimization algorithm for solving NP-complete problems, ant colony based community detection algorithm has been proposed to deal with such task. However the low accuracy and premature still limit its performance. Aiming to overcome those shortcomings, this paper proposes a novel nature-inspired optimization for the community mining based on the Physarum, a kind of slime molds cells. In the proposed strategy, the Physarum-inspired model optimizes the heuristic factor of ant colony algorithm by endowing edges with weights. With the information of weights provided by the Physarum-inspired model, the optimized heuristic factor can improve the searching abilities of ant colony algorithms. Four real-world networks and two typical kinds of ant colony optimization algorithms are used for estimating the efficiency of proposed strategy. Experiments show that the optimized ant colony optimization algorithms can achieve a better performance in terms of robustness and accuracy with a lower computational cost.

Mingxin Liang, Chao Gao, Xianghua Li, Zili Zhang
A Novel Diversity Measure for Understanding Movie Ranks in Movie Collaboration Networks

We are interested in the relationship between the team composition and the outcome in the filmmaking process. We studied the “diversity” of the group of actors and directors and how it is related to the movie rank given by the audience. The “diversity” is considered as the representation of the degree of variety based on the possibilities of collaborations among its actors and directors. Their collaboration network for the movie was first generated from the “background” network of the collaborations from other works. Then a shortest-path method together with the Adamic/Adar method are used to form indirect links. Finally the “complete” collaboration network can be generated and the “diversity” measures are thus defined accordingly. We experimented on the France and Germany datasets and identified consistent patterns: the lower the “diversity” is, the lower the movie rank will be. We also demonstrated that a subset of our diversity measures were effective in the binary classification task for movie ranks, while the advantages are prone to Precision/Recall depending on the specific dataset. This further shows that the “diversity” measure is feasible and effective in distinguishing movie ranks.

Manqing Ma, Wei Pang, Lan Huang, Zhe Wang
Local-to-Global Unsupervised Anomaly Detection from Temporal Data

Anomaly detection for temporal data has received much attention by many real-world applications. Most existing unsupervised methods dealing with this task are based on a sequential two-way approach (clustering and detection). Because of this, the clustering is less robust to anomalous series in data which distorts the detection step. Thus, to overcome this problem, we propose an embedded technique simultaneously dealing with both methods. We reformulate the task of anomaly detection as a local-weighting-instance clustering problem. The anomalous series are detected locally in each cluster as well as globally in the data, as a whole. Extensive experiments on benchmark datasets are carried out to validate our approach and compare it with other state-of-the-art methods of detection.

Seif-Eddine Benkabou, Khalid Benabdeslem, Bruno Canitia
Mining Temporal Fluctuating Patterns

In this paper, we explore a new mining paradigm, called Temporal Fluctuating Patterns (abbreviated as TFP), to discover potentially fluctuating and useful feature sets from temporal data. These feature sets have some properties which are variant through time series. Once TFPs are discovered, we can find the turning points of patterns, which enables anomaly detection and transformation discovery over time. For example, the discovery of TFPs can possibly figure out the phenomenon of virus variation during the epidemic outbreak, further providing the government the clue for the epidemic control. However, previous work on mining temporal data computes frequent sets iteratively for different time periods, which is time-consuming. We, therefore, develop a union-based mining structure to speed up the mining process and dynamically compute the fluctuations of patterns through time series. As shown in our experimental studies, the proposed framework can efficiently discover TFPs on a real epidemic disease dataset, showing its prominent advantages to be utilized in real applications.

Shan-Yun Teng, Cheng-Kuan Ou, Kun-Ta Chuang
Marked Temporal Dynamics Modeling Based on Recurrent Neural Network

We are now witnessing the increasing availability of event stream data, i.e., a sequence of events with each event typically being denoted by the time it occurs and its mark information (e.g., event type). A fundamental problem is to model and predict such kind of marked temporal dynamics, i.e., when the next event will take place and what its mark will be. Existing methods either predict only the mark or the time of the next event, or predict both of them, yet separately. Indeed, in marked temporal dynamics, the time and the mark of the next event are highly dependent on each other, requiring a method that could simultaneously predict both of them. To tackle this problem, in this paper, we propose to model marked temporal dynamics by using a mark-specific intensity function to explicitly capture the dependency between the mark and the time of the next event. Experiments on two datasets demonstrate that the proposed method outperforms the state-of-the-art methods at predicting marked temporal dynamics.

Yongqing Wang, Shenghua Liu, Huawei Shen, Jinhua Gao, Xueqi Cheng
Mining of Location-Based Social Networks for Spatio-Temporal Social Influence

Following the advent of location-based social networks (LBSNs), location-aware services have attracted considerable attention among researchers. Research has shown that the social network is regarded as one of the strongest influences shaping individual attitudes and behaviors. This paper targets the mining of location-based social influences hidden in LBSNs. In other words, we sought to determine whether an individual’s check-in behavior is influenced by friends’ check-ins. Check-in data includes positional information; therefore, we refer to this type of influence as spatiotemporal social influences. This study proposes a framework for spatiotemporal social influence mining (ST-SIM) to identify users with the greatest influence on individuals (i.e., close friends and travel experts) from an LBSN and estimate the strength of these social connections. Explicitly, the proposed framework is able to infer a list of influential users of an individual under given conditions based on travel distance, visiting time or POI categories. We developed a diffusion-based mechanism for modeling the propagation of influence over time. Our experiment results demonstrate that the ST-SIM framework outperforms state-of-the-art methods in terms of accuracy and reliability, and is applicable in domains ranging from marketing to intelligence analysis.

Yu-Ting Wen, Yi Yuan Fan, Wen-Chih Peng
Multi-perspective Hierarchical Dirichlet Process for Geographical Topic Modeling

The pervasion of location acquisition technology has strongly propelled the popularity of geo-tagged user-generated content (UGC), which also raises new computational possibility for investigating geographical topics and users’ spatial behaviors. This paper proposes a novel method for geographical topic modeling by combining text content with user information and spatial knowledge. Topics are estimated as the interests of users and features of locations. The joint modeling of the three heterogeneous sources (1) leads to high accuracy in predicting visit behaviors driven by personal interests, (2) discovers coherent topic representations for topic modeling, (3) enables the recommender system to suggest interpretable locations. Our framework is flexible to incorporate new dimensions of data such as temporal information without substantially changing the model structure. We also experimentally demonstrate the limitations of the traditional assumption that a topic is selected considerably dependent on the location. In many cases, the published topics are mainly affected by the user’s interests rather than the current location. Our model discriminates these two scenarios. Through employing hierarchical Dirichlet process, we also need not predefine the number of topics like other mixture models. Experiments on three different datasets show that our model is effective in discovering spatial topics and significantly outperforms the state of the art.

Yuan He, Cheng Wang, Changjun Jiang
Enumerating Non-redundant Association Rules Using Satisfiability

Discovering association rules from transaction databases is a well studied data mining task. Many effective techniques have been proposed over the years. However, due to the huge size of the output, many works have tackled the problem of mining a smaller and relevant set of rules. In this paper, we address the problem of enumerating the minimal non-redundant association rules, widely considered as one of the most relevant variant. We first provide its encoding as a propositional formula whose models correspond to the minimal non redundant rules. Then we show that the set of minimal generators used for extracting non-redundant rules can also be encoded in this framework. Experiments on many datasets show that our approach achieves better performance with respect to the state-of-the-art specialized techniques.

Abdelhamid Boudane, Said Jabbour, Lakhdar Sais, Yakoub Salhi
Backmatter
Metadata
Title
Advances in Knowledge Discovery and Data Mining
Editors
Jinho Kim
Kyuseok Shim
Longbing Cao
Jae-Gil Lee
Xuemin Lin
Yang-Sae Moon
Copyright Year
2017
Electronic ISBN
978-3-319-57454-7
Print ISBN
978-3-319-57453-0
DOI
https://doi.org/10.1007/978-3-319-57454-7

Premium Partner