2014 | Book

Advances in Knowledge Discovery and Data Mining

18th Pacific-Asia Conference, PAKDD 2014, Tainan, Taiwan, May 13-16, 2014. Proceedings, Part I

Editors: Vincent S. Tseng, Tu Bao Ho, Zhi-Hua Zhou, Arbee L. P. Chen, Hung-Yu Kao

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science

About this book

The two-volume set LNAI 8443 + LNAI 8444 constitutes the refereed proceedings of the 18th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2014, held in Tainan, Taiwan, in May 2014. The 40 full papers and the 60 short papers presented within these proceedings were carefully reviewed and selected from 371 submissions. They cover the general fields of pattern mining; social network and social media; classification; graph and network mining; applications; privacy preserving; recommendation; feature selection and reduction; machine learning; temporal and spatial data; novel algorithms; clustering; biomedical data mining; stream mining; outlier and anomaly detection; multi-sources mining; and unstructured data and text mining.

Table of Contents

Frontmatter

Pattern Mining

MalSpot: Multi² Malicious Network Behavior Patterns Analysis

What are the patterns that typical network attackers exhibit? For a given malicious network behavior, are its attacks spread uniformly over time? In this work, we develop MalSpot, a multi-resolution and multi-linear (Multi²) network analysis system, in order to discover such malicious patterns, so that we can use them later for attack detection when attacks are concurrent with legitimate traffic. We designed and deployed MalSpot, which employs multi-linear analysis with different time resolutions, running on top of MapReduce (Hadoop), and we identify patterns across attackers, attacked institutions, and variations of time scale. We collected over a terabyte of proven malicious traces (along with benign ones) from the Taiwanese government security operation center (G-SOC), spanning the entire year of 2012. We showcase the effectiveness of MalSpot by discovering interesting patterns and anomalies in this enormous dataset. We observe static and time-evolving patterns that the vast majority of the known malicious behavior seems to follow.

Hing-Hao Mao, Chung-Jung Wu, Evangelos E. Papalexakis, Christos Faloutsos, Kuo-Chen Lee, Tien-Cheu Kao
Extracting Diverse Patterns with Unbalanced Concept Hierarchy

The process of frequent pattern extraction finds interesting information about the association among the items in a transactional database. The notion of support is employed to extract the frequent patterns. Normally, in a given domain, a set of items can be grouped into a category, and a pattern may contain items which belong to multiple categories. In several applications, it may be useful to distinguish between patterns whose items belong to multiple categories and patterns whose items belong to one or a few categories. The notion of diversity captures the extent to which the items in a pattern belong to multiple categories. The items and the categories form a concept hierarchy. In the literature, an approach has been proposed to rank the patterns by considering a balanced concept hierarchy. In real-life scenarios, however, concept hierarchies are normally unbalanced. In this paper, we propose a general approach to calculate the diversity-based rank, called drank, by considering the unbalanced concept hierarchy. The experimental results show that the patterns ordered by drank differ from the patterns ordered by support, and that the proposed approach can assign the drank to different kinds of unbalanced patterns.

M. Kumara Swamy, P. Krishna Reddy, Somya Srivastava
Efficiently Depth-First Minimal Pattern Mining

Condensed representations have been studied extensively for 15 years. In particular, the maximal patterns of the equivalence classes have received much attention, with very general proposals. In contrast, minimal patterns have remained in the shadows, in particular because they are difficult to extract. In this paper, we present a generic framework for minimal pattern mining by introducing the concept of a minimizable set system. This framework addresses various languages, such as itemsets or strings, and at the same time different metrics, such as frequency. For instance, free and essential patterns are naturally handled by our approach, just as minimal strings are. Then, for any minimizable set system, we introduce a fast minimality check that is easy to incorporate into a depth-first search algorithm for mining the minimal patterns. We demonstrate that it is polynomial-delay and polynomial-space. Experiments on traditional benchmarks complete our study.

Arnaud Soulet, François Rioult
Fast Vertical Mining of Sequential Patterns Using Co-occurrence Information

Sequential pattern mining algorithms using a vertical representation are the most efficient for mining sequential patterns in dense or long sequences, and have excellent overall performance. The vertical representation allows generating patterns and calculating their supports without performing costly database scans. However, a crucial performance bottleneck of vertical algorithms is that they use a generate-candidate-and-test approach that can generate a large number of infrequent candidates. To address this issue, we propose pruning candidates based on the study of item co-occurrences. We present a new structure named CMAP (Co-occurrence MAP) for storing co-occurrence information. We explain how CMAP can be used to prune candidates in three state-of-the-art vertical algorithms, namely SPADE, SPAM and ClaSP. An extensive experimental study with six real-life datasets shows that (1) co-occurrence-based pruning is effective, (2) CMAP is very compact, and (3) the resulting algorithms outperform state-of-the-art algorithms for mining sequential patterns (GSP, PrefixSpan, SPADE and SPAM) and closed sequential patterns (ClaSP and CloSpan).

Philippe Fournier-Viger, Antonio Gomariz, Manuel Campos, Rincy Thomas
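
As a rough illustration of the co-occurrence pruning idea described in the abstract above, the following Python sketch builds a CMAP-like structure from hypothetical toy data (single items per itemset, no vertical id-lists) and uses it to reject an extension candidate. The real CMAP integrates with SPADE, SPAM and ClaSP; this only shows the pruning test itself.

```python
from collections import defaultdict

def build_cmap(sequences, minsup):
    """For each item a, record the items b that occur after a in at least
    minsup sequences (a simplified co-occurrence map)."""
    pair_sids = defaultdict(set)            # (a, b) -> ids of supporting sequences
    for sid, seq in enumerate(sequences):
        seen = set()
        for item in seq:
            for a in seen:
                pair_sids[(a, item)].add(sid)
            seen.add(item)
    cmap = defaultdict(set)
    for (a, b), sids in pair_sids.items():
        if len(sids) >= minsup:
            cmap[a].add(b)
    return cmap

def can_extend(cmap, pattern, candidate_item):
    """Prune rule: if the candidate never co-occurs after the pattern's last
    item often enough, the extended pattern cannot be frequent."""
    return candidate_item in cmap[pattern[-1]]

# toy usage (hypothetical sequence database)
db = [["a", "b", "c"], ["a", "c"], ["a", "b"]]
cmap = build_cmap(db, minsup=2)
print(can_extend(cmap, ["a"], "c"))   # True: (a, c) is supported by 2 sequences
```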

Social Network and Social Media

Balancing the Analysis of Frequent Patterns

A main challenge in pattern mining is to focus the discovery on high-quality patterns. One popular solution is to compute a numerical score of how well each discovered pattern describes the data. The highest-rated patterns are then the ones most analyzed by the data expert. In this paper, we evaluate the quality of discovered patterns by anticipating how the user analyzes them. We show that examining frequent patterns with the notion of support leads to an unbalanced analysis of the dataset: certain transactions are completely ignored. Hence, we propose the notion of balanced support, which weights the transactions so that each of them receives user-specified attention. We also develop an algorithm, Absolute, for calculating these weights, which are then used to evaluate the quality of patterns. Our experiments on frequent itemsets validate its effectiveness and show the relevance of the balanced support.

Arnaud Giacometti, Dominique H. Li, Arnaud Soulet
Balanced Seed Selection for Budgeted Influence Maximization in Social Networks

Given a budget and a network where different nodes have different costs to be selected, budgeted influence maximization selects seeds within the budget so that the number of finally influenced nodes is maximized. In this paper, we propose three strategies to solve this problem. First, the Billboard strategy chooses the most influential nodes as the seeds. Second, the Handbill strategy chooses the most cost-effective nodes as the seeds. Finally, the Combination strategy chooses the "best" seeds from the two "better" seed sets obtained from the former two strategies. Experiments show that the Billboard and Handbill strategies obtain good solutions efficiently. The Combination strategy matches or exceeds the best competing algorithm in terms of both accuracy and efficiency, and it is more balanced than the state-of-the-art algorithms.

Shuo Han, Fuzhen Zhuang, Qing He, Zhongzhi Shi
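
A minimal sketch of the Handbill idea from the abstract above: greedily pick the most cost-effective seeds that still fit the remaining budget. The per-node influence estimates and costs below are hypothetical placeholders; the paper derives influence from the network itself.

```python
def handbill_selection(nodes, influence, cost, budget):
    """Greedy cost-effective seed selection: repeatedly take the affordable
    node with the highest estimated influence per unit cost."""
    seeds, remaining = [], budget
    candidates = set(nodes)
    while candidates:
        best = max(
            (n for n in candidates if cost[n] <= remaining),
            key=lambda n: influence[n] / cost[n],
            default=None,
        )
        if best is None:          # nothing affordable is left
            break
        seeds.append(best)
        remaining -= cost[best]
        candidates.remove(best)
    return seeds

# hypothetical estimates
nodes = ["u1", "u2", "u3"]
influence = {"u1": 10.0, "u2": 4.0, "u3": 3.5}
cost = {"u1": 8.0, "u2": 2.0, "u3": 1.5}
print(handbill_selection(nodes, influence, cost, budget=4.0))  # ['u3', 'u2']
```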
Persistent Community Detection in Dynamic Social Networks

While community detection is an active area of research in social network analysis, little effort has been devoted to community detection using time-evolving social network data. We propose an algorithm, Persistent Community Detection (PCD), to identify in such settings those communities that exhibit persistent behavior over time. Our motivation is to distinguish between steady-state network activity and impermanent behavior such as cascades caused by a noteworthy event. The results of extensive empirical experiments on large real-life social network data show that our algorithm performs much better than a set of baseline methods, including two alternative models and the state-of-the-art.

Siyuan Liu, Shuhui Wang, Ramayya Krishnan
Detecting and Analyzing Influenza Epidemics with Social Media in China

In recent years, social media has become important and omnipresent for social networking and information sharing. Researchers and scientists have begun to mine social media data to predict a variety of social, economic, health and entertainment related real-world phenomena. In this paper, we show how social media data can be used to detect and analyze real-world phenomena with several data mining techniques. Specifically, we use posts from TencentWeibo to detect influenza and analyze influenza trends. We build a support vector machine (SVM) based classifier to classify influenza posts. In addition, we use association rule mining to extract strongly associated features as additional features of posts, to overcome the limitation of 140 words per post. We also use sentiment analysis to classify reposts without features and uncommented reposts. The experimental results show that by combining these techniques, we can improve the precision and recall by at least ten percent. Finally, we analyze the spatial and temporal patterns of positive influenza posts and indicate when and where an influenza epidemic is more likely to occur.

Fang Zhang, Jun Luo, Chao Li, Xin Wang, Zhongying Zhao
Analyzing Location Predictability on Location-Based Social Networks

With the growing popularity of location-based social networks, vast amounts of user check-in history have been accumulated. Based on such historical data, predicting a user's next check-in place has recently attracted much interest. There is, however, little study on the limit of predictability of this task and its correlation with users' demographics. Such studies can give deeper insight into the prediction task and bring valuable guidance to the design of new prediction algorithms. In this paper, we carry out a thorough study of the limit of check-in location predictability, i.e., to what extent the next locations are predictable, in the presence of the special properties of check-in traces. Specifically, we begin by estimating the entropy of an individual check-in trace and then leverage Fano's inequality to transform it into predictability. Extensive analysis has then been performed on two large-scale check-in datasets from Jiepang and Gowalla, with 36M and 6M check-ins, respectively. As a result, we find 25% and 38% potential predictability, respectively. Finally, correlation analysis between predictability and users' demographics has been performed. The results show that demographics, such as gender and age, are significantly correlated with location predictability.

Defu Lian, Yin Zhu, Xing Xie, Enhong Chen
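
The entropy-plus-Fano recipe mentioned in the abstract above can be sketched as follows. This assumes a plain empirical-entropy estimate (the paper uses more refined estimators suited to check-in traces) and solves Fano's bound numerically by bisection; the toy trace is hypothetical.

```python
import math
from collections import Counter

def trace_entropy(trace):
    """Shannon entropy (bits) of the empirical location distribution."""
    counts = Counter(trace)
    n = len(trace)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def max_predictability(entropy, n_locations, tol=1e-6):
    """Solve Fano's bound  S = H(p) + (1 - p) * log2(N - 1)  for the largest
    achievable prediction accuracy p in [1/N, 1]."""
    if n_locations < 2:
        return 1.0
    def fano(p):
        h = 0.0
        for q in (p, 1 - p):
            if 0 < q < 1:
                h -= q * math.log2(q)
        return h + (1 - p) * math.log2(n_locations - 1)
    lo, hi = 1.0 / n_locations, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if fano(mid) > entropy:
            lo = mid            # bound still above the entropy: p can grow
        else:
            hi = mid
    return (lo + hi) / 2

trace = ["home", "work", "home", "gym", "home", "work"]   # hypothetical check-ins
s = trace_entropy(trace)
print(round(max_predictability(s, n_locations=3), 3))
```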
ReadBehavior: Reading Probabilities Modeling of Tweets via the Users’ Retweeting Behaviors

Along with Twitter's tremendous growth, studying users' behaviors, such as retweeting behavior, has become an interesting research issue. In the literature, researchers usually assumed that a Twitter user could catch up with all the tweets posted by his/her friends. This is untrue most of the time. Intuitively, modeling the reading probability of each tweet is of practical importance in various applications, such as social influence analysis. In this paper, we propose a ReadBehavior model to measure the probability that a user reads a specific tweet. The model is based on the user's retweeting behaviors and the correlation between tweets' posting times and retweeting times. To illustrate the effectiveness of the proposed model, we develop a PageRank-like algorithm to find influential users. The experimental results show that the algorithm based on ReadBehavior outperforms other related algorithms, which indicates the effectiveness of the proposed model.

Jianguang Du, Dandan Song, Lejian Liao, Xin Li, Li Liu, Guoqiang Li, Guanguo Gao, Guiying Wu
Inferring Strange Behavior from Connectivity Pattern in Social Networks

Given a multimillion-node social network, how can we summarize its connectivity patterns, and how can we find unexpected user behavior? In this paper we study a complete graph from a large who-follows-whom network and spot lockstep behavior, in which large groups of followers connect to the same groups of followees. Our first contribution is a study of strange patterns in the adjacency matrix and in the spectral subspaces with respect to several flavors of lockstep. We discover that (a) the lockstep behavior on the graph shapes dense "blocks" in its adjacency matrix and creates "rays" in spectral subspaces, and (b) partial overlap of the behavior shapes "staircases" in the matrix and creates "pearls" in the subspaces. The second contribution is a fast algorithm that uses these discoveries as a guide for practitioners to detect users who exhibit the lockstep behavior. We demonstrate that our approach is effective on both synthetic and real data.

Meng Jiang, Peng Cui, Alex Beutel, Christos Faloutsos, Shiqiang Yang
Inferring Attitude in Online Social Networks Based on Quadratic Correlation

The structure of an online social network in most cases cannot be described just by the links between its members. We study online social networks in which members may have a certain attitude, positive or negative, toward each other, so that the network consists of a mixture of both positive and negative relationships. Our goal is to predict the sign of a given relationship based on the evidence provided in the current snapshot of the network. More precisely, using machine learning techniques we develop a model that, after being trained on a particular network, predicts the sign of an unknown or hidden link. The model uses relationships and influences from peers as evidence for the guess; however, the set of peers used is not predefined but rather learned during the training process. We use quadratic correlation between peer members to train the predictor. The model is tested on popular online datasets such as Epinions, Slashdot, and Wikipedia. In many cases it shows almost perfect prediction accuracy. Moreover, our model can also be efficiently updated as the underlying social network evolves.

Cong Wang, Andrei A. Bulatov
Hash-Based Stream LDA: Topic Modeling in Social Streams

We study the problem of topic modeling in continuous social media streams and propose a new generative probabilistic model called Hash-Based Stream LDA (HS-LDA), which is a generalization of the popular LDA approach. The model differs from LDA in that it exposes facilities to include inter-document similarity in topic modeling. The corresponding inference algorithm outlined in the paper relies on efficient estimation of document similarity with Locality Sensitive Hashing to retain the knowledge of past social discourse in a scalable way. The historical knowledge of previous messages is used in inference to improve the quality of topic discovery. Performance of the new algorithm was evaluated against the classical LDA approach, as well as the stream-oriented On-line LDA and SparseLDA, using datasets collected from the Twitter microblog system and an IRC chat community. Experimental results showed that HS-LDA outperformed the other techniques by more than 12% on the Twitter dataset and by 21% on the IRC data in terms of average perplexity.

Anton Slutsky, Xiaohua Hu, Yuan An
Two Sides of a Coin: Separating Personal Communication and Public Dissemination Accounts in Twitter

There are millions of accounts in Twitter. In this paper, we categorize Twitter accounts into two types, namely Personal Communication Accounts (PCAs) and Public Dissemination Accounts (PDAs). PCAs are accounts operated by individuals and are used to express that individual's thoughts and feelings. PDAs, on the other hand, refer to accounts owned by non-individuals such as companies, governments, etc. Generally, tweets in PDAs (i) disseminate a specific type of information (e.g., job openings, shopping deals, car accidents) rather than sharing an individual's personal life; and (ii) may be produced by non-human entities (e.g., bots). We aim to develop techniques for identifying PDAs so as to (i) help social scientists reduce "noise" in their studies of human behavior, and (ii) index them for potential recommendation to users looking for specific types of information. Through analysis, we find that these two types of accounts follow different temporal, spatial and textual patterns. Accordingly, we develop probabilistic models based on these features to identify PDAs. We also conduct a series of experiments to evaluate those algorithms for cleaning the Twitter data stream.

Peifeng Yin, Nilam Ram, Wang-Chien Lee, Conrad Tucker, Shashank Khandelwal, Marcel Salathé
Latent Features Based Prediction on New Users’ Tastes

Recommendation systems have become popular in recent years. A key challenge in such systems is how to effectively characterize new users’ tastes — an issue that is generally known as the cold-start problem. New users judge the system by the ability to immediately provide them with what they consider interesting. A general method for solving the cold-start problem is to elicit information about new users by having them provide answers to interview questions. In this paper, we present Matrix Factorization K-Means (MFK), a novel method to solve the problem of interview question construction. MFK first learns the latent features of the user and the item through observed rating data and then determines the best interview questions based on the clusters of latent features. We can determine similar groups of users after obtaining the responses to the interview questions. Such recommendation systems can indicate new users’ tastes according to their responses to the interview questions. In our experiments, we evaluate our methods using a public dataset for recommendations. The results show that our method leads to better performance than other baselines.

Ming-Chu Chen, Hung-Yu Kao

Classification

Supervised Nonlinear Factorizations Excel In Semi-supervised Regression

Semi-supervised learning is an eminent domain of machine learning focusing on real-life problems where labeled data instances are scarce. This paper innovatively extends existing factorization models into a supervised nonlinear factorization. The current state-of-the-art methods for semi-supervised regression are based on supervised manifold regularization. In contrast, the latent data constructed by the proposed method jointly reconstruct both the observed predictors and target variables via generative-style nonlinear functions. Dual-form solutions of the nonlinear functions and a stochastic gradient descent technique which learns the low-dimensional data are introduced. The validity of our method is demonstrated in a series of experiments against five state-of-the-art baselines, clearly improving the prediction accuracy on eleven real-life data sets.

Josif Grabocka, Erind Bedalli, Lars Schmidt-Thieme
Learning from Crowds under Experts’ Supervision

Crowdsourcing services have been proven efficient in collecting large amounts of labeled data for supervised learning, but the low cost of crowd workers leads to unreliable labels. Various methods have been proposed to infer the ground truth or to learn from crowd data directly; however, there is no guarantee that these methods work well for highly biased or noisy crowd labels. Motivated by this limitation of crowd data, we propose to improve the performance of crowdsourcing learning tasks with some additional expert labels, by treating each labeler as a personal classifier and combining all labelers' opinions from a model combination perspective. Experiments show that our method can significantly improve the learning quality compared with methods that use crowd labels alone.

Qingyang Hu, Qinming He, Hao Huang, Kevin Chiew, Zhenguang Liu
A Robust Classifier for Imbalanced Datasets

Imbalanced dataset classification is a challenging problem, since many classifiers are sensitive to the class distribution, so that their predictions are biased towards the majority class. Hellinger Distance has been proven to be skew-insensitive, and decision trees that employ it as a splitting criterion have shown better performance than decision trees based on Information Gain. We propose a new decision tree induction classifier (HeDEx) based on Hellinger Distance, which is a randomized ensemble of trees selecting both the attribute and the split-point at random. We also propose hyperplanes as decision surfaces for HeDEx to improve the performance. A new pattern-based oversampling method is also proposed in this paper to reduce the bias towards the majority class. The patterns are detected from HeDEx, and the newly generated instances are applied after a verification process using Hellinger Distance Decision Trees. Our experiments show that the proposed methods improve performance on imbalanced datasets over the state-of-the-art Hellinger Distance Decision Trees.

Sori Kang, Kotagiri Ramamohanarao
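
For reference, the Hellinger-distance splitting criterion that HeDEx-style trees rely on can be written in a few lines; the split counts below are hypothetical and only illustrate why the measure is skew-insensitive.

```python
import math

def hellinger_split_score(left_pos, left_neg, right_pos, right_neg):
    """Hellinger distance between the per-class distributions induced by a
    binary split (the HDDT-style splitting criterion); larger is better.
    Normalizing by the class totals is what makes it skew-insensitive."""
    tot_pos = left_pos + right_pos
    tot_neg = left_neg + right_neg
    if tot_pos == 0 or tot_neg == 0:
        return 0.0
    d = 0.0
    for pos, neg in ((left_pos, left_neg), (right_pos, right_neg)):
        d += (math.sqrt(pos / tot_pos) - math.sqrt(neg / tot_neg)) ** 2
    return math.sqrt(d)

# a split that isolates most minority examples scores high even though the
# classes are heavily imbalanced (50 positives vs. 1000 negatives)
print(hellinger_split_score(left_pos=45, left_neg=100, right_pos=5, right_neg=900))
```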
Signed-Error Conformal Regression

This paper suggests a modification of the Conformal Prediction framework for regression that will strengthen the associated guarantee of validity. We motivate the need for this modification and argue that our conformal regressors are more closely tied to the actual error distribution of the underlying model, thus allowing for more natural interpretations of the prediction intervals. In the experiments, we provide an empirical comparison of our conformal regressors to traditional conformal regressors and show that the proposed modification results in more robust two-tailed predictions and more efficient one-tailed predictions.

Henrik Linusson, Ulf Johansson, Tuve Löfström
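
A hedged sketch of the signed-error idea in a split-conformal setting: the lower and upper tails are calibrated separately from signed residuals, so the interval follows the model's actual error distribution. This only illustrates the principle with hypothetical calibration data; it is not the authors' exact procedure or its validity guarantee.

```python
import numpy as np

def signed_error_interval(cal_y, cal_pred, new_pred, alpha=0.1):
    """Build a prediction interval from *signed* calibration errors:
    asymmetric errors give an asymmetric interval around the new prediction."""
    errors = np.asarray(cal_y) - np.asarray(cal_pred)   # signed residuals
    lo = np.quantile(errors, alpha / 2)
    hi = np.quantile(errors, 1 - alpha / 2)
    return new_pred + lo, new_pred + hi

# hypothetical calibration targets and predictions from any regression model
cal_y = np.array([10.2, 9.8, 11.5, 10.9, 9.1, 10.4])
cal_pred = np.array([10.0, 10.1, 11.0, 10.5, 9.5, 10.2])
print(signed_error_interval(cal_y, cal_pred, new_pred=12.0))
```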
Multi-Instance Learning from Positive and Unlabeled Bags

Many methods exist to solve multi-instance learning by using different mechanisms, but all these methods require that both positive and negative bags are provided for learning. In reality, applications may only have positive samples to describe users' learning interests, and the remaining samples are unlabeled (which may be positive, negative, or irrelevant to the underlying learning task). In this paper, we formulate this problem as positive and unlabeled multi-instance learning (puMIL). The main challenge of puMIL is to accurately identify negative bags for training discriminative classification models. To solve the challenge, we assign a weight value to each bag, and use an Artificial Immune System based self-adaptive process to select the most reliable negative bags in each iteration. For each bag, a most positive instance (for a positive bag) or a least negative instance (for an identified negative bag) is selected to form a positive margin pool (PMP). A weighted kernel function is used to calculate pairwise distances between instances in the PMP, with the distance matrix being used to learn a support vector machines classifier. A test bag is classified as positive if one or multiple instances inside the bag are classified as positive, and negative otherwise. Experiments on real-world data demonstrate the algorithm's performance.

Jia Wu, Xingquan Zhu, Chengqi Zhang, Zhihua Cai
Mining Contrast Subspaces

In this paper, we tackle a novel problem of mining contrast subspaces. Given a set of multidimensional objects in two classes C+ and C− and a query object o, we want to find the top-k subspaces S that maximize the ratio of the likelihood of o in C+ against that in C−. We demonstrate that this problem has important applications and, at the same time, is very challenging: it does not even allow polynomial-time approximation. We present CSMiner, a mining method with various pruning techniques. CSMiner is substantially faster than the baseline method. Our experimental results on real data sets verify the effectiveness and efficiency of our method.

Lei Duan, Guanting Tang, Jian Pei, James Bailey, Guozhu Dong, Akiko Campbell, Changjie Tang
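
The likelihood-ratio objective from the abstract above can be illustrated by brute-force scoring of small subspaces with Gaussian kernel density estimates. CSMiner's contribution is the pruning that makes the search tractable, which this sketch omits; all data and the bandwidth below are hypothetical.

```python
import itertools
import math

def kde_likelihood(point, data, dims, bandwidth=1.0):
    """Gaussian kernel density estimate of `point` restricted to dimensions `dims`."""
    total = 0.0
    for x in data:
        d2 = sum((point[i] - x[i]) ** 2 for i in dims)
        total += math.exp(-d2 / (2 * bandwidth ** 2))
    norm = len(data) * (bandwidth * math.sqrt(2 * math.pi)) ** len(dims)
    return total / norm

def contrast_subspaces(query, pos, neg, n_dims, top_k=3, max_size=2):
    """Score every small subspace by L(query | C+) / L(query | C-) and
    return the top-k (brute force, no pruning)."""
    scored = []
    for size in range(1, max_size + 1):
        for dims in itertools.combinations(range(n_dims), size):
            ratio = kde_likelihood(query, pos, dims) / max(
                kde_likelihood(query, neg, dims), 1e-12)
            scored.append((ratio, dims))
    return sorted(scored, reverse=True)[:top_k]

pos = [(1.0, 5.0), (1.2, 4.8), (0.9, 5.1)]    # hypothetical C+ objects
neg = [(4.0, 5.0), (4.2, 5.2), (3.9, 4.9)]    # hypothetical C- objects
print(contrast_subspaces((1.1, 5.0), pos, neg, n_dims=2))
```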
Forward Classification on Data Streams

In this paper, we explore a new research problem of predicting an incoming classifier on dynamic data streams, named forward classification. The state-of-the-art classification models on data streams, such as the incremental and ensemble models, fall into the retrospective classification category, where the models used for classification are built from past observed stream data and constantly lag behind the incoming unobserved test data. As a result, the classification model and the test data are temporally inconsistent, leading to severe performance deterioration when the concept (joint probability distribution) evolves rapidly. To this end, we propose a new forward classification method which aims to build a classification model that fits the current data. Specifically, forward classification first predicts the incoming classifier based on a line of recent classifiers, and then uses the predicted classifier to classify the current data chunk. A learning framework which can adaptively switch between forward classification and retrospective classification is also proposed. Empirical studies on both synthetic and real-world data streams demonstrate the utility of the proposed method.

Peng Wang, Peng Zhang, Yanan Cao, Li Guo, Bingxing Fang
Shingled Graph Disassembly: Finding the Undecideable Path

A probabilistic finite state machine approach to statically disassembling x86 machine language programs is presented and evaluated. Static disassembly is a crucial prerequisite for software reverse engineering, and has many applications in computer security and binary analysis. The general problem is provably undecidable because of the heavy use of unaligned instruction encodings and dynamically computed control flows in the x86 architecture. Limited work in machine learning and data mining has been undertaken on this subject. This paper shows that semantic meanings of opcode sequences can be leveraged to infer similarities between groups of opcode and operand sequences. This empowers a probabilistic finite state machine to learn statistically significant opcode and operand sequences in a training corpus of disassemblies. The similarities demonstrate the statistical significance of opcodes and operands in a surrounding context, facilitating more accurate disassembly of new binaries. Empirical results demonstrate that the algorithm is more efficient and effective than comparable approaches used by state-of-the-art disassembly tools.

Richard Wartell, Yan Zhou, Kevin W. Hamlen, Murat Kantarcioglu
Collective Matrix Factorization of Predictors, Neighborhood and Targets for Semi-supervised Classification

Due to the small size of available labeled data for semi-supervised learning, approaches to this problem make strong assumptions about the data, performing well only when such assumptions hold true. However, a lot of effort may have to be spent in understanding the data so that the most suitable model can be applied. This process can be as critical as gathering labeled data. One way to overcome this hindrance is to control the contribution of different assumptions to the model, rendering it capable of performing reasonably in a wide range of applications. In this paper we propose a collective matrix factorization model that simultaneously decomposes the predictor, neighborhood and target matrices (PNT-CMF) to achieve semi-supervised classification. By controlling how strongly the model relies on different assumptions, PNT-CMF is able to perform well on a wider variety of datasets. Experiments on synthetic and real-world datasets show that, while state-of-the-art models (TSVM and LapSVM) excel on datasets that match their characteristics and suffer a performance drop on the others, our approach outperforms them by being consistently competitive across different situations.

Lucas Rego Drumond, Lars Schmidt-Thieme, Christoph Freudenthaler, Artus Krohn-Grimberghe
Ranking Tweets with Local and Global Consistency Using Rich Features

Ranking tweets is more challenging in microblog search because of content sparseness and lack of context. Traditional ranking methods essentially using Euclidean distance are limited to local structure. Manifold structure helps to rank with local and global consistency. However, such structure is empirically built on content similarity in an unsupervised way and suffers from sparseness when adopted for tweet ranking. In this paper, we explore rich features to alleviate the content sparseness problem, where a time locality feature is proposed to consider context dependency. We then propose a supervised learning model that aggregates rich features to construct the manifold structure. A learning algorithm is then designed to solve the model by minimizing the loss over labeled queries. Finally, we conduct a series of experiments to demonstrate the performance on 109 labeled queries from TREC Microblogging. Compared with the well-known baselines and the empirical manifold structure, our algorithm achieves consistent improvements on the metrics.

Zhankun Huang, Shenghua Liu, Pan Du, Xueqi Cheng

Graph and Network Mining

Fast Triangle Core Decomposition for Mining Large Graphs

Large triangle cores represent dense subgraphs in which each edge has at least k − 2 triangles (the same as cliques). This paper presents a fast algorithm for computing the triangle core decomposition on big graphs. The proposed triangle core algorithm adapts both the computations and the representation based on the properties of the graph. In addition, we develop a fast edge-based parallel triangle counting algorithm, which lies at the heart of the triangle core decomposition. The proposed algorithm is orders of magnitude faster than the currently available approach. We also investigate and propose fast methods for two variants of the triangle core problem: computing only the top-k triangle cores fast and finding the maximum triangle core number of the graph. The experiments demonstrate the scalability and effectiveness of our approach on 150+ networks with up to 1.8 billion edges. Further, we apply the proposed methods to graph mining tasks including finding dense subgraphs, temporal strong components, and maximum cliques.

Ryan A. Rossi
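
The per-edge triangle counts that the decomposition above peels on can be computed by neighbor-set intersection, as in this small sketch; the paper's edge-based algorithm is parallel and far more scalable, and the toy graph is only illustrative.

```python
from collections import defaultdict

def edge_triangle_counts(edges):
    """Count, for every undirected edge (u, v), the triangles it closes by
    intersecting the neighbor sets of u and v."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    counts = {}
    for u, v in edges:
        counts[(u, v)] = len(adj[u] & adj[v])   # common neighbors = triangles
    return counts

# small example: a 4-clique plus a pendant edge
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
print(edge_triangle_counts(edges))
# every 4-clique edge closes 2 triangles; the pendant edge (3, 4) closes none
```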
Super-Graph Classification

Graphs are popularly used to represent objects with dependency structures, yet all existing graph classification algorithms can only handle simple graphs where each node is a single attribute (or a set of independent attributes). In this paper, we formulate a new super-graph classification task where each node of the super-graph may contain a graph (a single-attribute graph), so a super-graph contains a set of inter-connected graphs. To support super-graph classification, we propose a Weighted Random Walk Kernel (WRWK) which generates a product graph between any two super-graphs and uses the similarity (kernel value) of two single-attribute graphs as the node weight. Then we calculate weighted random walks on the product graph to generate the kernel value between two super-graphs as their similarity. Our method enjoys sound theoretical properties, including bounded similarity. Experiments confirm that our method significantly outperforms baseline approaches.

Ting Guo, Xingquan Zhu
Subtopic Mining via Modifier Graph Clustering

Understanding the information need encoded in a user query has long been regarded as a crucial step of effective information retrieval. In this paper, we focus on subtopic mining, which aims at generating a ranked list of subtopic strings for a given topic. We propose the modifier graph based approach, under which the problem of subtopic mining reduces to that of graph clustering over the modifier graph. Compared with the existing methods, the experimental results show that our modifier-graph based approaches are robust to the sparseness problem. In particular, our approaches that perform subtopic mining at a fine-grained term level outperform the baseline methods that perform subtopic mining at a whole-query level in terms of I-rec, D-nDCG and D#-nDCG.

Hai-Tao Yu, Fuji Ren
Net-Ray: Visualizing and Mining Billion-Scale Graphs

How can we visualize billion-scale graphs? How to spot outliers in such graphs quickly? Visualizing graphs is the most direct way of understanding them; however, billion-scale graphs are very difficult to visualize since the amount of information overflows the resolution of a typical screen.

In this paper we propose Net-Ray, an open-source package for visualization-based mining on billion-scale graphs. Net-Ray visualizes graphs using the spy plot (adjacency matrix patterns), the distribution plot, and the correlation plot, which involve careful node ordering and scaling. In addition, Net-Ray efficiently summarizes scatter clusters of graphs in a way that finds outliers automatically and makes it easy to interpret them visually.

Extensive experiments show that Net-Ray handles very large graphs with billions of nodes and edges efficiently and effectively. Specifically, among the various datasets that we study, we visualize in multiple ways the YahooWeb graph, which spans 1.4 billion webpages and 6.6 billion links, and the Twitter who-follows-whom graph, which consists of 62.5 million users and 1.8 billion edges. We report interesting clusters and outliers spotted and summarized by Net-Ray.

U. Kang, Jay-Yoon Lee, Danai Koutra, Christos Faloutsos
Structure-Aware Distance Measures for Comparing Clusterings in Graphs

Clustering in graphs aims to group vertices with similar patterns of connections. Applications include discovering communities and latent structures in graphs. Many algorithms have been proposed to find graph clusterings, but an open problem is the need for suitable comparison measures to quantitatively validate these algorithms, perform consensus clustering, and track evolving (graph) clusters across time. To date, most comparison measures have focused on comparing the vertex groupings and completely ignore the difference in the structural approximations in the clusterings, which can lead to counter-intuitive comparisons. In this paper, we propose new measures that account for differences in the approximations. We focus on comparison measures for two important graph clustering approaches, community detection and blockmodelling, and propose comparison measures that work for weighted (and unweighted) graphs.

Jeffrey Chan, Nguyen Xuan Vinh, Wei Liu, James Bailey, Christopher A. Leckie, Kotagiri Ramamohanarao, Jian Pei
Efficiently and Fast Learning a Fine-grained Stochastic Blockmodel from Large Networks

The stochastic blockmodel (SBM) has recently come into the spotlight in the domains of social network analysis and statistical machine learning, as it enables us to decompose and then analyze an exploratory network without knowing any a priori information about its intrinsic structure. However, the prohibitive computational cost limits SBM learning algorithms with model selection capability to small networks with hundreds of nodes. This paper presents a fine-grained SBM and its fast learning algorithm, named FSL, which ingeniously combines the component-wise EM (CEM) algorithm and minimum message length (MML) to achieve the parallel learning of parameter estimation and model evaluation. FSL significantly reduces the time complexity of the learning algorithm and scales to networks with thousands of nodes. The experimental results indicate that FSL can achieve the best tradeoff between effectiveness and efficiency, greatly reducing learning time while preserving competitive learning accuracy. Moreover, it is noteworthy that our proposed method shows excellent generalization ability in an application to link prediction.

Xuehua Zhao, Bo Yang, Hechang Chen
Influence Propagation: Patterns, Model and a Case Study

When a free, catchy application shows up, how quickly will people notify their friends about it? Will the enthusiasm drop exponentially with time, or oscillate? What other patterns emerge?

Here we answer these questions using data from the Polly telephone-based application, a large influence network of 72,000 people, with about 173,000 interactions, spanning 500 MB of log data and 200 GB of audio data.

We report surprising patterns, the most striking of which are: (a) the Fizzle pattern, i.e., excitement about Polly shows a power-law decay over time with an exponent of -1.2; (b) the Rendezvous pattern, which obeys a power law (we explain Rendezvous in the text); and (c) the Dispersion pattern: we find that the more a person uses Polly, the fewer friends he will use it with, but in a reciprocal fashion.

Finally, we also propose a generator of influence networks, which generates networks that mimic our discovered patterns.

Yibin Lin, Agha Ali Raza, Jay-Yoon Lee, Danai Koutra, Roni Rosenfeld, Christos Faloutsos
Overlapping Communities for Identifying Misbehavior in Network Communications

In this paper, we study the problem of identifying misbehaving network communications using community detection algorithms. Recently, it was shown that identifying the communications that do not respect community boundaries is a promising approach for network intrusion detection. However, it was also shown that traditional community detection algorithms are not suitable for this purpose.

In this paper, we propose a novel method for enhancing community detection algorithms, and show that contrary to previous work, they provide a good basis for network misbehavior detection. This enhancement extends disjoint communities identified by these algorithms with a layer of auxiliary communities, so that the boundary nodes can belong to several communities. Although non-misbehaving nodes can naturally be in more than one community, we show that the majority of misbehaving nodes belong to multiple overlapping communities, therefore overlapping community detection algorithms can also be deployed for intrusion detection.

Finally, we present a framework for anomaly detection which uses community detection as its basis. The framework allows incorporation of application-specific filters to reduce the false positives induced by community detection algorithms. Our framework is validated using large email networks and flow graphs created from real network traffic.

Farnaz Moradi, Tomas Olovsson, Philippas Tsigas
Fault-Tolerant Concept Detection in Information Networks

Given information about medical drugs and their properties, how can we automatically discover that Aspirin has blood-thinning properties, and thus prevents heart attacks? Expressed in more general terms, if we have a large information network that integrates data from heterogeneous data sources, how can we extract semantic information that provides a better understanding of the integrated data and also helps us to identify missing links? We propose to extract concepts that describe groups of objects and their common properties from the integrated data. The discovered concepts provide semantic information as well as an abstract view of the integrated data and thus improve the understanding of complex systems. Our proposed method has the following desirable properties: (a) it is parameter-free and therefore requires no user-defined parameters; (b) it is fault-tolerant, allowing for the detection of missing links; and (c) it is scalable, being linear in the input size. We demonstrate the effectiveness and scalability of the proposed method on real, publicly available graphs.

Tobias Kötter, Stephan Günnemann, Michael R. Berthold, Christos Faloutsos
Characterizing Temporal Anomalies in Evolving Networks

Many real-world networks evolve over time, indicating their dynamic nature to cope with changing real-life scenarios. Detection of various categories of anomalies, also known as outliers, in graph representations of such network data is essential for discovering different irregular connectivity patterns with potential adverse effects, such as intrusions into a computer network. Characterizing the behavior of such anomalies (outliers) during the evolution of the network over time is critical for their mitigation. In this context, a novel method for effective characterization of network anomalies is proposed here by defining various categories of graph outliers depending on their temporal behavior, noticeable across multiple instances of a network during its evolution. The efficacy of the proposed method is demonstrated through an experimental evaluation using various benchmark graph data sets.

N. N. R. Ranga Suri, M. Narasimha Murty, G. Athithan

Applications

An Integrated Model for User Attribute Discovery: A Case Study on Political Affiliation Identification

Discovering user demographic attributes from social media is a problem of considerable interest. The problem setting can be generalized to include three components: users, topics and behaviors. In recent studies on this problem, however, the behaviors between users and topics are not effectively incorporated. In our work, we propose an integrated unsupervised model which takes into consideration all three components integral to the task. Furthermore, our model incorporates collaborative filtering with probabilistic matrix factorization to solve the data sparsity problem, a computational challenge common to all such tasks. We evaluated our method on a case study of user political affiliation identification and compared it against state-of-the-art baselines. Our model achieved an accuracy of 70.1% on the user party detection task.

Swapna Gottipati, Minghui Qiu, Liu Yang, Feida Zhu, Jing Jiang
Programmatic Buying Bidding Strategies with Win Rate and Winning Price Estimation in Real Time Mobile Advertising

A major trend in mobile advertising is the emergence of real time bidding (RTB) based marketplaces on the supply side and the corresponding programmatic impression buying on the demand side. In order to acquire the most relevant audience impressions at the lowest cost, a demand side player has to accurately estimate the win rate and winning price in the auction, and incorporate that knowledge in its bid. In this paper, we describe our battle-proven techniques for predicting win rate and winning price in RTB, and the corresponding bidding strategies built on top of those predictions. We also reveal the close relationship between win rate and winning price estimation, and demonstrate how to solve the two problems together. All of our estimation methods are developed within a distributed framework and have been applied to data on the order of billions of records in real business operations.

Xiang Li, Devin Guan
Self-training Temporal Dynamic Collaborative Filtering

Recommender systems (RS) based on collaborative filtering (CF) are traditionally incapable of modeling the often non-linear and non-Gaussian tendencies of user taste and product attractiveness, leading to unsatisfactory performance. Particle filtering, as a dynamic modeling method, enables tracking of such tendencies. However, data are often extremely sparse in real-world RS under temporal context, resulting in less reliable tracking. Approaches to this problem seek additional information or impute all or most missing data to reduce sparsity, which then causes scalability problems for particle filtering. In this paper, we develop a novel semi-supervised method to simultaneously solve the problems of data sparsity and scalability in a particle filtering based dynamic recommender system. Specifically, it exploits the self-training principle to dynamically construct observations based on current prediction distributions. The proposed method is evaluated on two public benchmark datasets, showing significant improvement over a variety of existing methods for top-k recommendation in both accuracy and scalability.

Cheng Luo, Xiongcai Cai, Nipa Chowdhury
Data Augmented Maximum Margin Matrix Factorization for Flickr Group Recommendation

User groups on photo-sharing websites such as Flickr are self-organized communities for sharing photos and conversations around similar interests, and have gained massive popularity. However, the huge number of groups makes it hard for users to decide which group to choose. Further, directly applying collaborative filtering techniques to group recommendation suffers from the cold start problem, since many users do not belong to any group. In this paper, we propose a hybrid recommendation approach named Data Augmented Maximum Margin Matrix Factorization (DAM³F), integrating collaborative user-group information and a user similarity graph. Specifically, Maximum Margin Matrix Factorization (MMMF) is employed for the collaborative recommendation, while the user similarity graph obtained from user-uploaded images and annotated tags is used as a complementary part to handle the cold start problem and to improve the performance of MMMF. Experiments conducted on our crawled dataset with 2196 users, 985 groups and 334467 images from Flickr demonstrate the effectiveness of the proposed approach.

Liang Chen, Yilun Wang, Tingting Liang, Lichuan Ji, Jian Wu
A Graph Matching Method for Historical Census Household Linkage

Linking historical census data across time is a challenging task due to various reasons, including data quality, limited individual information, and changes to households over time. Although most census data linking methods link records that correspond to individual household members, recent advances show that linking households as a whole provides more accurate results and fewer multiple household links. In this paper, we introduce a graph-based method to link households, which takes the structural relationships between household members into consideration. Based on individual record linking results, our method builds a graph for each household, so that the matches are determined by both attribute-level and record-relationship similarity. Our experimental results on both synthetic and real historical census data have validated the effectiveness of this method. The proposed method achieves an F-measure of 0.937 on data extracted from real UK census datasets, outperforming all alternative methods being compared.

Zhichun Fu, Peter Christen, Jun Zhou
Intervention-Driven Predictive Framework for Modeling Healthcare Data

Assessing prognostic risk is crucial to clinical care, and critically dependent on both diagnosis and medical interventions. Current methods use this augmented information to build a single prediction rule. But this may not be expressive enough to capture differential effects of interventions on prognosis. To this end, we propose a supervised, Bayesian nonparametric framework that simultaneously discovers the latent intervention groups and builds a separate prediction rule for each intervention group. The prediction rule is learnt using diagnosis data through a Bayesian logistic regression. For inference, we develop an efficient collapsed Gibbs sampler. We demonstrate that our method outperforms baselines in predicting 30-day hospital readmission using two patient cohorts - Acute Myocardial Infarction and Pneumonia. The significance of this model is that it can be applied widely across a broad range of medical prognosis tasks.

Santu Rana, Sunil Kumar Gupta, Dinh Phung, Svetha Venkatesh
Visual Analysis of Uncertainty in Trajectories

Mining trajectory datasets has many important applications. Real trajectory data often involve uncertainty due to inadequate sampling rates and measurement errors. For some trajectories, their precise positions cannot be recovered and the exact routes that vehicles traveled cannot be accurately reconstructed. In this paper, we investigate the uncertainty problem in trajectory data and present a visual analytics system to reveal, analyze, and solve the uncertainties associated with trajectory samples. We first propose two novel visual encoding schemes, called the road map analyzer and the uncertainty lens, for discovering road map errors and visually analyzing the uncertainty in trajectory data, respectively. Then, we conduct three case studies to discover map errors, to address the ambiguity problem in map-matching, and to reconstruct trajectories with historical data. These case studies demonstrate the capability and effectiveness of our system.

Lu Lu, Nan Cao, Siyuan Liu, Lionel Ni, Xiaoru Yuan, Huamin Qu
Activity Recognition Using a Few Label Samples

Sensor-based human activity recognition aims to automatically identify human activities from a series of sensor observations, which is a crucial task for supporting a wide range of applications. Typically, given sufficient training examples for all activities (or activity classes), supervised learning techniques can be applied to build a classification model differentiating the various activities. However, it is often impractical to manually label large amounts of training data for each individual activity. As such, semi-supervised learning techniques are promising alternatives, as they are designed to utilize a small labeled training set L, enhanced by a large unlabeled set U. However, we observe that directly applying semi-supervised learning techniques may not produce accurate classification. In this paper, we design a novel dynamic temporal extension technique to extend L into a bigger training set, and then build a final semi-supervised learning model for more accurate classification. Extensive experiments demonstrate that our proposed technique outperforms 7 existing state-of-the-art supervised and semi-supervised learning techniques.

Heidar Davoudi, Xiao-Li Li, Nguyen Minh Nhut, Shonali Priyadarsini Krishnaswamy
A Framework for Large-Scale Train Trip Record Analysis and Its Application to Passengers’ Flow Prediction after Train Accidents

We have constructed a framework for analyzing passenger behavior in public transportation systems, as understanding such behavior is key to improving the efficiency of public transportation. It uses a large-scale dataset of trip records created from smart card data to estimate passenger flows in a complex metro network. Its interactive flow visualization function enables various unusual phenomena to be observed. We propose a predictive model of passenger behavior after a train accident. Evaluation showed that it can accurately predict passenger flows after a major train accident. The proposed framework is the first step towards real-time observation and prediction for public transportation systems.

Daisaku Yokoyama, Masahiko Itoh, Masashi Toyoda, Yoshimitsu Tomita, Satoshi Kawamura, Masaru Kitsuregawa
Patent Evaluation Based on Technological Trajectory Revealed in Relevant Prior Patents

It is a challenging task for firms to assess the importance of a patent and identify valuable patents as early as possible. Counting the number of citations received is a widely used method to assess the value of a patent. However, recently granted patents have received few citations, which makes the use of citation counts infeasible. In this paper, we propose a novel idea to evaluate the value of new or recently granted patents using recommended relevant prior patents. Our approach is to exploit trends in the temporal patterns of relevant prior patents, which are highly related to patent values. We evaluate the proposed approach on two patent value evaluation tasks with a large-scale collection of U.S. patents. Experimental results show that the models created based on our idea significantly outperform those using the baseline features or patent backward citations.

Sooyoung Oh, Zhen Lei, Wang-Chien Lee, John Yen

Privacy Preserving

Differentially Private Tagging Recommendation Based on Topic Model

Tagging recommender systems allow Internet users to annotate resources with personalized tags and provide users the freedom to obtain recommendations. However, they are usually confronted with serious privacy concerns, because adversaries may re-identify a user and her/his sensitive tags with only a little background information. This paper proposes a privacy-preserving tagging release algorithm, PriTop, which is designed to protect users under the notion of differential privacy. The proposed PriTop algorithm includes three privacy-preserving operations: Private Topic Model Generation structures the uncontrolled tags, Private Weight Perturbation adds Laplace noise to the weights to hide the numbers of tags, and Private Tag Selection finally finds the most suitable replacement tags for the original tags. We present extensive experimental results on four real-world datasets, and the results suggest that the proposed PriTop algorithm can successfully retain the utility of the datasets while preserving privacy.

Tianqing Zhu, Gang Li, Wanlei Zhou, Ping Xiong, Cao Yuan
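
The Private Weight Perturbation step described above relies on the standard Laplace mechanism. A generic sketch follows; the sensitivity and epsilon values are illustrative, and the weights are hypothetical topic-tag weights rather than the paper's actual quantities.

```python
import numpy as np

def laplace_perturb(weights, sensitivity, epsilon):
    """Laplace mechanism: add noise with scale sensitivity/epsilon to each
    weight so that the released weights satisfy epsilon-differential privacy
    (with respect to the stated sensitivity)."""
    scale = sensitivity / epsilon
    noise = np.random.laplace(loc=0.0, scale=scale, size=len(weights))
    return np.asarray(weights, dtype=float) + noise

# hypothetical weights; sensitivity 1 assumes one tag changes a weight by at most 1
private_weights = laplace_perturb([12.0, 3.0, 7.0], sensitivity=1.0, epsilon=0.5)
print(private_weights)
```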
Matrix Factorization without User Data Retention

Recommender systems often rely on a centralized storage of user data and there is a growing concern about the misuse of the data. As recent studies have demonstrated, sensitive information could be inferred from user ratings stored by recommender systems. This work presents a novel semi-decentralized variant of the widely-used Matrix Factorization (MF) technique. The proposed approach eliminates the need for retaining user data, such that neither user ratings nor latent user vectors are stored by the recommender. Experimental evaluation indicates that the performance of the proposed approach is close to that of standard MF, and that the gap between the two diminishes as more user data becomes available. Our work paves the way to a new type of MF recommenders, which avoid user data retention, yet are capable of achieving accuracy similar to that of the state-of-the-art recommenders.

David Vallet, Arik Friedman, Shlomo Berkovsky
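
One way to picture a semi-decentralized MF round with no user-data retention: the user's ratings and latent vector stay on the device, and only item-factor gradients reach the server. This is a toy sketch of the general idea under those assumptions, not the paper's exact protocol, and the hyperparameters and data are hypothetical.

```python
import numpy as np

def local_update(user_vec, item_vecs, ratings, lr=0.05, reg=0.02):
    """One client-side SGD round: the user's latent vector is updated and kept
    locally; only the gradients for the rated items are returned for the server."""
    item_grads = {}
    for item_id, rating in ratings.items():
        err = rating - user_vec @ item_vecs[item_id]
        item_grads[item_id] = -err * user_vec + reg * item_vecs[item_id]
        user_vec += lr * (err * item_vecs[item_id] - reg * user_vec)
    return user_vec, item_grads        # item_grads are all the server ever sees

rng = np.random.default_rng(0)
user_vec = rng.normal(scale=0.1, size=8)            # lives on the device
item_vecs = {42: rng.normal(scale=0.1, size=8),     # global item factors
             7: rng.normal(scale=0.1, size=8)}
user_vec, grads = local_update(user_vec, item_vecs, ratings={42: 4.0, 7: 2.5})
```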
Privacy-Preserving Collaborative Anomaly Detection for Participatory Sensing

In collaborative anomaly detection, multiple data sources submit their data to an on-line service, in order to detect anomalies with respect to the wider population. A major challenge is how to achieve reasonable detection accuracy without disclosing the actual values of the participants’ data. We propose a lightweight and scalable privacy-preserving collaborative anomaly detection scheme called Random Multiparty Perturbation (RMP), which uses a combination of nonlinear and participant-specific linear perturbation. Each participant uses an individually perturbed uniformly distributed random matrix, in contrast to existing approaches that use a common random matrix. A privacy analysis is given for Bayesian Estimation and Independent Component Analysis attacks. Experimental results on real and synthetic datasets using an auto-encoder show that RMP yields comparable results to non-privacy preserving anomaly detection.

Sarah M. Erfani, Yee Wei Law, Shanika Karunasekera, Christopher A. Leckie, Marimuthu Palaniswami
Privacy Preserving Publication of Locations Based on Delaunay Triangulation

The pervasive usage of LBS (Location-Based Services) has caused serious risks to personal privacy. In order to preserve the privacy of locations, only anonymized or perturbed data are published. At the same time, the data mining results on the perturbed data should stay as close as possible to the data mining results on the original data. In this paper, we propose a novel perturbation method such that the Delaunay triangulation of the perturbed data is the same as that of the original data. Theoretically, the Delaunay triangulation of a point data set captures the neighborhood relationships of the points, and many data mining algorithms strongly depend on these neighborhood relationships. Our method is shown to be effective and efficient by running several popular data mining algorithms, such as KNN, K-means, and DBSCAN.

Jun Luo, Jinfei Liu, Li Xiong
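
The property the perturbation above preserves, identical Delaunay triangulations before and after perturbation, can be checked directly with SciPy. The helper below is only a validation utility under that reading of the abstract, not the paper's perturbation method; the points and noise level are hypothetical.

```python
import numpy as np
from scipy.spatial import Delaunay

def same_delaunay(points_a, points_b):
    """Return True if two point sets (matched by index) induce the same
    Delaunay triangulation, i.e. the same set of triangles over point indices."""
    tris_a = {tuple(sorted(s)) for s in Delaunay(points_a).simplices}
    tris_b = {tuple(sorted(s)) for s in Delaunay(points_b).simplices}
    return tris_a == tris_b

original = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.4]])
perturbed = original + np.random.normal(scale=0.01, size=original.shape)
print(same_delaunay(original, perturbed))   # usually True for such tiny noise
```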
A Fast Secure Dot Product Protocol with Application to Privacy Preserving Association Rule Mining

Data mining often causes privacy concerns. To ease the concerns, various privacy preserving data mining techniques have been proposed. However, those techniques are often too computationally intensive to be deployed in practice. Efficiency becomes a major challenge in privacy preserving data mining. In this paper we present an efficient secure dot product protocol and show its application in privacy preserving association rule mining, one of the most widely used data mining techniques. The protocol is orders of magnitude faster than previous protocols because it employs mostly cheap cryptographic operations, e.g. hashing and modular multiplication. The performance has been further improved by parallelization. We implemented the protocol and tested the performance. The test result shows that on moderate commodity hardware, the dot product of two vectors of size 1 million can be computed within 1 minute. As a comparison, the currently most widely used protocol needs about 1 hour and 23 minutes.

Changyu Dong, Liqun Chen
Backmatter
Metadata
Title
Advances in Knowledge Discovery and Data Mining
Editors
Vincent S. Tseng
Tu Bao Ho
Zhi-Hua Zhou
Arbee L. P. Chen
Hung-Yu Kao
Copyright Year
2014
Publisher
Springer International Publishing
Electronic ISBN
978-3-319-06608-0
Print ISBN
978-3-319-06607-3
DOI
https://doi.org/10.1007/978-3-319-06608-0
