
2015 | Book

Advances in Knowledge Discovery and Data Mining

19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part I

Edited by: Tru Cao, Ee-Peng Lim, Zhi-Hua Zhou, Tu-Bao Ho, David Cheung, Hiroshi Motoda

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

This two-volume set, LNAI 9077 + 9078, constitutes the refereed proceedings of the 19th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2015, held in Ho Chi Minh City, Vietnam, in May 2015.

The proceedings contain 117 papers, carefully reviewed and selected from 405 submissions. They are organized in topical sections named: social networks and social media; classification; machine learning; applications; novel methods and algorithms; opinion mining and sentiment analysis; clustering; outlier and anomaly detection; mining uncertain and imprecise data; mining temporal and spatial data; feature extraction and selection; mining heterogeneous, high-dimensional and sequential data; entity resolution and topic modeling; itemset and high-performance data mining; and recommendations.

Table of Contents

Frontmatter

Social Networks and Social Media

Frontmatter
Maximizing Friend-Making Likelihood for Social Activity Organization

The social presence theory in social psychology suggests that computer-mediated online interactions are inferior to face-to-face, in-person interactions. In this paper, we consider the scenario of organizing in-person friend-making social activities via online social networks (OSNs) and formulate a new research problem, namely Hop-bounded Maximum Group Friending (HMGF), by modeling both existing friendships and the likelihood of new friend-making. To find a set of attendees for socialization activities, HMGF is unique and challenging due to the interplay of the group size, the constraint on existing friendships, and the objective function on the likelihood of friend-making. We prove that HMGF is NP-hard and that no approximation algorithm exists unless $$P=NP$$. We then propose an error-bounded approximation algorithm to efficiently obtain solutions very close to the optimum. We conduct a user study to validate our problem formulation and perform extensive experiments on real datasets to demonstrate the efficiency and effectiveness of our proposed algorithm.

Chih-Ya Shen, De-Nian Yang, Wang-Chien Lee, Ming-Syan Chen
What Is New in Our City? A Framework for Event Extraction Using Social Media Posts

Post streams from public social media platforms such as Instagram and Twitter have become precious but noisy data sources for discovering what is happening around us. In this paper, we focus on the problem of detecting and presenting local events in real time using social media content, and propose a novel framework for real-time city event detection and extraction. The proposed framework first applies burst detection to discover candidate event signals from Instagram and Twitter post streams. It then integrates the two post streams to extract features for candidate event signals and classifies them as true events or noise. For the true events, the framework extracts various information to summarize and present them. We also propose a novel method that combines text, image and geolocation information to retrieve relevant photos for detected events. Through experiments on a large dataset, we show that integrating Instagram and Twitter post streams can improve event detection accuracy, and that properly combining text, image and geolocation information retrieves more relevant photos for events. Through case studies, we also show that the framework is able to report detected events with low spatial and temporal deviation.

Chaolun Xia, Jun Hu, Yan Zhu, Mor Naaman
Link Prediction in Aligned Heterogeneous Networks

Social networks develop rapidly and often contain heterogeneous information. When users join a new social network, recommendations shape their first impressions of it, so link prediction for new users is important. However, due to the lack of sufficient activity data for new users in the new social network (the target network), link prediction often encounters the cold-start problem. In this paper, we attempt to solve the user-user link prediction problem for new users by utilizing data in a similar social network (the source network). In order to bridge the two networks, three categories of local features related to a single edge and one category of global features associated with multiple edges are selected. The Aligned Factor Graph (AFG) model is proposed for prediction, and the Aligned Structure Algorithm is used to reduce the factor-graph scale while maintaining prediction performance. Experiments on two real social networks, Twitter and Foursquare, show that the AFG model works well when users leave little data in the target network.

Fangbing Liu, Shu-Tao Xia
Scale-Adaptive Group Optimization for Social Activity Planning

Studies have shown that each person is more inclined to enjoy a group activity when 1) she is interested in the activity, and 2) many friends with the same interest join it as well. Nevertheless, even with the interest and social tightness information available in online social networks, nowadays many social group activities still need to be coordinated manually. In this paper, therefore, we first formulate a new problem, named Participant Selection for Group Activity (PSGA), to decide the group size and select proper participants so that the sum of personal interests and social tightness of the participants in the group is maximized, while the activity cost is also carefully examined. To solve the problem, we design a new randomized algorithm, named Budget-Aware Randomized Group Selection (BARGS), to optimally allocate the computation budgets for effective selection of the group size and participants, and we prove that BARGS can acquire the solution with a guaranteed performance bound. The proposed algorithm was implemented in Facebook, and experimental results demonstrate that social groups generated by the proposed algorithm significantly outperform the baseline solutions.

Hong-Han Shuai, De-Nian Yang, Philip S. Yu, Ming-Syan Chen
Influence Maximization Across Partially Aligned Heterogenous Social Networks

The influence maximization problem aims at finding a subset of seed users who can maximize the spread of influence in online social networks (OSNs). Existing works mostly focus on a single homogeneous network. However, in the real world, OSNs (1) are usually heterogeneous, with users influencing each other through multiple channels; and (2) share common users, via whom information can propagate across networks.

In this paper, for the first time, we study the influence maximization problem in multiple partially aligned heterogeneous OSNs. A new model, the multi-aligned multi-relational network influence maximizer (M&M), is proposed to address this problem. M&M extracts multi-aligned multi-relational networks (MMNs) from aligned heterogeneous OSNs based on a set of inter- and intra-network social meta paths. Besides, M&M extends the traditional linear threshold (LT) model to depict information diffusion across MMNs. In addition, M&M, which selects seed users greedily, is proved to achieve a $$(1-\frac{1}{e})$$-approximation of the optimal solution. Extensive experiments conducted on two real-world partially aligned heterogeneous OSNs demonstrate its effectiveness.

Qianyi Zhan, Jiawei Zhang, Senzhang Wang, Philip S. Yu, Junyuan Xie
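The $$(1-\frac{1}{e})$$ guarantee above is the classic bound for greedy selection under a monotone submodular spread function. A minimal sketch of that generic greedy step, using a toy coverage-style spread in place of the paper's diffusion model (the `reach` sets and all names are illustrative assumptions, not from the paper):

```python
def greedy_seeds(spread, candidates, k):
    """Greedy seed selection: repeatedly add the node with the largest
    marginal gain in spread. For a monotone submodular spread function
    this achieves a (1 - 1/e)-approximation of the optimum."""
    seeds = set()
    for _ in range(k):
        base = spread(seeds)
        best, best_gain = None, 0.0
        for c in sorted(candidates - seeds):  # sorted for determinism
            gain = spread(seeds | {c}) - base
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:  # no candidate adds anything
            break
        seeds.add(best)
    return seeds

# Toy stand-in for expected influence: each seed reaches a fixed user set,
# and spread(S) counts the distinct users reached (monotone submodular).
reach = {'a': {1, 2, 3}, 'b': {3, 4}, 'c': {4, 5, 6, 7}, 'd': {1, 7}}
spread = lambda S: len(set().union(*(reach[u] for u in S))) if S else 0
print(sorted(greedy_seeds(spread, set(reach), 2)))  # ['a', 'c']
```

Greedy first picks 'c' (covers 4 users), then 'a' (adds 3 more); the actual M&M model replaces this toy coverage function with expected spread under its extended LT diffusion.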
Multiple Factors-Aware Diffusion in Social Networks

Information diffusion is the natural phenomenon by which information propagates from node to node over a social network. A node's decision to adopt a piece of information can be affected by different factors. Many previously proposed diffusion models consider one or several fixed factors, but the factors affecting a node's adoption decision differ from one scenario to another and may not have been seen before. For a new diffusion scenario with new factors, previous diffusion models may not model the diffusion well, or may not be applicable at all. In this work, we aim to design a diffusion model in which the factors considered are flexible to extend and change. We further propose a framework for learning the model's parameters that is independent of the factors considered. Therefore, with different factors, our diffusion model can be adapted to more diffusion scenarios without modifying the model or the learning framework. In experiments, we show that our diffusion model is very effective on the task of activation prediction on a Twitter dataset.

Chung-Kuang Chou, Ming-Syan Chen
Understanding Community Effects on Information Diffusion

In social network research, the study of communities is a flourishing area that has led to insightful solutions for many practical challenges. Despite the ubiquity of communities in social networks and their power to characterize users and links, they have not been explicitly considered in information diffusion models. Previous studies on social networks discovered that links between communities function differently from those within communities. However, no information diffusion model has yet considered how the community structure affects the diffusion process.

Motivated by this important absence, we conduct exploratory studies on the effects of communities in information diffusion processes. Our observations on community effects can help to solve many tasks in the studies of information diffusion. As an example, we show its application in solving one of the most important problems about information diffusion: the influence maximization problem. We propose a community-based fast influence (CFI) model which leverages the community effects on the diffusion of information and provides an effective approximate algorithm for the influence maximization problem.

Shuyang Lin, Qingbo Hu, Guan Wang, Philip S. Yu
On Burst Detection and Prediction in Retweeting Sequence

Message propagation via retweet chains can be regarded as a social contagion process. In this paper, we examine burst patterns in retweet activities. A burst is a large number of retweets of a particular tweet occurring within a short time window, indicating that the original tweet receives abnormally high attention during the burst period. It is therefore important to characterize burst patterns and develop algorithms to detect and predict bursts. We propose the use of Cantelli's inequality to identify bursts from retweet sequence data. We conduct a comprehensive empirical analysis of a large microblogging dataset collected from Sina Weibo and report our observations of burst patterns. Based on these findings, we extract various features from users' profiles, followship topology, and message topics, and investigate whether and how accurately we can predict bursts using classifiers based on the extracted features. Our empirical study of the Sina Weibo data shows the feasibility of burst prediction using appropriately extracted features and classic classifiers.

Zhilin Luo, Yue Wang, Xintao Wu, Wandong Cai, Ting Chen
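Cantelli's one-sided inequality, P(X - mean >= k*std) <= 1/(1 + k^2), turns a chosen tail probability into a count threshold. A minimal sketch of threshold-based burst flagging in this spirit (the baseline/window split and all parameter values are illustrative assumptions, not the paper's exact procedure):

```python
import math

def cantelli_threshold(mean, std, p=0.05):
    # Cantelli: P(X - mean >= k*std) <= 1 / (1 + k^2).
    # Solving 1 / (1 + k^2) = p gives k = sqrt(1/p - 1).
    k = math.sqrt(1.0 / p - 1.0)
    return mean + k * std

def detect_bursts(history, counts, p=0.05):
    """Flag windows in `counts` whose retweet count exceeds the
    Cantelli threshold estimated from a quiet `history` window."""
    n = len(history)
    mean = sum(history) / n
    std = (sum((c - mean) ** 2 for c in history) / n) ** 0.5
    thr = cantelli_threshold(mean, std, p)
    return [i for i, c in enumerate(counts) if c > thr]

# Quiet baseline counts, then a test window with an obvious spike.
print(detect_bursts([3, 2, 4, 3, 2, 3, 2, 4, 3, 2], [3, 5, 50, 4]))  # [2]
```

The appeal of Cantelli's bound here is that it needs only the mean and variance of the count sequence, with no distributional assumption on retweet arrivals.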
#FewThingsAboutIdioms: Understanding Idioms and Its Users in the Twitter Online Social Network

To help users find popular topics of discussion, Twitter periodically publishes ‘trending topics’ (trends), which are the most discussed keywords (e.g., hashtags) at a certain point of time. Inspection of the trends over several months reveals that while most of the trends are related to events in the off-line world, such as popular television shows, sports events, or emerging technologies, a significant fraction are not related to any topic or event in the off-line world. Such trends are usually known as idioms, examples being #4WordsBeforeBreakup, #10thingsIHateAboutYou, etc. We perform the first systematic measurement study on Twitter idioms. We find that tweets related to a particular idiom normally do not cluster around any particular topic or event. There are a set of users in Twitter who predominantly discuss idioms – common, not-so-popular, but active users who mostly use Twitter as a conversational platform – as opposed to other users who primarily discuss topical contents. The implication of these findings is that within a single online social network, activities of users may have very different semantics; thus, tasks like community detection and recommendation may not be accomplished perfectly using a single universal algorithm. Specifically, we run two (link-based and content-based) algorithms for community detection on the Twitter social network, and show that idiom-oriented users get clustered better in one while topical users cluster better in the other. Finally, we build a novel service which shows trending idioms and recommends idiom users to follow.

Koustav Rudra, Abhijnan Chakraborty, Manav Sethi, Shreyasi Das, Niloy Ganguly, Saptarshi Ghosh
Retweeting Activity on Twitter: Signs of Deception

Given the re-broadcasts (i.e. retweets) of posts in Twitter, how can we spot fake from genuine user reactions? What will be the tell-tale sign — the connectivity of retweeters, their relative timing, or something else? High retweet activity indicates influential users, and can be monetized. Hence, there are strong incentives for fraudulent users to artificially boost their retweets’ volume. Here, we explore the identification of fraudulent and genuine retweet threads. Our main contributions are: (a) the discovery of patterns that fraudulent activity seems to follow (the “triangles” and “homogeneity” patterns, and the formation of micro-clusters in appropriate feature spaces); and (b) “RTGen”, a realistic generator that mimics the behaviors of both honest and fraudulent users. We present experiments on a dataset of more than 6 million retweets crawled from Twitter.

Maria Giatsoglou, Despoina Chatzakou, Neil Shah, Christos Faloutsos, Athena Vakali
Resampling-Based Gap Analysis for Detecting Nodes with High Centrality on Large Social Network

We address the problem of identifying nodes with high centrality in a large social network, based on an approximation derived only from nodes sampled from the network. More specifically, we detect gaps between nodes with a given confidence level: a gap exists between two adjacent nodes, ordered in descending order of their approximate centrality values, if it divides the ordered list into two groups such that any node in one group has a higher true centrality value than any node in the other, with the given confidence level. To this end, we incorporate confidence intervals of true centrality values and apply a resampling-based framework to estimate these intervals as accurately as possible. Furthermore, we devise an algorithm that can efficiently detect gaps by making only two passes through the nodes. Using three real-world social networks, we empirically show that the proposed method detects more gaps than one adopting a standard error estimation framework at the same node coverage ratio, and that the resulting gaps enable us to correctly identify a set of nodes having high centrality values.

Kouzou Ohara, Kazumi Saito, Masahiro Kimura, Hiroshi Motoda

Classification

Frontmatter
Double Ramp Loss Based Reject Option Classifier

The performance of a reject option classifier is quantified using the $$0-d-1$$ loss, where $$d \in (0, 0.5)$$ is the loss for rejection. In this paper, we propose the double ramp loss function, which gives a continuous upper bound for the $$0-d-1$$ loss. Our approach is based on minimizing the regularized risk under the double ramp loss using difference of convex programming. We show the effectiveness of our approach through experiments on synthetic and benchmark datasets; it performs better than state-of-the-art reject option classification approaches.

Naresh Manwani, Kalpit Desai, Sanand Sasidharan, Ramasubramanian Sundararajan
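One way to see how a continuous surrogate can upper-bound the discrete $$0-d-1$$ loss is to sum two shifted ramp functions. The sketch below is purely illustrative; the paper's exact parameterization of the double ramp loss may differ:

```python
def ramp(z):
    # Clipped hinge: 1 for z <= 0, linearly down to 0 at z >= 1.
    return min(1.0, max(0.0, 1.0 - z))

def zero_d_one_loss(margin, d=0.3, rho=1.0):
    """0-d-1 loss: 0 if confidently correct (margin > rho),
    d if rejected (|margin| <= rho), 1 if confidently wrong.
    Here margin = y * f(x) with y in {-1, +1}."""
    if margin > rho:
        return 0.0
    if margin >= -rho:
        return d
    return 1.0

def double_ramp(margin, d=0.3, rho=1.0):
    """A continuous upper bound on the 0-d-1 loss built from two
    shifted ramps (an illustrative construction, not necessarily
    the paper's exact double ramp loss)."""
    return d * ramp(margin - rho) + (1.0 - d) * ramp(margin + rho)

# The ramp-based surrogate dominates the discrete loss everywhere.
for m in [-3, -1.5, -0.5, 0, 0.5, 1.5, 3]:
    assert double_ramp(m) >= zero_d_one_loss(m) - 1e-12
print(round(double_ramp(0.0), 2))  # 0.3
```

Being a difference of convex functions, ramp-style surrogates are exactly the kind of objective that difference-of-convex programming handles.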
Efficient Methods for Multi-label Classification

As a generalized form of multi-class classification, multi-label classification allows each sample to be associated with multiple labels. This task becomes challenging when the number of labels grows large, which demands high efficiency. Many approaches have been proposed to address this problem; one main idea is to select a subset of labels that approximately spans the original label space and to train only on the selected labels. However, the existing sampling algorithms either require a nondeterministic number of sampling trials or are time-consuming. In this paper, we propose two label selection methods for multi-label classification: (i) clustering-based sampling (CBS), which uses a deterministic number of sampling trials; and (ii) frequency-based sampling (FBS), which utilizes only label frequency statistics and is therefore more efficient. Moreover, neither algorithm needs to perform the singular value decomposition (SVD) of the label matrix used in the aforementioned approaches. Experiments on several real-world multi-label data sets with label counts ranging from hundreds to thousands show that the proposed approaches achieve state-of-the-art performance among label-space-reduction-based multi-label classification algorithms.

Chonglin Sun, Chunting Zhou, Bo Jin, Francis C. M. Lau
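The core idea of label-space reduction by frequency can be sketched very simply: keep only the most frequent labels as the reduced space. This toy version is a simplification in the spirit of FBS (the function name, data, and cut-off are illustrative assumptions, not the paper's procedure):

```python
from collections import Counter

def frequency_based_selection(label_sets, m):
    """Illustrative frequency-based label selection: count how often
    each label occurs across samples and keep the m most frequent
    labels as the reduced label space."""
    counts = Counter(label for s in label_sets for label in s)
    return [label for label, _ in counts.most_common(m)]

# Each sample carries a set of labels; 'a' is most frequent, then 'b'.
data = [{'a', 'b'}, {'a'}, {'a', 'c'}, {'b', 'c'}, {'a', 'b'}]
print(frequency_based_selection(data, 2))  # ['a', 'b']
```

Note that this requires only one pass over the label sets and no SVD of the label matrix, which is the efficiency argument the abstract makes.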
A Coupled k-Nearest Neighbor Algorithm for Multi-label Classification

ML-$$k$$NN is a well-known algorithm for multi-label classification. Although effective in some cases, ML-$$k$$NN has a defect: it is a binary relevance classifier that considers only one label at a time. In this paper, we present a new method for multi-label classification based on lazy learning, which classifies an unseen instance on the basis of its $$k$$ nearest neighbors. By introducing a coupled similarity between class labels, the proposed method exploits the correlations between class labels, overcoming this shortcoming of ML-$$k$$NN. Experiments on benchmark data sets show that our proposed Coupled Multi-Label $$k$$-Nearest Neighbor algorithm (CML-$$k$$NN) outperforms several existing multi-label classification algorithms.

Chunming Liu, Longbing Cao
Learning Topic-Oriented Word Embedding for Query Classification

In this paper, we propose a topic-oriented word embedding approach to the query classification problem. First, topic information is encoded to generate query categories. Then, user click-through information is incorporated into the modified word embedding algorithms. After that, short and ambiguous queries are enriched and classified in a supervised learning manner. As a unique contribution, we present four neural network strategies based on the proposed model. The experiments are conducted on two open data sets, from Baidu and Sogou, two major commercial search companies. Our evaluation results show that the proposed approach is promising on both large data sets: under the four proposed strategies, we achieve performance as high as 95.73% precision and a 97.79% F1 measure.

Hebin Yang, Qinmin Hu, Liang He
Reliable Early Classification on Multivariate Time Series with Numerical and Categorical Attributes

Early classification on multivariate time series has recently emerged as a novel and important topic in data mining, with wide applications such as early detection of diseases in healthcare. Most existing studies on this topic focus only on univariate time series, while some very recent works exploring multivariate time series consider only numerical attributes and are not applicable to multivariate time series containing both numerical and categorical attributes. In this paper, we present a novel methodology named REACT (Reliable EArly ClassificaTion), which is the first work to construct an effective serial classifier on multivariate time series with numerical and categorical attributes that guarantees stability of accuracy compared to classifiers using the full-length time series. Furthermore, we employ GPU parallel computing to develop an extended mechanism for building the early classifier efficiently. Experimental results on real datasets show that REACT significantly outperforms the state-of-the-art method in terms of accuracy and earliness, and the GPU implementation is verified to enhance efficiency by several orders of magnitude.

Yu-Feng Lin, Hsuan-Hsu Chen, Vincent S. Tseng, Jian Pei
Distributed Document Representation for Document Classification

The distributed vector representations learned from deep learning frameworks have shown great power in capturing the semantic meaning of words, phrases and sentences, from which multiple NLP applications have benefited. Just as words combine to form the meaning of sentences, sentences combine to form the meaning of documents, so the idea of representing each document with a dense distributed representation holds promise. In this paper, we propose a supervised framework (Compound RNN) for document classification based on document-level distributed representations learned with a deep architecture. Our framework first obtains sentence-level distributed representations by operating on the parse tree structure with a recursive neural network, and then obtains the document-level representation by composing the sentence vectors with a recurrent neural network. Compound RNN outperforms existing document representations such as bag-of-words and LDA on multiple text classification/regression tasks.

Rumeng Li, Hiroyuki Shindo
Prediction of Emergency Events: A Multi-Task Multi-Label Learning Approach

Prediction of patient outcomes is critical for planning resources in a hospital emergency department. We present a method to exploit longitudinal data from Electronic Medical Records (EMR), whilst exploiting multiple patient outcomes. We divide the EMR data into segments, where each segment is a task, and all tasks are associated with multiple patient outcomes over 3-, 6- and 12-month periods. We propose a model that learns a prediction function for each task-label pair, interacting through two subspaces: the first subspace imposes sharing across all tasks for a given label; the second captures task-specific variations and is shared across all labels for a given task. The proposed model is formulated as an iterative optimization problem and solved using a scalable and efficient block coordinate descent (BCD) method. We apply the proposed model to two hospital cohorts, cancer and Acute Myocardial Infarction (AMI) patients, collected over a two-year period from a large hospital emergency department. We show that the predictive performance of our models is significantly better than that of several state-of-the-art multi-task and multi-label learning methods.

Budhaditya Saha, Sunil K. Gupta, Svetha Venkatesh
Nearest Neighbor Method Based on Local Distribution for Classification

The $$k$$-nearest-neighbor ($$k$$NN) algorithm is a simple but effective classification method that predicts the class label of a query sample based on information contained in its neighborhood. Previous versions of $$k$$NN usually consider the $$k$$ nearest neighbors separately, through quantity or distance information alone. However, quantity and isolated distance information may be insufficient for effective classification decisions. This paper investigates the $$k$$NN method from the perspective of the local distribution, based on which we propose an improved implementation of $$k$$NN. The proposed method assigns the query sample to the class with the maximum posterior probability, estimated from the local distribution via the Bayesian rule. Experiments conducted on 15 benchmark datasets demonstrate excellent performance and robustness of the proposed method compared to other state-of-the-art classifiers.

Chengsheng Mao, Bin Hu, Philip Moore, Yun Su, Manman Wang
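The basic ingredient, estimating a class posterior from a query's neighborhood, can be sketched with a distance-weighted kNN vote. This is a deliberate simplification, not the paper's local-distribution estimator (all data and parameters below are illustrative):

```python
import math

def knn_posterior(train, labels, x, k=3):
    """Toy distance-weighted kNN posterior: estimate P(class | x)
    from the class mix of the k nearest neighbors, weighting each
    neighbor by its inverse distance to the query."""
    neighbors = sorted(
        (math.dist(p, x), y) for p, y in zip(train, labels)
    )[:k]
    weights = {}
    for dist, y in neighbors:
        weights[y] = weights.get(y, 0.0) + 1.0 / (dist + 1e-9)
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}

# Two well-separated clusters: class 0 near the origin, class 1 near (5, 5).
train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]
post = knn_posterior(train, labels, (0.2, 0.2), k=3)
print(max(post, key=post.get))  # 0
```

The paper's contribution is to replace this neighbor-counting estimate with one derived from the local class-conditional distribution; the Bayes-rule decision (pick the class with maximum posterior) is the shared skeleton.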
Immune Centroids Over-Sampling Method for Multi-Class Classification

To improve classification performance in imbalanced learning, a novel over-sampling method, Global Immune Centroids Over-Sampling (Global-IC), based on an immune network, is proposed. Global-IC generates a set of representative immune centroids to broaden the decision regions of small class spaces. The representative immune centroids are treated as synthetic examples in order to resolve the imbalance problem. We utilize an artificial immune network to generate synthetic examples on clusters with high data densities, addressing a shortcoming of synthetic minority over-sampling techniques, which do not reflect groupings of training examples. Our comprehensive experimental results show that Global-IC achieves better performance than renowned multi-class resampling methods.

Xusheng Ai, Jian Wu, Victor S. Sheng, Pengpeng Zhao, Yufeng Yao, Zhiming Cui
Optimizing Classifiers for Hypothetical Scenarios

The deployment of classification models is an integral component of many modern data mining and machine learning applications. A typical classification model is built with the tacit assumption that the deployment scenario by which it is evaluated is fixed and fully characterized. Yet, in the practical deployment of classification methods, important aspects of the application environment, such as the misclassification costs, may be uncertain during model building. Moreover, a single classification model may be applied in several different deployment scenarios. In this work, we propose a method to optimize a model for uncertain deployment scenarios. We begin by deriving a relationship between two evaluation measures, the H measure and cost curves, that may be used to address uncertainty in classifier performance. We show that when uncertainty in classifier performance is modeled as a probabilistic belief that is a function of this underlying relationship, a natural definition of risk emerges for both classifiers and instances. We then leverage this notion of risk to develop a boosting-based algorithm, which we call RiskBoost, that directly mitigates classifier risk, and we demonstrate that it outperforms AdaBoost on a diverse selection of datasets.

Reid A. Johnson, Troy Raeder, Nitesh V. Chawla
Repulsive-SVDD Classification

Support vector data description (SVDD) is a well-known kernel method that constructs a minimal hypersphere as a data description for a given data set. However, SVDD does not take into account the statistical distribution of the data when constructing that optimal hypersphere, and it applies only to one-class classification problems. This paper proposes a new approach to SVDD that addresses both limitations. We formulate an optimisation problem for binary classification in which we construct two hyperspheres, one enclosing positive samples and the other enclosing negative samples, and during optimisation we move the two hyperspheres apart to maximise the margin between them while the data samples of each class remain inside their own hyperspheres. Experimental results show good performance for the proposed method.

Phuoc Nguyen, Dat Tran
Centroid-Means-Embedding: An Approach to Infusing Word Embeddings into Features for Text Classification

This paper presents a word embedding-based approach to text classification. We introduce a new vector space model called the Semantically-Augmented Statistical Vector Space Model (SAS-VSM), a statistical VSM augmented with a semantic VSM for information access systems, especially automatic text classification. In the SAS-VSM, we first implement a primary approach that concatenates continuous-valued semantic features with an existing statistical VSM. We then introduce the Centroid-Means-Embedding (CME) method, which updates existing statistical feature vectors with semantic knowledge. Experimental results on the 20 Newsgroups and RCV1-v2/LYRL2004 datasets with Support Vector Machine (SVM) classifiers show that the proposed CME-based SAS-VSM approaches are promising across different weighting schemes, outperforming the other approaches in both micro-F$$_1$$ and categorical performance.

Mohammad Golam Sohrab, Makoto Miwa, Yutaka Sasaki

Machine Learning

Frontmatter
Collaborating Differently on Different Topics: A Multi-Relational Approach to Multi-Task Learning

Multi-task learning offers a way to benefit from the synergy of multiple related prediction tasks via their joint modeling. Current multi-task techniques model related tasks jointly, assuming that the tasks share the same relationship uniformly across all features. This assumption is seldom true, as tasks may be related across some features but not others. Addressing this problem, we propose a new multi-task learning model that learns separate task relationships along different features. This added flexibility allows our model a finer, differential level of control in the joint modeling of tasks along different features. We formulate the model as an optimization problem and provide an efficient, iterative solution. We illustrate the behavior of the proposed model using a synthetic dataset where we induce varied feature-dependent task relationships: positive, negative, and none. Using four real datasets, we evaluate the effectiveness of the proposed model on several multi-task regression and classification problems, and demonstrate its superiority over other state-of-the-art multi-task learning models.

Sunil Kumar Gupta, Santu Rana, Dinh Phung, Svetha Venkatesh
Multi-Task Metric Learning on Network Data

Multi-task learning (MTL) has been shown to improve prediction performance in a number of different contexts by learning models jointly on multiple different, but related, tasks. In this paper, we propose to do MTL on general network data, which provide an important context for MTL. We first show that MTL on network data is a common problem with many concrete and valuable applications. Then, we propose a metric learning approach that can effectively exploit correlation across multiple tasks and networks. The proposed approach builds on structural metric learning and intermediate parameterization, and has an efficient implementation via stochastic gradient descent. In experiments, we challenge it with two common real-world applications: citation prediction for Wikipedia articles and social circle prediction in Google+. The proposed method achieves promising results and exhibits good convergence behavior.

Chen Fang, Daniel N. Rockmore
A Bayesian Nonparametric Approach to Multilevel Regression

Regression is a cornerstone of statistical analysis. Multilevel regression, on the other hand, receives little research attention, though it is prevalent in economics, biostatistics and healthcare, to name a few. We present a Bayesian nonparametric framework for multilevel regression where individuals, including observations and outcomes, are organized into groups. Furthermore, our approach exploits additional group-specific context observations: we use a Dirichlet process with a product-space base measure in a nested structure to model the group-level context distribution and the regression distribution, accommodating the multilevel structure of the data. The proposed model simultaneously partitions groups into clusters and performs regression. We provide a collapsed Gibbs sampler for posterior inference, and perform extensive experiments on econometric panel data and healthcare longitudinal data to demonstrate the effectiveness of the proposed model.

Vu Nguyen, Dinh Phung, Svetha Venkatesh, Hung H. Bui
Learning Conditional Latent Structures from Multiple Data Sources

Data usually come from heterogeneous sources. When dealing with multiple data sources, existing models often treat them independently and thus cannot explicitly model the correlation structures among the sources. To address this problem, we propose a fully Bayesian nonparametric approach to model correlation structures among multiple and heterogeneous datasets. The proposed framework first induces a mixture distribution over the primary data source using hierarchical Dirichlet processes (HDP). Conditioned on each atom (group) discovered in this step, the context data sources are mutually independent and each is generated from a hierarchical Dirichlet process. In each specific application, which covariates constitute the content or context(s) is determined by the nature of the data. We also derive efficient inference and exploit the conditional independence structure to propose a (conditionally) parallel Gibbs sampling scheme. We apply our model to the problem of latent activity discovery in pervasive computing using mobile data, and show the advantage of utilizing multiple data sources in terms of exploratory analysis as well as quantitative clustering performance.

Viet Huynh, Dinh Phung, Long Nguyen, Svetha Venkatesh, Hung H. Bui
Collaborative Multi-view Learning with Active Discriminative Prior for Recommendation

Learning from multi-view data is important in many applications. However, traditional multi-view learning algorithms require the representation of the multi-view data to be available in advance, so it is hard to apply these methods to the recommendation task directly. In fact, the idea of multi-view learning is particularly suitable for alleviating the sparsity challenge faced by various recommender systems, by adding an additional view to augment the traditional view of the sparse rating matrix. In this paper, we propose a unified Collaborative Multi-view Learning (CML) framework for recommender systems, which can exploit a task-adaptive multi-view representation of the data automatically. The main idea is to formulate a joint optimization framework combining the merits of the matrix factorization model and the transfer learning technique in a multi-view framework. Experiments on real-life public datasets show that our model outperforms state-of-the-art baselines.

Qing Zhang, Houfeng Wang
Online and Stochastic Universal Gradient Methods for Minimizing Regularized Hölder Continuous Finite Sums in Machine Learning

Online and stochastic gradient methods have emerged as potent tools in large-scale optimization for both smooth convex and nonsmooth convex problems from the classes $$C^{1,1}(\mathbb {R}^p)$$ and $$C^{1,0}(\mathbb {R}^p)$$, respectively. However, to the best of our knowledge, few papers use incremental gradient methods to optimize the intermediate classes of convex problems with Hölder continuous gradients, $$C^{1,v}(\mathbb {R}^p)$$. In order to bridge the gap between the methods for smooth and nonsmooth problems, we propose several online and stochastic universal gradient methods, which do not need to know the actual degree of smoothness of the objective function in advance. We thereby expand the scope of the problems involved in machine learning to Hölder continuous functions and propose a general family of first-order methods. Regret and convergence analysis shows that our methods enjoy strong theoretical guarantees. For the first time, we establish algorithms that enjoy a linear convergence rate for convex functions that have Hölder continuous gradients.
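
For reference, $$C^{1,v}(\mathbb {R}^p)$$ above denotes the standard class of convex functions whose gradients are Hölder continuous with exponent $$v \in [0,1]$$, i.e. $$\Vert \nabla f(x) - \nabla f(y)\Vert \le L_v \Vert x - y\Vert ^v$$ for all $$x, y \in \mathbb {R}^p$$; setting $$v = 1$$ recovers the smooth class $$C^{1,1}(\mathbb {R}^p)$$, and $$v = 0$$ the nonsmooth class $$C^{1,0}(\mathbb {R}^p)$$. (This definition is standard in the universal gradient method literature, not specific to this paper.)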

Ziqiang Shi, Rujie Liu
Context-Aware Detection of Sneaky Vandalism on Wikipedia Across Multiple Languages

The malicious modification of articles, termed vandalism, is a serious problem for open-access encyclopedias such as Wikipedia. Wikipedia’s counter-vandalism bots and past vandalism detection research have greatly reduced the exposure and damage of common and obvious types of vandalism. However, increasingly sneaky types of vandalism remain, which are clearly out of context of the sentence or article. In this paper, we propose a novel context-aware and cross-language vandalism detection technique that scales to the size of the full Wikipedia and extends the types of vandalism detectable beyond past feature-based approaches. Our technique uses word dependencies to identify vandal words in sentences by combining part-of-speech tagging with a conditional random fields classifier. We evaluate our technique on two Wikipedia data sets: the PAN data sets with over 62,000 edits, commonly used by related research; and our own vandalism repairs data sets with over 500 million edits of over 9 million articles from five languages. As a comparison, we implement a feature-based classifier to analyse the quality of each classification technique and the trade-offs of each type of classifier. Our results show how context-aware detection techniques can become a new counter-vandalism tool for Wikipedia that complements current feature-based techniques.

Khoi-Nguyen Tran, Peter Christen, Scott Sanner, Lexing Xie
Uncovering the Latent Structures of Crowd Labeling

Crowdsourcing provides a new way to distribute enormous tasks to a crowd of annotators. The divergent knowledge backgrounds and personal preferences of crowd annotators lead to noisy (or even inconsistent) answers to the same question. However, diverse labels provide us information about the underlying structures of tasks and annotators. This paper proposes latent-class assumptions for learning-from-crowds models, that is, items can be separated into several latent classes and workers’ annotating behaviors may differ among different classes. We propose a nonparametric model to uncover the latent classes, and also extend the state-of-the-art minimax entropy estimator to learn latent structures. Experimental results on both synthetic data and real data collected from Amazon Mechanical Turk demonstrate that our methods can disclose interesting and meaningful latent structures, and that incorporating latent class structures can also bring significant improvements on ground truth label recovery for difficult tasks.

Tian Tian, Jun Zhu
Use Correlation Coefficients in Gaussian Process to Train Stable ELM Models

This paper proposes a new method to train stable extreme learning machines (ELM). The new method, called StaELM, uses correlation coefficients in a Gaussian process to measure the similarities between different hidden layer outputs. Unlike kernel operations such as linear or RBF kernels on hidden layer outputs, using correlation coefficients quantifies the similarity of hidden layer outputs with real numbers in $$(0,1]$$ and prevents the covariance matrix in the Gaussian process from becoming singular. Training through a Gaussian process results in ELM models that are insensitive to random initialization and can avoid over-fitting. We analyse the rationality of StaELM and show that existing kernel-based ELMs are special cases of StaELM. We used real-world datasets to train both regression and classification StaELM models. The experimental results show that StaELM models achieved higher accuracies in both regression and classification in comparison with traditional kernel-based ELMs. The StaELM models are more stable with respect to different random initializations and suffer less over-fitting. The training process of StaELM models is also faster.
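
As an illustration of a correlation-based similarity between two hidden-layer output vectors, a minimal sketch follows; the mapping of the correlation coefficient into $$(0,1]$$ used here is one hypothetical choice, not necessarily StaELM’s exact formula:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences, in [-1, 1]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (math.sqrt(vx) * math.sqrt(vy))

def similarity(x, y):
    """Map the correlation rho in [-1, 1] into (0, 1].

    exp(rho - 1) lies in [e^-2, 1], so the value is always strictly
    positive; this particular transform is a hypothetical illustration.
    """
    return math.exp(pearson(x, y) - 1.0)
```

Because the similarity is strictly positive, a Gram matrix built from it avoids the exact zeros that can make a covariance matrix singular, which is the property the abstract highlights.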

Yulin He, Joshua Zhexue Huang, Xizhao Wang, Rana Aamir Raza
Local Adaptive and Incremental Gaussian Mixture for Online Density Estimation

In this paper, we propose a local adaptive and incremental Gaussian mixture for online density estimation (LAIM). Using a similarity-threshold-based criterion, the method is able to allocate components incrementally to accommodate novel data points without affecting previously learned components. A local adaptive learning strategy is presented for estimating densities with complex structure in an online way. We also adopt a denoising scheme to make the algorithm more robust to noise. We compared LAIM to state-of-the-art methods for density estimation on both artificial and real data sets; the results show that our method outperforms the compared online counterparts and produces results comparable to those of the compared batch algorithms.

Tianyu Qiu, Furao Shen, Jinxi Zhao
Latent Space Tracking from Heterogeneous Data with an Application for Anomaly Detection

Streaming heterogeneous information is ubiquitous in the era of Big Data, and it provides versatile perspectives for a more comprehensive understanding of the behaviors of an underlying system or process. Human analysis of these volumes is infeasible, leading to unprecedented demand for mathematical tools that effectively parse and distill such data. However, the complicated nature of streaming heterogeneous data prevents conventional multivariate data analysis methods from being applied directly. In this paper, we propose a novel framework together with an online algorithm, denoted as $$\mathtt {LSTH}$$, for latent space tracking from heterogeneous data. Our method leverages the advantages of dimension reduction, correlation analysis and sparse learning to better reveal the latent relations among heterogeneous information and adapt to slow variations in streaming data. We applied our method on both synthetic and real data, and it achieves results competitive with or superior to the state-of-the-art in detecting several different types of anomalies.

Jiaji Huang, Xia Ning
A Learning-Rate Schedule for Stochastic Gradient Methods to Matrix Factorization

Stochastic gradient methods are effective for solving matrix factorization problems. However, it is well known that the performance of a stochastic gradient method highly depends on the learning-rate schedule used; a good schedule can significantly boost the training process. In this paper, motivated by past works on convex optimization that assign a learning rate to each variable, we propose a new schedule for matrix factorization. The experiments demonstrate that the proposed schedule leads to faster convergence than existing ones. Our schedule uses the same parameter on all data sets included in our experiments; that is, the time spent on learning-rate selection can be significantly reduced. By applying this schedule to a state-of-the-art matrix factorization package, the resulting implementation outperforms available parallel matrix factorization packages.
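
The per-variable idea from convex optimization that motivates such schedules can be sketched in an AdaGrad-like form, where each coordinate accumulates its own squared gradients and steps by eta/sqrt(accumulator). This is an illustration of the general idea, not the authors’ exact schedule:

```python
import math
import random

def mf_sgd_adagrad(ratings, n_users, n_items, k=2, eta=0.2, lam=0.02, epochs=500):
    """SGD for matrix factorization R ~ P Q^T with a per-variable learning rate.

    Illustrative sketch only: the paper's schedule differs in detail.
    ratings is a list of (user, item, value) triples.
    """
    rng = random.Random(0)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    GP = [[1e-8] * k for _ in range(n_users)]  # per-variable squared-gradient sums
    GQ = [[1e-8] * k for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                gp = -err * Q[i][f] + lam * P[u][f]
                gq = -err * P[u][f] + lam * Q[i][f]
                GP[u][f] += gp * gp
                GQ[i][f] += gq * gq
                # each variable gets its own decaying step size
                P[u][f] -= eta / math.sqrt(GP[u][f]) * gp
                Q[i][f] -= eta / math.sqrt(GQ[i][f]) * gq
    return P, Q
```

Variables that receive large or frequent gradient updates automatically get smaller steps, which removes the need to hand-tune a global decay rate per data set.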

Wei-Sheng Chin, Yong Zhuang, Yu-Chin Juan, Chih-Jen Lin

Applications

Frontmatter
On Damage Identification in Civil Structures Using Tensor Analysis

Structural health monitoring is a condition-based technology for monitoring infrastructure using sensing systems. In structural health monitoring, the data are usually highly redundant and correlated. The measured variables are not only correlated with each other at a given time but are also autocorrelated over time. Matrix-based two-way analysis, which is usually used in structural health monitoring, cannot capture all these relationships and correlations together. Tensor analysis allows us to analyse the vibration data in temporal, spatial and feature modes at the same time. In our approach, we use tensor analysis and a one-class support vector machine for damage detection, localization and estimation in an unsupervised manner. The method shows promising results using data from lab-based structures as well as data collected from the Sydney Harbour Bridge, one of the iconic structures in Australia. We obtain a damage detection accuracy of 0.98 or higher for all the data. Locations of damage were captured correctly, and different levels of damage severity were well estimated.

Nguyen Lu Dang Khoa, Bang Zhang, Yang Wang, Wei Liu, Fang Chen, Samir Mustapha, Peter Runcie
Predicting Smartphone Adoption in Social Networks

The recent advancements in online social networks and mobile devices have provided valuable data sources to track users’ smartphone adoption, i.e., the usage of smartphones over time. An incisive understanding of users’ smartphone adoption can benefit many useful applications, ranging from user behavior understanding to targeted marketing. This paper studies smartphone adoption prediction in social networks by leveraging the wisdom of the online world. A critical challenge along this line is to identify the key factors that underlie people’s adoption behaviors and to distinguish the relative contribution of each factor. Specifically, we model the final smartphone status of each user as the result of three influencing factors: the social influence factor, the homophily factor, and the personal factor. We further develop a supervised model that takes all three factors into account for smartphone adoption and at the same time learns the relative contribution of each factor from the data. Experimental results on a large real-world dataset demonstrate the effectiveness of our proposed model.

Le Wu, Yin Zhu, Nicholas Jing Yuan, Enhong Chen, Xing Xie, Yong Rui
Discovering the Impact of Urban Traffic Interventions Using Contrast Mining on Vehicle Trajectory Data

There is growing interest in using trajectory data of moving vehicles to analyze urban traffic and improve city planning. This paper presents a framework to assess the impact of traffic intervention measures, such as road closures, on the traffic network. Connected road segments with significantly different traffic levels before and after the intervention are discovered by computing the growth rate. Frequent sub-networks of the overall traffic network are then discovered to reveal the region that is most affected. The effectiveness and robustness of this framework are shown by three experiments using real taxi trajectories and traffic simulations in two different cities.

Xiaoting Wang, Christopher Leckie, Hairuo Xie, Tharshan Vaithianathan
Locating Self-Collection Points for Last-Mile Logistics Using Public Transport Data

Delivery failure and re-scheduling delay services and increase operating costs for logistics companies. Setting up self-collection points is an effective solution that is attracting attention from many companies. One challenge for this model is how to choose the locations of the self-collection points. In this work, we design a methodology for locating self-collection points. We consider both the distribution of a company’s potential customers and people’s gathering patterns in the city. We leverage citizens’ public transport riding records to simulate how crowds emerge at particular hours. We reasonably assume that a place near a crowd is more convenient for customers to collect parcels than a place far away. Based on this, we propose a kernel transformation method to re-evaluate the pairwise positions of customers, and then perform clustering.

Huayu Wu, Dongxu Shao, Wee Siong Ng
A Stochastic Framework for Solar Irradiance Forecasting Using Conditional Random Field

Solar irradiance volatility is a major concern in integrating solar energy micro-grids to the mainstream energy power grid. Accounting for such fluctuations is challenging even with supplier coordination and smart-grid structure implementation. Short-term solar irradiance forecasting is one of the crucial components for maintaining a constant and reliable power output. We propose a novel stochastic solar prediction framework using Conditional Random Fields. The proposed model utilizes features extracted from both cloud images taken by Total Sky Imagers and historical statistics to synergistically reduce the prediction error by 25-40% in terms of MAE in 1-5 minute forecast experiments over the baseline methods.

Jin Xu, Shinjae Yoo, Dantong Yu, Hao Huang, Dong Huang, John Heiser, Paul Kalb
Online Prediction of Chess Match Result

In this work we propose a framework for predicting the outcome of a chess match while the game is in progress. We make this prediction by examining the moves made by the players. For this purpose, we propose a novel ensemble-based learning technique where a profile-based segmentation is done on the training dataset, and one classifier is trained from each such segment. The ensemble of classifiers is then used to predict the outcome of new chess matches. When a new game is being played, this ensemble model is used to dynamically predict the probabilities of white winning, black winning, and drawing after every move. We have evaluated our system with different base learning techniques as well as with different types of features, and applied our technique on a large corpus of real chess matches, achieving higher prediction accuracies than traditional classification techniques. We have achieved prediction accuracies close to 66%, and most of the correct predictions were made nine or more moves before the game ended. We believe that this work will motivate the development of online prediction systems for other games, such as other board games and even some field games.

Mohammad M. Masud, Ameera Al-Shehhi, Eiman Al-Shamsi, Shamma Al-Hassani, Asmaa Al-Hamoudi, Latifur Khan
Learning of Performance Measures from Crowd-Sourced Data with Application to Ranking of Investments

Interestingness measures stand as a proxy for “real human interest,” but their effectiveness is rarely studied empirically due to the difficulty of obtaining ground-truth data. We propose a method based on learning-to-rank algorithms that enables pairwise rankings collected from domain community members to be used to learn a domain-specific measure. We apply this method to study the interestingness measures in finance, specifically, investment performance evaluation measures. More than 100 such measures have been proposed, with no way of knowing which most closely matches the preferences of domain users. We use crowd-sourcing to collect gold-standard truth from traders and quantitative analysts in the form of pairwise rankings of equity graphs. With these rankings, we evaluate the accuracy with which each measure predicts the user-preferred equity graph. We then learn a new investment performance measure which has higher test accuracy than the currently proposed measures, in particular the commonly used Sharpe ratio.

Greg Harris, Anand Panangadan, Viktor K. Prasanna
Hierarchical Dirichlet Process for Tracking Complex Topical Structure Evolution and Its Application to Autism Research Literature

In this paper we describe a novel framework for the discovery of the topical content of a data corpus, and the tracking of its complex structural changes across the temporal dimension. In contrast to previous work, our model does not impose a prior on the rate at which documents are added to the corpus, nor does it adopt the Markovian assumption, which overly restricts the type of changes that the model can capture. Our key technical contribution is a framework based on (i) discretization of time into epochs, (ii) epoch-wise topic discovery using a hierarchical Dirichlet process-based model, and (iii) a temporal similarity graph which allows for the modelling of complex topic changes: emergence and disappearance, evolution, splitting and merging. The power of the proposed framework is demonstrated on the medical literature corpus concerned with the autism spectrum disorder (ASD) – a research subject of growing social and healthcare importance. In addition to the collected ASD literature corpus, which we have made freely available, our contributions also include two free online tools we built as aids to ASD researchers. These can be used for semantically meaningful navigation and searching, as well as knowledge discovery from this large and rapidly growing corpus of literature.

Adham Beykikhoshk, Ognjen Arandjelović, Svetha Venkatesh, Dinh Phung
Automated Detection for Probable Homologous Foodborne Disease Outbreaks

Foodborne disease, a rapidly growing public health problem, has become the highest-priority topic in food safety. The threat of foodborne disease has stimulated interest in enhancing public health surveillance to detect outbreaks rapidly. To advance research on food risk assessment in China, the China National Center for Food Safety Risk Assessment (CFSA) sponsored a project, begun in October 2012, to construct an online correlation analysis system for foodborne disease surveillance. CFSA collects foodborne disease clinical data from sentinel hospitals across the country, aiming to analyze the foodborne disease outbreaks present in the collected data and ultimately find the links between pathogens, incriminated food sources and infected persons. Rapid detection of outbreaks is a critical first step for this analysis. The purpose of this paper is to provide approaches that can be applied in an online system to rapidly find local and sporadic foodborne disease outbreaks in the collected data. Specifically, we employ DBSCAN for local outbreak detection and solve the parameter self-adaptation problem in DBSCAN. We also propose a new approach named K-CPS (K-Means Clustering with Pattern Similarity) to detect sporadic outbreaks. The experimental results show that our methods are effective for rapidly mining local and sporadic outbreaks from the dataset.
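
As a minimal illustration of the local-outbreak step, the textbook DBSCAN procedure clusters case records that are close in, say, normalized location and onset-time coordinates; points in no dense region are flagged as noise. The self-adaptive parameter selection the paper contributes is not shown here, and eps/min_pts are fixed by hand:

```python
def dbscan(points, eps, min_pts):
    """Plain DBSCAN (Ester et al., 1996): returns one cluster label per point,
    with -1 for noise. O(n^2) sketch; the paper's system additionally
    self-adapts eps and min_pts, which is omitted here."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n = len(points)
    labels = [None] * n          # None = unvisited, -1 = noise
    cid = 0
    for p in range(n):
        if labels[p] is not None:
            continue
        neighbors = [q for q in range(n) if dist(points[p], points[q]) <= eps]
        if len(neighbors) < min_pts:
            labels[p] = -1       # provisional noise; may become a border point
            continue
        labels[p] = cid          # p is a core point: start a new cluster
        seeds = list(neighbors)
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cid  # noise reclaimed as a border point
            if labels[q] is not None:
                continue
            labels[q] = cid
            q_neighbors = [r for r in range(n) if dist(points[q], points[r]) <= eps]
            if len(q_neighbors) >= min_pts:
                seeds.extend(q_neighbors)  # expand only from core points
        cid += 1
    return labels
```

Two tight groups of cases separated by a distant isolated case would come out as two clusters plus one noise point, matching the intuition of local outbreaks versus sporadic cases.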

Xiao Xiao, Yong Ge, Yunchang Guo, Danhuai Guo, Yi Shen, Yuanchun Zhou, Jianhui Li
Identifying Hesitant and Interested Customers for Targeted Social Marketing

Social networks provide unparalleled opportunities for marketing products or services. Along this line, tremendous efforts have been devoted to the research of targeted social marketing, where the marketing efforts can be concentrated on a particular set of users with high utilities. Traditionally, these targeted users are identified based on their potential interest in the given company (product). However, social users are usually influenced simultaneously by multiple companies, and not only user interest but also these social influences contribute to user consumption behaviors. To that end, in this paper, we propose a general approach to identify the targeted users for social marketing, taking both user interests and multiple social influences into consideration. Specifically, we first formulate it as an Identifying Hesitant and Interested Customers (IHIC) problem, where we argue that these valuable users should have the best balance of influence entropy (being “Hesitant”) and utility scores (being “Interested”). Then, we design a novel framework and propose specific algorithms to solve this problem. Finally, extensive experiments on two real-world datasets validate the effectiveness and the efficiency of our proposed approach.

Guowei Ma, Qi Liu, Le Wu, Enhong Chen
Activity-Partner Recommendation

In many activities, such as watching movies or having dinner, people prefer to find partners before participation. Therefore, when recommending activity items (e.g., movie tickets) to users, it makes sense to also recommend suitable activity partners. This way, (i) the users save time for finding activity partners, (ii) the effectiveness of the item recommendation is increased (users may prefer activity items more if they can find suitable activity partners), (iii) recommender systems become more interesting and enkindle users’ social enthusiasm. In this paper, we identify the usefulness of suggesting activity partners together with items in recommender systems. In addition, we propose and compare several methods for activity-partner recommendation. Our study includes experiments that test the practical value of activity-partner recommendation and evaluate the effectiveness of all suggested methods as well as some alternative strategies.

Wenting Tu, David W. Cheung, Nikos Mamoulis, Min Yang, Ziyu Lu
Iterative Use of Weighted Voronoi Diagrams to Improve Scalability in Recommender Systems

The Collaborative Filtering (CF) technique is used by most Recommender Systems (RS) for formulating suggestions of items relevant to users’ interests. It typically associates a user with a community of like-minded users, and then recommends to the user items liked by others in the community. However, with the rapid growth of the Web in terms of users and items, the majority of RS using the CF technique suffer from the scalability problem. In order to address this scalability issue, we propose a decomposition-based Recommendation Algorithm using Multiplicatively Weighted Voronoi Diagrams. We divide the entire user space into smaller regions based on location, and then apply the Recommendation Algorithm separately to these regions. This helps us avoid computations over the entire data. We measure spatial autocorrelation indices in the regions or cells formed by the Voronoi decomposition. One of the main objectives of our work is to reduce the running time without compromising the recommendation quality much. This ensures scalability, allowing us to tackle bigger datasets using the same resources. We have tested our algorithms on the MovieLens and Book-Crossing datasets. Our proposed decomposition scheme is oblivious to the underlying recommendation algorithm.
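
In a multiplicatively weighted Voronoi diagram, a point belongs to the generator site minimizing the weighted distance d(p, s_i)/w_i, so a heavier site claims a larger cell. A minimal sketch of partitioning users by location follows; the site coordinates and weights are hypothetical, and the per-region recommendation step is not shown:

```python
def mw_voronoi_assign(points, sites, weights):
    """Assign each 2-D point to the multiplicatively weighted Voronoi cell
    whose generator minimizes dist(point, site) / weight.

    Sites and weights here are illustrative placeholders; in the paper's
    setting, each cell's users would then be fed to the recommender
    independently.
    """
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    assignment = []
    for p in points:
        best = min(range(len(sites)), key=lambda i: dist(p, sites[i]) / weights[i])
        assignment.append(best)
    return assignment
```

For example, with sites at (0, 0) and (10, 0) and weights 1 and 3, a user at (4, 0) is assigned to the second site (6/3 = 2 beats 4/1 = 4), even though the first site is geometrically closer.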

Joydeep Das, Subhashis Majumder, Debarshi Dutta, Prosenjit Gupta

Novel Methods and Algorithms

Frontmatter
Principal Sensitivity Analysis

We present a novel algorithm (Principal Sensitivity Analysis; PSA) to analyze the knowledge of the classifier obtained from supervised machine learning techniques. In particular, we define the principal sensitivity map (PSM) as the direction in the input space to which the trained classifier is most sensitive, and use the analogously defined $$k$$-th PSM to define a basis for the input space. We train neural networks with artificial data and real data, and apply the algorithm to the obtained supervised classifiers. We then visualize the PSMs to demonstrate the PSA’s ability to decompose the knowledge acquired by the trained classifiers.

Sotetsu Koyamada, Masanori Koyama, Ken Nakae, Shin Ishii
SocNL: Bayesian Label Propagation with Confidence

How can we predict Smith’s main hobby if we know the main hobbies of Smith’s friends? Can we measure the confidence in our prediction if we are given the main hobby of only a few of Smith’s friends? In this paper, we focus on how to estimate confidence in the node classification problem. Providing a confidence level for the classification problem is important because most nodes in real-world networks tend to have few neighbors, and thus, a small amount of evidence. Our contributions are three-fold: (a) novel algorithm; we propose a semi-supervised learning algorithm that converges fast and provides a confidence estimate; (b) theoretical analysis; we show the solid theoretical foundation of our algorithm and its connections to label propagation and Bayesian inference; (c) empirical analysis; we perform extensive experiments on three different real networks. Specifically, the experimental results demonstrate that our algorithm outperforms other algorithms on graphs with less smoothness and low label density.
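
The flavor of confidence-aware prediction can be sketched with a single Dirichlet-multinomial step over a node’s neighbor labels; this mirrors the Bayesian spirit described above, not the algorithm’s exact iterative update:

```python
def predict_with_confidence(neighbor_labels, n_classes, alpha=1.0):
    """Posterior-mean label distribution for a node given its neighbors'
    labels, under a symmetric Dirichlet(alpha) prior.

    With few neighbors the posterior stays close to uniform, so the
    reported confidence is low; hypothetical illustration, not SocNL's
    actual update rule.
    """
    counts = [0] * n_classes
    for y in neighbor_labels:
        counts[y] += 1
    total = sum(counts) + alpha * n_classes
    posterior = [(c + alpha) / total for c in counts]
    confidence = max(posterior)         # one simple confidence proxy
    label = posterior.index(confidence)
    return label, confidence
```

Four out of five neighbors voting for class 0 yields a higher confidence than a single neighbor voting for class 0, even though the predicted label is the same, which is exactly the behavior motivated in the abstract.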

Yuto Yamaguchi, Christos Faloutsos, Hiroyuki Kitagawa
An Incremental Local Distribution Network for Unsupervised Learning

We present an Incremental Local Distribution Network (ILDN) for unsupervised learning, which combines the merits of matrix learning and incremental learning. It stores local distribution information in each node with a covariance matrix and uses a vigilance parameter with statistical support to decide whether to extend the network. It has a statistics-based merging mechanism and thus can obtain a precise and concise representation of the learning data, called the relaxation representation. Moreover, a denoising process based on data density makes ILDN robust to noise and practically useful. Experiments on artificial and real-world data in both “closed” and “open-ended” environments show the better accuracy, conciseness, and efficiency of ILDN over other methods.

Youlu Xing, Tongyi Cao, Ke Zhou, Furao Shen, Jinxi Zhao
Trend-Based Citation Count Prediction for Research Articles

This paper aims to predict the future impact, measured by the citation count, of any paper of interest. While existing studies utilized features related to the paper content or publication information to do Citation Count Prediction (CCP), we propose to leverage the citation count trend of a paper and develop a Trend-based Citation Count Prediction (T-CCP) model. By observing the fluctuation of a paper’s citation count over time, we identify five typical citation trends: early burst, middle burst, late burst, multiple bursts, and no bursts. T-CCP first performs Citation Trend Classification (CTC) to detect the citation trend of a paper, and then learns the predictive function for each trend to predict the citation count. We investigate two categories of features for CCP, CTC, and T-CCP: publication features, including author, venue, expertise, social, and reinforcement features; and early citation behaviors, including citation statistical and structural features. Experiments conducted on the ArnetMiner citation dataset exhibit promising results: T-CCP outperforms CCP, and the proposed features are more effective than conventional ones.

Cheng-Te Li, Yu-Jen Lin, Rui Yan, Mi-Yen Yeh
Mining Text Enriched Heterogeneous Citation Networks

The paper presents an approach to mining text enriched heterogeneous information networks, applied to a task of categorizing papers from a large citation network of scientific publications in the field of psychology. The methodology performs network propositionalization by calculating structural context vectors from homogeneous networks, extracted from the original network. The classifier is constructed from a table of structural context vectors, enriched with the bag-of-words vectors calculated from individual paper abstracts. A series of experiments was performed to examine the impact of increasing the number of publications in the network, and adding different types of structural context vectors. The results indicate that increasing the network size and combining both types of information is beneficial for improving the accuracy of paper categorization.

Jan Kralj, Anita Valmarska, Marko Robnik-Šikonja, Nada Lavrač
Boosting via Approaching Optimal Margin Distribution

Margin distribution is crucial to AdaBoost. In this paper, we propose a new boosting method that utilizes the Emargin bound to approach the optimal margin distribution. We first define the $$k^*$$-optimization margin distribution, which has a sharper Emargin bound than that of AdaBoost. Then we present two boosting algorithms, KM-Boosting and MD-Boosting, both of which approximately approach the $$k^*$$-optimization margin distribution using the relation between the $$k$$th margin bound and the Emargin bound. Finally, we show that boosting on the $$k^*$$-optimization margin distribution is sound and efficient. In particular, MD-Boosting almost surely has a sharper bound than that of AdaBoost and needs only a little more computational cost, which means that MD-Boosting is effective in redundancy reduction without losing much accuracy.

Chuan Liu, Shizhong Liao
o-HETM: An Online Hierarchical Entity Topic Model for News Streams

Nowadays, with the development of the Internet, the large amount of continuously streaming news has become overwhelming to the public. Constructing a dynamic topic hierarchy that organizes news articles according to multi-grain topics can enable users to catch whatever they are interested in as soon as possible. However, this is nontrivial due to the streaming and time-sensitive characteristics of news data. In this paper, to address these challenges, we propose a Hierarchical Entity Topic Model (HETM) that considers the timeliness of news data and the importance of named entities in conveying the who/when/where information in news articles. In addition, we propose online HETM (o-HETM) by presenting a fast online inference algorithm for HETM to adapt it to streaming news. For better understanding of topics, we extract key sentences for each topic to form a summary. Extensive experimental results demonstrate that our model HETM significantly improves topic quality and time efficiency compared to the state-of-the-art method HLDA (Hierarchical Latent Dirichlet Allocation). In addition, our proposed o-HETM with its online inference algorithm further greatly improves time efficiency and thus is applicable to streaming news.

Linmei Hu, Juanzi Li, Jing Zhang, Chao Shao
Modeling User Interest and Community Interest in Microbloggings: An Integrated Approach

To explain why a user generates some observed content and behaviors, one has to determine the user’s topical interests as well as those of her community. However, most existing works on modeling microblogging users and their communities are based on either user-generated content or user behaviors, but not both. In this paper, we propose the Community and Personal Interest (CPI) model for modeling the interests of microblogging users jointly with those of their communities, using both content and behaviors. The CPI model also provides a common framework to accommodate multiple types of user behaviors. Unlike other models, CPI does not assume a hierarchical relationship between personal interest and community interest, i.e., that one is determined purely from the other. We build the CPI model on the principle that a user’s personal interest is different from that of her community. We further develop a regularization technique to bias the model toward learning more socially meaningful topics for each community. Our experiments on a Twitter dataset show that the CPI model outperforms other state-of-the-art models in topic learning and user classification tasks. We also demonstrate that the CPI model can effectively mine community interest through representative case examples.

Tuan-Anh Hoang
Minimal Jumping Emerging Patterns: Computation and Practical Assessment

Jumping Emerging Patterns (JEPs) are patterns that occur only in objects of a single class; a minimal JEP is a JEP none of whose proper subsets is a JEP. In this paper, an efficient method to mine the whole set of minimal JEPs is detailed and fully proven. Moreover, our method has a larger scope, since it is also able to compute the essential JEPs and the top-k minimal JEPs. We also extract minimal JEPs in which the absence of attributes is stated, and we show that this leads to the discovery of new valuable pieces of information. A performance study evaluates our approach, and the practical usefulness of minimal JEPs for designing rules that express correlations is demonstrated.

Bamba Kane, Bertrand Cuissart, Bruno Crémilleux
Rank Matrix Factorisation

We introduce the problem of rank matrix factorisation (RMF). That is, we consider the decomposition of a rank matrix, in which each row is a (partial or complete) ranking of all columns. Rank matrices naturally appear in many applications of interest, such as sports competitions. Summarising such a rank matrix by two smaller matrices, in which one contains partial rankings that can be interpreted as local patterns, is therefore an important problem.

After introducing the general problem, we consider a specific instance called Sparse RMF, in which we enforce the rank profiles to be sparse, i.e., to contain many zeroes. We propose a greedy algorithm for this problem based on integer linear programming. Experiments on both synthetic and real data demonstrate the potential of rank matrix factorisation.
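To make the notion of a rank matrix concrete, here is a small hypothetical sketch (the data are invented) that turns a score matrix from a sports setting into a rank matrix whose rows each rank all columns:

```python
import numpy as np

# Hypothetical race times of 4 athletes (columns) in 3 events (rows).
times = np.array([
    [12.1, 11.8, 12.5, 11.9],
    [48.3, 47.9, 49.0, 48.1],
    [21.7, 21.5, 22.0, 21.6],
])

# Each row of the rank matrix ranks all columns (1 = best, i.e. fastest).
# The double argsort converts values to their within-row rank positions.
rank_matrix = times.argsort(axis=1).argsort(axis=1) + 1
```

Here every row happens to produce the same ranking, so a single rank profile would summarise the whole matrix; RMF looks for exactly this kind of recurring local pattern in larger, noisier rank matrices.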

Thanh Le Van, Matthijs van Leeuwen, Siegfried Nijssen, Luc De Raedt
An Empirical Study of Personal Factors and Social Effects on Rating Prediction

In social networks, the link between a pair of friends has been reported to be effective in improving recommendation accuracy. Previous studies are mainly based on the assumption that any pair of friends have similar interests, minimizing the gap between a user’s taste and the average (or similar) taste of that user’s friends to reduce the rating-prediction error. However, these methods ignore the diversity of users’ tastes. In this paper, we focus on learning the diversity of a user’s taste and the effects of the user’s friends in terms of rating behavior. We propose a novel recommendation approach, namely Personal factors with Weighted Social effects Matrix Factorization (PWS), which utilizes both the user’s taste and social effects to provide recommendations. Experimental results on three datasets show the effectiveness of the proposed approach.
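The general idea of combining matrix factorization with social effects can be sketched as follows (a generic socially regularized MF baseline on invented data; this is not the authors' PWS model, which additionally weights social effects per friend):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 5, 2
# Hypothetical observed ratings: (user, item, rating).
ratings = [(0, 1, 4.0), (0, 2, 3.0), (1, 1, 5.0), (2, 3, 2.0), (3, 0, 4.0)]
# Hypothetical friendship lists per user.
friends = {0: [1], 1: [0, 2], 2: [1], 3: []}

U = rng.normal(scale=0.1, size=(n_users, k))  # user latent factors
V = rng.normal(scale=0.1, size=(n_items, k))  # item latent factors
lr, reg, social = 0.01, 0.05, 0.1

for _ in range(1000):
    # SGD on the squared rating-prediction error with L2 regularization.
    for u, i, r in ratings:
        err = r - U[u] @ V[i]
        U[u] += lr * (err * V[i] - reg * U[u])
        V[i] += lr * (err * U[u] - reg * V[i])
    # Social term: nudge each user's factors toward her friends' average.
    for u, fs in friends.items():
        if fs:
            U[u] += lr * social * (np.mean([U[f] for f in fs], axis=0) - U[u])

pred = U[0] @ V[1]  # predicted rating of user 0 for item 1
```

The `social` coefficient controls how strongly a user's latent taste is pulled toward that of her friends; setting it per-friend rather than globally is the kind of refinement the weighted-social-effects approach pursues.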

Zhijin Wang, Yan Yang, Qinmin Hu, Liang He
Backmatter
Metadata
Title
Advances in Knowledge Discovery and Data Mining
edited by
Tru Cao
Ee-Peng Lim
Zhi-Hua Zhou
Tu-Bao Ho
David Cheung
Hiroshi Motoda
Copyright year
2015
Electronic ISBN
978-3-319-18038-0
Print ISBN
978-3-319-18037-3
DOI
https://doi.org/10.1007/978-3-319-18038-0
