Skip to main content

2015 | Buch

Advances in Knowledge Discovery and Data Mining

19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II

herausgegeben von: Tru Cao, Ee-Peng Lim, Zhi-Hua Zhou, Tu-Bao Ho, David Cheung, Hiroshi Motoda

Verlag: Springer International Publishing

Buchreihe : Lecture Notes in Computer Science

insite
SUCHEN

Über dieses Buch

This two-volume set, LNAI 9077 + 9078, constitutes the refereed proceedings of the 19th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2015, held in Ho Chi Minh City, Vietnam, in May 2015.

The proceedings contain 117 paper carefully reviewed and selected from 405 submissions. They have been organized in topical sections named: social networks and social media; classification; machine learning; applications; novel methods and algorithms; opinion mining and sentiment analysis; clustering; outlier and anomaly detection; mining uncertain and imprecise data; mining temporal and spatial data; feature extraction and selection; mining heterogeneous, high-dimensional, and sequential data; entity resolution and topic-modeling; itemset and high-performance data mining; and recommendations.

Inhaltsverzeichnis

Frontmatter

Opinion Mining and Sentiment Analysis

Frontmatter
Emotion Cause Detection for Chinese Micro-Blogs Based on ECOCC Model

Micro-blog emotion mining and emotion cause extraction are essential in social network data mining. This paper presents a novel approach on Chinese micro-blog emotion cause detection based on the ECOCC model, focusing on mining factors for eliciting some kinds of emotions. In order to do so, the corresponding emotion causes are extracted. Moreover, the proportions of different cause components under different emotions are also calculated by means of combining the emotional lexicon with multiple characteristics (e.g., emoticon, punctuation, etc.). Experimental results show the feasibility of the approach. The proposed approaches have important scientific values on social network knowledge discovery and data mining.

Kai Gao, Hua Xu, Jiushuo Wang
Parallel Recursive Deep Model for Sentiment Analysis

Sentiment analysis has now become a popular research problem to tackle in Artificial Intelligence (AI) and Natural Language Processing (NLP) field. We introduce a novel Parallel Recursive Deep Model (PRDM) for predicting sentiment label distributions. The main trait of our model is to not only use the composition units, i.e., the vector of word, phrase and sentiment label with them, but also exploit the information encoded among the structure of sentiment label, by introducing a sentiment Recursive Neural Network (sentiment-RNN) together with RNTN. The two parallel neural networks together compose of our novel deep model structure, in which Sentiment-RNN and RNTN cooperate with each other. On predicting sentiment label distributions task, our model outperforms previous state of the art approaches on both full sentences level and phrases level by a large margin.

Changliang Li, Bo Xu, Gaowei Wu, Saike He, Guanhua Tian, Yujun Zhou
Sentiment Analysis in Transcribed Utterances

A single phone call can make or break a valuable customer-organization relationship. Maintaining good quality of service can lead to customer loyalty, which affects profitability. Traditionally, customer feedback is mainly collected by interviews, questionnaires, and surveys; the major drawback of these data collection methods is in their limited scale. The growing amount of research conducted in the field of sentiment analysis, combined with advances in text processing and Artificial Intelligence, has led us be the first to present an intelligent system for mining sentiment from transcribed utterances—wherein the noisiness property and short length poses extra challenges to sentiment analysis. Our aim is to detect and process affective factors from multiple layers of information, and study the effectiveness and robustness of each factor type independently, by proposing a tailored machine learning paradigm. Three types of factors are related to the textual content while two overlook it. Experiments are carried out on two datasets of transcribed phone conversations, obtained from real-world telecommunication companies.

Nir Ofek, Gilad Katz, Bracha Shapira, Yedidya Bar-Zev
Rating Entities and Aspects Using a Hierarchical Model

Opinion rating has been studied for a long time and recent work started to pay attention to topical aspects opinion rating, for example, the food quality, service, location and price of a restaurant. In this paper, we focus on predicting the overall and aspect rating of entities based on widely available on-line reviews. A novel hierarchical Bayesian generative method is developed for this task. It enables us to mine the overall and aspect ratings of both entity and its reviews at the same time. We conduct experiments on TripAdvisor and results show that we can predict entity-level and review-level overall ratings and aspect ratings well.

Xun Wang, Katsuhito Sudoh, Masaaki Nagata
Sentiment Analysis on Microblogging by Integrating Text and Image Features

Most studies about sentiment analysis on microblogging usually focus on the features mining from the text. This paper presents a new sentiment analysis method by combing features from text with features from image. Bigram model is applied in text feature extraction while color and texture information are extracted from images. Considering the sentiment classification, we propose a new neighborhood classier based on the similarity of two instances described by the fusion of text and features. Experimental results show that our proposed method can improve the performance significantly on Sina Weibo data (we collect and label the data). We find that our method can not only increasingly improve the F values of the classification comparing with only used text or images features, but also outperforms the NaiveBayes and SVM classifiers using all features with text and images.

Yaowen Zhang, Lin Shang, Xiuyi Jia
TSum4act: A Framework for Retrieving and Summarizing Actionable Tweets During a Disaster for Reaction

Social networks (e.g. Twitter) have been proved to be an almost real-time mean of information spread, thus they can be exploited as a valuable channel of information for emergencies (e.g. disasters) during which people need updated information for suitable reactions. In this paper, we present

TSum4act

, a framework designed to tackle the challenges of tweets (e.g. diversity, large volume, and noise) for disaster responses. The objective of the framework is to retrieve actionable tweets (e.g. casualties, cautions, and donations) that were posted during disasters. For this purpose, the framework first identifies informative tweets to remove noise; then assigns informative tweets into topics to preserve the diversity; next summarizes the topics to be compact; and finally ranks the results for user’s faster scan. In order to improve the performance, we proposed to incorporate event extraction for enriching the semantics of tweets.

TSum4act

has been successfully tested on Joplin tornado dataset of 230.535 tweets and the completeness of 0.58 outperformed 17%, of the retweet baseline’s.

Minh-Tien Nguyen, Asanobu Kitamoto, Tri-Thanh Nguyen

Clustering

Frontmatter
Evolving Chinese Restaurant Processes for Modeling Evolutionary Traces in Temporal Data

Due to the evolving nature of temporal data, clusters often exhibit complex dynamic patterns like birth and death. In particular, a cluster can branch into multiple clusters simultaneously. Intuitively, clusters can evolve as evolutionary trees over time. However, existing models are incapable of recovering the tree-like evolutionary trace in temporal data. To this end, we propose an Evolving Chinese Restaurant Process (ECRP), which is essentially a temporal non-parametric clustering model. ECRP incorporates dynamics of cluster number, parameters and popularity. ECRP allows each cluster to have multiple branches over time. We design an online learning framework based on Gibbs sampling to infer the evolutionary traces of clusters over time. In experiments, we validate that ECRP can capture tree-like evolutionary traces of clusters from real-world data sets and achieve better clustering results than the state-of-the-art methods.

Peng Wang, Chuan Zhou, Peng Zhang, Weiwei Feng, Li Guo, Binxing Fang
Small-Variance Asymptotics for Bayesian Nonparametric Models with Constraints

The users often have additional knowledge when Bayesian nonparametric models (BNP) are employed, e.g. for clustering there may be prior knowledge that some of the data instances should be in the same cluster (must-link constraint) or in different clusters (cannot-link constraint), and similarly for topic modeling some words should be grouped together or separately because of an underlying semantic. This can be achieved by imposing appropriate sampling probabilities based on such constraints. However, the traditional inference technique of BNP models via Gibbs sampling is time consuming and is not scalable for large data. Variational approximations are faster but many times they do not offer good solutions. Addressing this we present a small-variance asymptotic analysis of the MAP estimates of BNP models with constraints. We derive the objective function for Dirichlet process mixture model with constraints and devise a simple and efficient K-means type algorithm. We further extend the small-variance analysis to hierarchical BNP models with constraints and devise a similar simple objective function. Experiments on synthetic and real data sets demonstrate the efficiency and effectiveness of our algorithms.

Cheng Li, Santu Rana, Dinh Phung, Svetha Venkatesh
Spectral Clustering for Large-Scale Social Networks via a Pre-Coarsening Sampling based NystrÖm Method

Spectral clustering has exhibited a superior performance in analyzing the cluster structure of network. However, the exponentially computational complexity limits its application in analyzing large-scale social networks. To tackle this problem, many low-rank matrix approximating algorithms are proposed, of which the NystrÖm method is an approach with proved lower approximate errors. Currently, most existing sampling techniques for NystrÖm method are designed on affinity matrices, which are time-consuming to compute by some similarity metrics. Moreover, the social networks are often built on link relations, in which there is no information to construct an affinity matrix for the approximate computing of NystrÖm method for spectral clustering except for the degrees of nodes. This paper proposes a spectral clustering algorithm for large-scale social networks via a pre-coarsening sampling based NystrÖm method. By virtue of a novel triangle-based coarsening policy, the proposed algorithm first shrinks the social network into a smaller weighted network, and then does an efficient sampling for NystrÖm method to approximate the eigen-decomposition of matrix of spectral clustering. Experimental results on real large-scale social networks demonstrate that the proposed algorithm outperforms the state-of-the-art spectral clustering algorithms, which are realized by the existing sampling techniques based NystrÖm methods.

Ying Kang, Bo Yu, Weiping Wang, Dan Meng
pcStream: A Stream Clustering Algorithm for Dynamically Detecting and Managing Temporal Contexts

The clustering of unbounded data-streams is a difficult problem since the observed instances cannot be stored for future clustering decisions. Moreover, the probability distribution of streams tends to change over time, making it challenging to differentiate between a concept-drift and an anomaly. Although many excellent data-stream clustering algorithms have been proposed in the past, they are not suitable for capturing the temporal contexts of an entity.

In this paper, we propose pcStream; a novel data-stream clustering algorithm for dynamically detecting and managing sequential temporal contexts. pcStream takes into account the properties of sensor-fused data-streams in order to accurately infer the present concept, and dynamically detect new contexts as they occur. Moreover, the algorithm is capable of detecting point anomalies and can operate with high velocity data-streams. Lastly, we show in our evaluation that pcStream outperforms state-of-the-art stream clustering algorithms in detecting real world contexts from sensor-fused datasets. We also show how pcStream can be used as an analysis tool for contextual sensor streams.

Yisroel Mirsky, Bracha Shapira, Lior Rokach, Yuval Elovici
Clustering Over Data Streams Based on Growing Neural Gas

Clustering data streams requires a process capable of partitioning observations continuously with restrictions of memory and time. In this paper we present a new algorithm, called G-Stream, for clustering data streams by making one pass over the data. G-Stream is based on growing neural gas, that allows us to discover clusters of arbitrary shape without any assumptions on the number of clusters. By using a reservoir, and applying a fading function, the quality of clustering is improved. The performance of the proposed algorithm is evaluated on public data sets.

Mohammed Ghesmoune, Mustapha Lebbah, Hanene Azzag
Computing and Mining ClustCube Cubes Efficiently

A

novel computational paradigm for clustering complex database objects extracted from distributed database settings via well-understood OLAP technology

is proposed and experimentally assessed in this paper. This paradigm conveys in the so-called ClustCube cubes, which define

a novel multidimensional data cube model

according to which (data) cubes store

clustered complex database objects

rather than conventional SQL-based aggregations. A major contribution of this research is represented by effective and efficient algorithms for computing ClustCube cubes that, surprisingly, are capable of reducing computational efforts significantly with respect to traditional approaches. Our analytical contribution is completed by a comprehensive assessment of proposed algorithms against both benchmark and real-life data sets, which clearly confirms the benefits deriving from our proposal.

Alfredo Cuzzocrea

Outlier and Anomaly Detection

Frontmatter
Contextual Anomaly Detection Using Log-Linear Tensor Factorization

This paper presents a novel approach for the detection of contextual anomalies. This approach, based on log-linear tensor factorization, considers a stream of discrete events, each representing the co-occurence of contextual elements, and detects events with low-probability. A parametric model is used to learn the joint probability of contextual elements, in which the parameters are the factors of the event tensor. An efficient method, based on Nesterov’s accelerated gradient ascent, is proposed to learn these parameters. The proposed approach is evaluated on the low-rank approximation of tensors, the prediction of future of events and the detection of events representing abnormal behaviors. Results show our method to outperform state of the art approaches for these problems.

Alpa Jayesh Shah, Christian Desrosiers, Robert Sabourin
A Semi-Supervised Framework for Social Spammer Detection

Spammers create large number of compromised or fake accounts to disseminate harmful information in social networks like Twitter. Identifying social spammers has become a challenging problem. Most of existing algorithms for social spammer detection are based on supervised learning, which needs a large amount of labeled data for training. However, labeling sufficient training set costs too much resources, which makes supervised learning impractical for social spammer detection. In this paper, we propose a semi-supervised framework for social spammer detection(SSSD), which combines the supervised classification model with a ranking scheme on the social graph. First, we train an original classifier with a small number of labeled data. Second, we propose a ranking model to propagate trust and distrust on the social graph. Third, we select confident users that are judged by the classifier and ranking scores as new training data and retrain the classifier. We repeat the all steps above until the classifier cannot be refined any more. Experimental results show that our framework can effectively detect social spammers in the condition of lacking sufficient labeled data.

Zhaoxing Li, Xianchao Zhang, Hua Shen, Wenxin Liang, Zengyou He
Fast One-Class Support Vector Machine for Novelty Detection

Novelty detection arises as an important learning task in several applications. Kernel-based approach to novelty detection has been widely used due to its theoretical rigor and elegance of geometric interpretation. However, computational complexity is a major obstacle in this approach. In this paper, leveraging on the cutting-plane framework with the well-known One-Class Support Vector Machine, we present a new solution that can scale up seamlessly with data. The first solution is exact and linear when viewed through the cutting-plane; the second employed a sampling strategy that remarkably has a constant computational complexity defined relatively to the probability of approximation accuracy. Several datasets are benchmarked to demonstrate the credibility of our framework.

Trung Le, Dinh Phung, Khanh Nguyen, Svetha Venkatesh
ND-Sync: Detecting Synchronized Fraud Activities

Given the retweeting activity for the posts of several Twitter users, how can we distinguish organic activity from spammy retweets by paid followers to boost a post’s appearance of popularity? More generally, given groups of observations, can we spot strange groups? Our main intuition is that organic behavior has more variability, while fraudulent behavior, like retweets by botnet members, is more synchronized. We refer to the detection of such

synchronized

observations as the

Synchonization Fraud

problem, and we study a specific instance of it,

Retweet Fraud Detection

, manifested in Twitter. Here, we propose: (A)

ND-Sync

, an efficient method for detecting

group fraud

, and (B) a set of carefully designed features for characterizing retweet threads.

ND-Sync

is

effective

in spotting retweet fraudsters,

robust

to different types of abnormal activity, and

adaptable

as it can easily incorporate additional features. Our method achieves a 97% accuracy on a real dataset of 12 million retweets crawled from Twitter.

Maria Giatsoglou, Despoina Chatzakou, Neil Shah, Alex Beutel, Christos Faloutsos, Athena Vakali
An Embedding Scheme for Detecting Anomalous Block Structured Graphs

Graph-based anomaly detection plays a vital role in various application domains such as network intrusion detection, social network analysis and road traffic monitoring. Although these evolving networks impose a curse of dimensionality on the learning models, they usually contain structural properties that anomaly detection schemes can exploit. The major challenge is finding a feature extraction technique that preserves graph structure while balancing the accuracy of the model against its scalability. We propose the use of a scalable technique known as random projection as a method for structure aware embedding, which extracts relational properties of the network, and present an analytical proof of this claim. We also analyze the effect of embedding on the accuracy of one-class support vector machines for anomaly detection on real and synthetic datasets. We demonstrate that the embedding can be effective in terms of scalability without detrimental influence on the accuracy of the learned model.

Lida Rashidi, Sutharshan Rajasegarar, Christopher Leckie
A Core-Attach Based Method for Identifying Protein Complexes in Dynamic PPI Networks

Indentifying protein complexes is essential to understanding the principles of cellular systems. Many computational methods have been developed to identify protein complexes in static protein-protein interaction (PPI) network. However, PPI network changes over time, the important dynamics within PPI network is overlooked by these methods. Therefore, discovering complexes in dynamic PPI networks (DPN) is important. DPN contains a series of time-sequenced subnetworks which represent PPI at different time points of cell cycle. In this paper, we propose a dynamic core-attachment algorithm (DCA) to discover protein complexes in DPN. Based on core-attachment assumption, we first detect cores which are small, dense subgraphs and frequently active in the DPN, and then we form complexes by adding short-lived attachments to cores. We apply our DCA to the data of S.cerevisiae and the experimental result shows that DCA outperforms six other complex discovery algorithms, moreover, it reveals that our DCA not only provides dynamic information but also discovers more accurate protein complexes.

Jiawei Luo, Chengchen Liu, Hoang Tu Nguyen

Mining Uncertain and Imprecise Data

Frontmatter
Mining Uncertain Sequential Patterns in Iterative MapReduce

This paper proposes a sequential pattern mining (SPM) algorithm in large scale uncertain databases. Uncertain sequence databases are widely used to model inaccurate or imprecise timestamped data in many real applications, where traditional SPM algorithms are inapplicable because of data uncertainty and scalability. In this paper, we develop an efficient approach to manage data uncertainty in SPM and design an iterative MapReduce framework to execute the uncertain SPM algorithm in parallel. We conduct extensive experiments in both synthetic and real uncertain datasets. And the experimental results prove that our algorithm is efficient and scalable.

Jiaqi Ge, Yuni Xia, Jian Wang
Quality Control for Crowdsourced POI Collection

Crowdsourcing allows human intelligence tasks to be outsourced to a large number of unspecified people at low costs. However, because of the uneven ability and diligence of crowd workers, the quality of their submitted work is also uneven and sometimes quite low. Therefore, quality control is one of the central issues in crowdsourcing research. In this paper, we consider a quality control problem of POI (points of interest) collection tasks, in which workers are asked to enumerate location information of POIs. Since workers neither necessarily provide correct answers nor provide exactly the same answers even if the answers indicate the same place, we propose a two-stage quality control method consisting of an answer clustering stage and a reliability estimation stage. Implemented with a new constrained exemplar clustering and a modified HITS algorithm, the effectiveness of our method is demonstrated as compared to baseline methods on several real crowdsourcing datasets.

Shunsuke Kajimura, Yukino Baba, Hiroshi Kajino, Hisashi Kashima
Towards Efficient Sequential Pattern Mining in Temporal Uncertain Databases

Uncertain sequence databases are widely used to model data with inaccurate or imprecise timestamps in many real world applications. In this paper, we use uniform distributions to model uncertain timestamps and adopt possible world semantics to interpret temporal uncertain database. We design an incremental approach to manage temporal uncertainty efficiently, which is integrated into the classic pattern-growth SPM algorithm to mine uncertain sequential patterns. Extensive experiments prove that our algorithm performs well in both efficiency and scalability.

Jiaqi Ge, Yuni Xia, Jian Wang
Preference-Based Top-k Representative Skyline Queries on Uncertain Databases

Top-

k

representative skyline queries are important for multi-criteria decision making applications since they provide an intuitive way to identify the

k

most significant objects for data analysts. Despite their importance, top-

k

representative skyline queries have not received adequate attention from the research community. Existing work addressing the problem focuses only on certain data models. For this reason, in this paper, we present the first study on processing top-

k

representative skyline queries in uncertain databases, based on user-defined references, regarding the priority of individual dimensions. We also apply the odds ratio to restrict the cardinality of the result set, instead of using a threshold which might be difficult for an end-user to define. We then develop two novel algorithms for answering top-

k

representative skyline queries on uncertain data. In addition, several pruning conditions are proposed to enhance the efficiency of our proposed algorithms. Performance evaluations are conducted on both real-life and synthetic datasets to demonstrate the efficiency, effectiveness and scalability of our proposed approaches.

Ha Thanh Huynh Nguyen, Jinli Cao
Cluster Sequence Mining: Causal Inference with Time and Space Proximity Under Uncertainty

We propose a pattern mining algorithm for numerical multidimensional event sequences, called cluster sequence mining (CSM). CSM extracts patterns with a pair of clusters that satisfies space proximity of the individual clusters and time proximity in time intervals between events from different clusters. CSM is an extension of a unique algorithm (co-occurrence cluster mining (CCM)), considering the order of events and the distribution of time intervals. The probability density of the time intervals is inferred by utilizing Bayesian inference for robustness against uncertainty. In an experiment using synthetic data, we confirmed that CSM is capable of extracting clusters with high F-measure and low estimation error of the time interval distribution even under uncertainty. CSM was applied to an earthquake event sequence in Japan after the 2011 Tohoku Earthquake to infer causality of earthquake occurrences. The results demonstrate that CSM suggests some high affecting/affected areas in the subduction zone farther away from the main shock of the Tohoku Earthquake.

Yoshiyuki Okada, Ken-ichi Fukui, Koichi Moriyama, Masayuki Numao
Achieving Accuracy Guarantee for Answering Batch Queries with Differential Privacy

In this paper, we develop a novel strategy for the privacy budget allocation on answering a batch of queries for statistical databases under differential privacy framework. Under such a strategy, the noisy results are more meaningful and achieve better utility of the dataset. In particular, we first formulate the privacy allocation as an optimization problem. Then derive explicit approximation of the relationships among privacy budget, dataset size and confidence interval. Based on the derived formulas, one can automatically determine optimal privacy budget allocation for batch queries with the given accuracy requirements. Extensive experiments across a synthetic dataset and a real dataset are conducted to demonstrate the effectiveness of the proposed approach.

Dong Huang, Shuguo Han, Xiaoli Li

Mining Temporal and Spatial Data

Frontmatter
Automated Classification of Passing in Football

A knowledgeable observer of a game of football (soccer) can make a subjective evaluation of the quality of passes made between players during the game. In this paper we consider the problem of producing an automated system to make the same evaluation of passes. We present a model that constructs numerical predictor variables from spatiotemporal match data using feature functions based on methods from computational geometry, and then learns a classification function from labelled examples of the predictor variables. In addition, we show that the predictor variables computed using methods from computational geometry are among the most important to the learned classifiers.

Michael Horton, Joachim Gudmundsson, Sanjay Chawla, Joël Estephan
Stabilizing Sparse Cox Model Using Statistic and Semantic Structures in Electronic Medical Records

Stability in clinical prediction models is crucial for transferability between studies, yet has received little attention. The problem is paramount in high dimensional data, which invites sparse models with feature selection capability. We introduce an effective method to stabilize sparse Cox model of time-to-events using statistical and semantic structures inherent in Electronic Medical Records (EMR). Model estimation is stabilized using three feature graphs built from (i) Jaccard similarity among features (ii) aggregation of Jaccard similarity graph and a recently introduced semantic EMR graph (iii) Jaccard similarity among features transferred from a related cohort. Our experiments are conducted on two real world hospital datasets: a heart failure cohort and a diabetes cohort. On two stability measures – the Consistency index and signal-to-noise ratio (SNR) – the use of our proposed methods significantly increased feature stability when compared with the baselines.

Shivapratap Gopakumar, Tu Dinh Nguyen, Truyen Tran, Dinh Phung, Svetha Venkatesh
Predicting Next Locations with Object Clustering and Trajectory Clustering

Next location prediction is of great importance for many location based applications. In many cases, understanding the similarity between objects and the similarity between trajectories may lead to more accurate predictions. In this paper, we propose two novel models exploiting these two types of similarities respectively. The first model, named

object-clustered Markov model

(object-MM)

, first clusters similar objects based on their spatial localities, and then builds variable-order Markov models with the trajectories of objects in the same cluster. The second model, named

trajectory-clustered Markov model

(tra-MM)

, considers the similarity between trajectories, and clusters the trajectories to form the training set used in building the Markov models. The two models are integrated to produce the final next location predictor

(objectTra-MM)

. Experiments based on a real data set demonstrate significant increase in prediction accuracy of

objectTra-MM

over existing methods.

Meng Chen, Yang Liu, Xiaohui Yu
A Plane Moving Average Algorithm for Short-Term Traffic Flow Prediction

In this paper, a plane moving average algorithm is proposed for solving the urban road flow forecasting problem. This new approach assembles information from relevant traffic time series and has the following advantages: (1) it integrates both individual and similar flow patterns in making prediction, (2) the training data set does not need to be large, (3) it has more generalization capabilities in predicting unpredictable and much complex urban traffic flow than previously used methods. To assess the new model, we have performed extensive experiments on a real data set, and the results give evidence of its superiority over existing methods.

Lei Lv, Meng Chen, Yang Liu, Xiaohui Yu
Recommending Profitable Taxi Travel Routes Based on Big Taxi Trajectories Data

Recommending routes with the shortest cruising distance based on big taxi trajectories is an active research topic. In this paper, we first introduce a temporal probability grid network generated from the taxi trajectories, then a profitable route recommendation algorithm called Adaptive Shortest Expected Cruising Route (ASECR) algorithm is proposed. ASECR recommends profitable routes based on assigned potential profitable grids and updates the profitable route constantly based on taxis’ movements as well as utilizing the temporal probability grid network dynamically. To handle the big trajectory data and improve the efficiency of updating route constantly, a data structure kdS-tree is proposed and implemented for ASECR. The experiments on two real taxi trajectory datasets demonstrate the effectiveness and efficiency of the proposed algorithm.

Wenxin Yang, Xin Wang, Seyyed Mohammadreza Rahimi, Jun Luo
Semi Supervised Adaptive Framework for Classifying Evolving Data Stream

Most of the approaches for classifying evolving data stream divide the stream into fixed size chunks to address infinite length and concept drift problems. These approaches suffer from trade-off between performance and sensitivity. To address this problem, existing adaptive sliding window techniques determine chunk boundaries dynamically by detecting changes in classifier error rate which requires true labels for all of the data instances. However, true labels are scarce and often delayed in reality. In this paper, we propose an approach which determines dynamic chunk boundaries by detecting significant changes in classifier confidence scores using only limited number of labeled data instances. Moreover, we integrate suitable classification technique with it to propose a complete semi supervised framework which uses dynamic chunk boundaries to address concept drift and concept evolution efficiently. Results from the experiments using benchmark data sets show the effectiveness of our proposed framework in terms of handling both concept drift and concept evolution.

Ahsanul Haque, Latifur Khan, Michael Baron

Mining Temp oral and Spatial Data

Cost-Sensitive Feature Selection on Heterogeneous Data

Evaluation functions, used to measure the quality of features, have great influence on the feature selection algorithms in areas of data mining and knowledge discovery. However, the existing evaluation functions are often inadequately measured candidate features on cost-sensitive heterogeneous data. To address this problem, an entropy-based evaluation function is firstly proposed for measuring the uncertainty for heterogeneous data. To further evaluate the quality of candidate features, we propose a multi-criteria based evaluation function, which attempts to find candidate features with the minimal total costs and the same information as the whole feature set. On this basis, a cost-sensitive feature selection algorithm on heterogeneous data is developed. Compared with the existing feature selection algorithms, the experimental results show that the proposed algorithm is more efficient to find a subset of features without losing the classification performance.

Wenbin Qian, Wenhao Shu, Jun Yang, Yinglong Wang
A Feature Extraction Method for Multivariate Time Series Classification Using Temporal Patterns

Multiple variables and high dimensions are two main challenges for classification of Multivariate Time Series (MTS) data. In order to overcome these challenges, feature extraction should be performed before performing classification. However, the existing feature extraction methods lose the important correlations among the variables while reducing high dimensions of MTS. Hence, in this paper, we propose a new feature extraction method combined with different classifiers to provide a general classification strategy for MTS data which can be applied for different area problems of MTS data. The proposed algorithm can handle data of high feature dimensions efficiently with unequal length and discover the relationship within the same and between different component univariate time series for MTS data. Hence, the proposed feature extraction method is application-independent and therefore does not depend on domain knowledge of relevant features or assumption about underling data models. We evaluate the algorithm on one synthetic dataset and two real-world datasets. The comparison experimental result shows that the proposed algorithm can achieve higher classification accuracy and F-measure value.

Pei-Yuan Zhou, Keith C. C. Chan
Scalable Outlying-Inlying Aspects Discovery via Feature Ranking

In outlying aspects mining, given a query object, we aim to answer the question as to what features make the query most outlying. The most recent works tackle this problem using two different strategies. (i) Feature selection approaches select the features that best distinguish the two classes: the query point vs. the rest of the data. (ii) Score-and-search approaches define an outlyingness score, then search for subspaces in which the query point exhibits the best score. In this paper, we first present an insightful theoretical result connecting the two types of approaches. Second, we present

OARank

– a hybrid framework that leverages the efficiency of feature selection based approaches and the effectiveness and versatility of score-and-search based methods. Our proposed approach is orders of magnitudes faster than previously proposed score-and-search based approaches while being slightly more effective, making it suitable for mining large data sets.

Nguyen Xuan Vinh, Jeffrey Chan, James Bailey, Christopher Leckie, Kotagiri Ramamohanarao, Jian Pei
A DC Programming Approach for Sparse Optimal Scoring

We consider the supervised classification problem in the high-dimensional setting. High-dimensionality makes the application of most classification difficult. We present a novel approach to the sparse linear discriminant analysis (LDA) based on its optimal scoring interpretation and the zero-norm. The difficulty in treating the zero-norm is overcome by using an appropriate continuous approximation such that the resulting problem can be formulated as a DC (Difference of Convex functions) program to which DCA (DC Algorithms) is investigated. The computational results on both simulated data and real microarray cancer data show the efficiency of the proposed algorithm in feature selection as well as classification.

Hoai An Le Thi, Duy Nhat Phan
Graph Based Relational Features for Collective Classification

Statistical Relational Learning (SRL) methods have shown that classification accuracy can be improved by integrating relations between samples. Techniques such as

iterative classification

or

relaxation labeling

achieve this by propagating information between related samples during the inference process. When only a few samples are labeled and connections between samples are sparse,

collective inference

methods have shown large improvements over

standard feature-based

ML methods. However, in contrast to feature based ML, collective inference methods require complex inference procedures and often depend on the strong assumption of label consistency among related samples. In this paper, we introduce new

relational features

for standard ML methods by extracting information from

direct

and

indirect relations

. We show empirically on three standard benchmark datasets that our relational features yield results comparable to collective inference methods. Finally we show that our proposal outperforms these methods when additional information is available.

Immanuel Bayer, Uwe Nagel, Steffen Rendle
A New Feature Sampling Method in Random Forests for Predicting High-Dimensional Data

Random Forests (RF) models have been proven to perform well in both classification and regression. However, with the randomizing mechanism in both bagging samples and feature selection, the performance of RF can deteriorate when applied to high-dimensional data. In this paper, we propose a new approach for feature sampling for RF to deal with high-dimensional data. We first apply

$$p$$

-value to assess the feature importance on finding a cut-off between informative and less informative features. The set of informative features is then further partitioned into two groups, highly informative and informative features, using some statistical measures. When sampling the feature subspace for learning RFs, features from the three groups are taken into account. The new subspace sampling method maintains the diversity and the randomness of the forest and enables one to generate trees with a lower prediction error. In addition, quantile regression is employed to obtain predictions in the regression problem for a robustness towards outliers. The experimental results demonstrated that the proposed approach for learning random forests significantly reduced prediction errors and outperformed most existing random forests when dealing with high-dimensional data.

Thanh-Tung Nguyen, He Zhao, Joshua Zhexue Huang, Thuy Thi Nguyen, Mark Junjie Li

Mining Heterogeneous, High Dimensional, and Sequential Data

Frontmatter
Seamlessly Integrating Effective Links with Attributes for Networked Data Classification

Networked data is emerging with great amount in various fields like social networks, biological networks, research publication networks, etc. Networked data classification is therefore of critical importance in real world, and it is noticed that link information can help improve learning performance. However, classification of such networked data can be challenging since: 1) the original links (also referred as relations) in such networks, are always sparse, incomplete and noisy; 2) it is not easy to characterize, select and leverage effective link information from the networks, involving multiple types of links with distinct semantics; 3) it is difficult to seamlessly integrate link information with attribute information in a network. To address these limitations, in this paper we develop a novel Seamlessly-integrated Link-Attribute Collective Matrix Factorization (SLA-CMF) framework, which mines highly effective link information given arbitrary information network and leverages it with attribute information in a unified perspective. Algorithmwise, SLA-CMF first mines highly effective link information via link path weighting and link strength learning. Then it learns a low-dimension link-attribute joint representation via graph Laplacian CMF. Finally the joint representation is put into a traditional classifier such as SVM for classification. Extensive experiments on benchmark datasets demonstrate the effectiveness of our method.

Yangyang Zhao, Zhengya Sun, Changsheng Xu, Hongwei Hao
Clustering on Multi-source Incomplete Data via Tensor Modeling and Factorization

With advances in data collection technologies, multiple data sources are assuming increasing prominence in many applications. Clustering from multiple data sources has emerged as a topic of critical significance in the data mining and machine learning community. Different data sources provide different levels of necessarily detailed knowledge. Thus, combining multiple data sources is pivotal to facilitate the clustering process. However, in reality, the data usually exhibits heterogeneity and incompleteness. The key challenge is how to effectively integrate information from multiple heterogeneous sources in the presence of missing data. Conventional methods mainly focus on clustering heterogeneous data with full information in all sources or at least one source without missing values. In this paper, we propose a more general framework T-MIC (

T

ensor based

M

ulti-source

I

ncomplete data

C

lustering) to integrate multiple incomplete data sources. Specifically, we first use the kernel matrices to form an initial tensor across all the multiple sources. Then we formulate a joint tensor factorization process with the sparsity constraint and use it to iteratively push the initial tensor towards a quality-driven exploration of the latent factors by taking into account missing data uncertainty. Finally, these factors serve as features to clustering. Extensive experiments on both synthetic and real datasets demonstrate that our proposed approach can effectively boost clustering performance, even with large amounts of missing data.

Weixiang Shao, Lifang He, Philip S. Yu
Locally Optimized Hashing for Nearest Neighbor Search

Fast nearest neighbor search (NNS) is becoming important to utilize massive data. Recent work shows that hash learning is effective for NNS in terms of computational time and space. Existing hash learning methods try to convert neighboring samples to similar binary codes, and their hash functions are globally optimized on the data manifold. However, such hash functions often have low resolution of binary codes; each bucket, a set of samples with same binary code, may contain a large number of samples in these methods, which makes it infeasible to obtain the nearest neighbors of given query with high precision. As a result, existing methods require long binary codes for precise NNS. In this paper, we propose

Locally Optimized Hashing

to overcome this drawback, which explicitly partitions each bucket by solving optimization problem based on that of Spectral Hashing with stronger constraints. Our method outperforms existing methods in image and document datasets in terms of quality of both the hash table and query, especially when the code length is short.

Seiya Tokui, Issei Sato, Hiroshi Nakagawa
Do-Rank: DCG Optimization for Learning-to-Rank in Tag-Based Item Recommendation Systems

Discounted Cumulative Gain (DCG) is a well-known ranking evaluation measure for models built with multiple relevance graded data. By handling tagging data used in recommendation systems as an ordinal relevance set of

$$ \left\{ {negative,null,positive} \right\} $$

, we propose to build a DCG based recommendation model. We present an efficient and novel learning-to-rank method by optimizing DCG for a recommendation model using the tagging data interpretation scheme. Evaluating the proposed method on real-world datasets, we demonstrate that the method is scalable and outperforms the benchmarking methods by generating a quality top-

N

item recommendation list.

Noor Ifada, Richi Nayak
Efficient Discovery of Recurrent Routine Behaviours in Smart Meter Time Series by Growing Subsequences

Data mining techniques have been developed to automatically learn consumption behaviours of households from smart meter data. In this paper, recurrent routine behaviours are introduced to characterize regular consumption activities in smart meter time series. A novel algorithm is proposed to efficiently discover recurrent routine behaviours in smart meter time series by growing subsequences. We evaluate the proposed algorithm on synthetic data and demonstrate the recurrent routine behaviours extracted on a real-world dataset from the city of Kalgoorlie-Boulder in Western Australia.

Jin Wang, Rachel Cardell-Oliver, Wei Liu
Convolutional Nonlinear Neighbourhood Components Analysis for Time Series Classification

During last decade, tremendous efforts have been devoted to the research of time series classification. Indeed, many previous works suggested that the simple nearest-neighbor classification is effective and difficult to beat. However, we usually need to determine the distance metric (e.g., Euclidean distance and Dynamic Time Warping) for different domains, and current evidence shows that there is no distance metric that is best for all time series data. Thus, the choice of distance metric has to be done empirically, which is time expensive and not always effective. To automatically determine the distance metric, in this paper, we investigate the distance metric learning and propose a novel Convolutional Nonlinear Neighbourhood Components Analysis model for time series classification. Specifically, our model performs supervised learning to project original time series into a transformed space. When classifying, nearest neighbor classifier is then performed in this transformed space. Finally, comprehensive experimental results demonstrate that our model can improve the classification accuracy to some extent, which indicates that it can learn a good distance metric.

Yi Zheng, Qi Liu, Enhong Chen, J. Leon Zhao, Liang He, Guangyi Lv

Entity Resolution and Topic Modelling

Frontmatter
Clustering-Based Scalable Indexing for Multi-party Privacy-Preserving Record Linkage

The identification of common sets of records in multiple databases has become an increasingly important subject in many application areas, including banking, health, and national security. Often privacy concerns and regulations prevent the owners of the databases from sharing any sensitive details of their records with each other, and with any other party. The linkage of records in multiple databases while preserving privacy and confidentiality is an emerging research discipline known as privacy-preserving record linkage (PPRL). We propose a novel two-step indexing (blocking) approach for PPRL between multiple (more than two) parties. First, we generate small mini-blocks using a multi-bit Bloom filter splitting method and second we merge these mini-blocks based on their similarity using a novel hierarchical canopy clustering technique. An empirical study conducted with large datasets of up-to one million records shows that our approach is scalable with the size of the datasets and the number of parties, while providing better privacy than previous multi-party indexing approaches.

Thilina Ranbaduge, Dinusha Vatsalan, Peter Christen
Efficient Interactive Training Selection for Large-Scale Entity Resolution

Entity resolution (ER) has wide-spread applications in many areas, including e-commerce, health-care, the social sciences, and crime and fraud detection. A crucial step in ER is the accurate classification of pairs of records into matches (assumed to refer to the same entity) and non-matches (assumed to refer to different entities). In most practical ER applications it is difficult and costly to obtain training data of high quality and enough size, which impedes the learning of an ER classifier. We tackle this problem using an interactive learning algorithm that exploits the cluster structure in similarity vectors calculated from compared record pairs. We select informative training examples to assess the purity of clusters, and recursively split clusters until clusters pure enough for training are found. We consider two aspects of active learning that are significant in practical applications: a limited budget for the number of manual classifications that can be done, and a noisy oracle where manual labeling might be incorrect. Experiments using several real data sets show that manual labeling efforts can be significantly reduced for training an ER classifier without compromising matching quality.

Qing Wang, Dinusha Vatsalan, Peter Christen
Unsupervised Blocking Key Selection for Real-Time Entity Resolution

Real-time entity resolution (ER) is the process of matching query records in sub-second time with records in a database that represent the same real-world entity. Indexing is a major step in the ER process, aimed at reducing the search space by bringing similar records closer to each other using a blocking key criterion. Selecting these keys is crucial for the effectiveness and efficiency of the real-time ER process. Traditional indexing techniques require domain knowledge for optimal key selection. However, to make the ER process less dependent on human domain knowledge, automatic selection of optimal blocking keys is required. In this paper we propose an unsupervised learning technique that automatically selects optimal blocking keys for building indexes that can be used in real-time ER. We specifically learn multiple keys to be used with multi-pass sorted neighbourhood, one of the most efficient and widely used indexing techniques for ER. We evaluate the proposed approach using three real-world data sets, and compare it with an existing automatic blocking key selection technique. The results show that our approach learns optimal blocking/sorting keys that are suitable for real-time ER. The learnt keys significantly increase the efficiency of query matching while maintaining the quality of matching results.

Banda Ramadan, Peter Christen
Incorporating Probabilistic Knowledge into Topic Models

Probabilistic Topic Models could be used to extract low-dimension aspects from document collections. However, such models without any human knowledge often produce aspects that are not interpretable. In recent years, a number of knowledge-based models have been proposed, which allow the user to input prior knowledge of the domain to produce more coherent and meaningful topics. In this paper, we incorporate human knowledge in the form of probabilistic knowledge base into topic models. By combining latent Dirichlet allocation, a widely used topic model with Probase, a large-scale probabilistic knowledge base, we improve the semantic coherence significantly. Our evaluation results will demonstrate the effectiveness of our method.

Liang Yao, Yin Zhang, Baogang Wei, Hongze Qian, Yibing Wang
Learning Focused Hierarchical Topic Models with Semi-Supervision in Microblogs

Topic modeling approaches, such as Latent Dirichlet Allocation (LDA) and Hierarchical LDA (hLDA) have been used extensively to discover topics in various corpora. Unfortunately, these approaches do not perform well when applied to collections of social media posts. Further, these approaches do not allow users to focus topic discovery around subjectively interesting concepts. We propose the new Semi-Supervised Microblog-hLDA (SS-Micro-hLDA) model to discover topic hierarchies in short, noisy microblog documents in a way that allows users to focus topic discovery around interesting areas. We test SS-Micro-hLDA using a large, public collection of Twitter messages and Reddit social blogging site and show that our model outperforms hLDA, Constrained-hLDA, Recursive-rCRP and TSSB in terms of Pointwise Mutual Information (PMI) Score. Further, we test our model in terms of information entropy of held-out data and show that the new approach produces highly focused topic hierarchies.

Anton Slutsky, Xiaohua Hu, Yuan An
Predicting Future Links Between Disjoint Research Areas Using Heterogeneous Bibliographic Information Network

Literature-based discovery aims to discover hidden connections between previously disconnected research areas. Heterogeneous bibliographic information network (HBIN) provides a latent, semi-structured, bibliographic information model to signal the potential connections between scientific papers. This paper introduces a novel literature-based discovery method that builds meta path features from HBIN network to predict co-citation links between previously disconnected literatures. We evaluated the performance of our method in predicting future co-citation links between fish oil and Raynaud’s syndrome papers. Our experimental results showed that HBIN meta path features could predict future co-citation links between these papers with high accuracy (0.851 F-Measure; 0.845 precision; 0.857 recall), outperforming the existing document similarity algorithms such as LDA, TF-IDF, and Bibliographic Coupling.

Yakub Sebastian, Eu-Gene Siew, Sylvester Olubolu Orimaye

Itemset and High Performance Data Mining

Frontmatter
CPT+: Decreasing the Time/Space Complexity of the Compact Prediction Tree

Predicting next items of sequences of symbols has many applications in a wide range of domains. Several sequence prediction models have been proposed such as DG, All-k-order markov and PPM. Recently, a model named Compact Prediction Tree (CPT) has been proposed. It relies on a tree structure and a more complex prediction algorithm to offer considerably more accurate predictions than many state-of-the-art prediction models. However, an important limitation of CPT is its high time and space complexity. In this article, we address this issue by proposing three novel strategies to reduce CPT’s size and prediction time, and increase its accuracy. Experimental results on seven real life datasets show that the resulting model (CPT+) is up to 98 times more compact and 4.5 times faster than CPT, and has the best overall accuracy when compared to six state-of-the-art models from the literature: All-K-order Markov, CPT, DG, Lz78, PPM and TDAG.

Ted Gueniche, Philippe Fournier-Viger, Rajeev Raman, Vincent S. Tseng
Mining Association Rules in Graphs Based on Frequent Cohesive Itemsets

Searching for patterns in graphs is an active field of data mining. In this context, most work has gone into discovering subgraph patterns, where the task is to find strictly defined frequently re-occurring structures, i.e., node labels always interconnected in the same way. Recently, efforts have been made to relax these strict demands, and to simply look for node labels that frequently occur near each other. In this setting, we propose to mine association rules between such node labels, thus discovering additional information about correlations and interactions between node labels. We present an algorithm that discovers rules that allow us to claim that if a set of labels is encountered in a graph, there is a high probability that some other set of labels can be found nearby. Experiments confirm that our algorithm efficiently finds valuable rules that existing methods fail to discover.

Tayena Hendrickx, Boris Cule, Pieter Meysman, Stefan Naulaerts, Kris Laukens, Bart Goethals
Mining High Utility Itemsets in Big Data

In recent years, extensive studies have been conducted on

high utility itemsets

(

HUI

) mining with wide applications. However, most of them assume that data are stored in centralized databases with a single machine performing the mining tasks. Consequently, existing algorithms cannot be applied to the big data environments, where data are often distributed and too large to be dealt with by a single machine. To address this issue, we propose a new framework for

mining high utility itemsets in big data

. A novel algorithm named

PHUI-Growth

(

Parallel mining High Utility Itemsets by pattern-Growth

) is proposed for parallel mining HUIs on Hadoop platform, which inherits several nice properties of Hadoop, including easy deployment, fault recovery, low communication overheads and high scalability. Moreover, it adopts the MapReduce architecture to partition the whole mining tasks into smaller independent subtasks and uses Hadoop distributed file system to manage distributed data so that it allows to parallel discover HUIs from distributed data across multiple commodity computers in a reliable, fault tolerance manner. Experimental results on both synthetic and real datasets show that

PHUI-Growth

has high performance on large-scale datasets and outperforms state-of-the-art non-parallel type of HUI mining algorithms.

Ying Chun Lin, Cheng-Wei Wu, Vincent S. Tseng
Decomposition Based SAT Encodings for Itemset Mining Problems

Recently, several constraint programming (CP)/propositional satisfiability (SAT) based encodings have been proposed to deal with various data mining problems including itemset and sequence mining problems. This research issue allows to model data mining problems in a declarative way, while exploiting efficient and generic solving techniques. In practice, for large datasets, they usually lead to constraints network/Boolean formulas of huge size. Space complexity is clearly identified as the main bottleneck behind the competitiveness of these new declarative and flexible models w.r.t. specialized data mining approaches. In this paper, we address this issue by considering SAT based encodings of itemset mining problems. By partitioning the transaction database, we propose a new encoding framework for SAT based itemset mining problems. Experimental results on several known datasets show significant improvements, up to several orders of magnitude.

Said Jabbour, Lakhdar Sais, Yakoub Salhi
A Comparative Study on Parallel LDA Algorithms in MapReduce Framework

Although several parallel latent Dirichlet allocation (LDA) algorithms have been implemented to extract topic features from large-scale text data sets, very few studies compare their performance in real-world industrial applications. In this paper, we build a novel multi-channel MapReduce framework to compare fairly three representative parallel LDA algorithms such as parallel variational Bayes (PVB), parallel Gibbs sampling (PGS) and parallel belief propagation (PBP). Experimental results confirm that PGS yields the best application performance in search engine and online advertising system of Tencent, one of the biggest Internet companies in China, while PBP has the highest topic modeling accuracy. Moreover, PGS is more scalable in MapReduce framework than PVB and PBP because of its low memory usage and efficient sampling technique.

Yang Gao, Zhenlong Sun, Yi Wang, Xiaosheng Liu, Jianfeng Yan, Jia Zeng
Distributed Newton Methods for Regularized Logistic Regression

Regularized logistic regression is a very useful classification method, but for large-scale data, its distributed training has not been investigated much. In this work, we propose a distributed Newton method for training logistic regression. Many interesting techniques are discussed for reducing the communication cost and speeding up the computation. Experiments show that the proposed method is competitive with or even faster than state-of-the-art approaches such as Alternating Direction Method of Multipliers (ADMM) and Vowpal Wabbit (VW). We have released an MPI-based implementation for public use.

Yong Zhuang, Wei-Sheng Chin, Yu-Chin Juan, Chih-Jen Lin

Recommendation

Frontmatter
Coupled Matrix Factorization Within Non-IID Context

Recommender systems research has experienced different stages such as from user preference understanding to content analysis. Typical recommendation algorithms were built on the following bases: (1) assuming users and items are IID, namely independent and identically distributed, and (2) focusing on specific aspects such as user preferences or contents. In reality, complex recommendation tasks involve and request (1) personalized outcomes to tailor heterogeneous subjective preferences; and (2) explicit and implicit objective coupling relationships between users, items, and ratings to be considered as intrinsic forces driving preferences. This inevitably involves the non-IID complexity and the need of combining subjective preference with objective couplings hidden in recommendation applications. In this paper, we propose a novel generic coupled matrix factorization (CMF) model by incorporating non-IID coupling relations between users and items. Such couplings integrate the intra-coupled interactions within an attribute and inter-coupled interactions among different attributes. Experimental results on two open data sets demonstrate that the user/item couplings can be effectively applied in RS and CMF outperforms the benchmark methods.

Fangfang Li, Guandong Xu, Longbing Cao
Complementary Usage of Tips and Reviews for Location Recommendation in Yelp

Location-based social networks (LBSNs) allow users to share the locations that they have visited with others in a number of ways. LBSNs like Foursquare allow users to ‘check in’ to a location to share their locations with their friends. However, in Yelp, users can engage with the LBSN via modes other than check-ins. Specifically, Yelp allows users to write ‘tips’ and ‘reviews’ for the locations that they have visited. The geo-social correlations in LBSNs have been exploited to build systems that can recommend new locations to users. Traditionally, recommendation systems for LBSNs have leveraged check-ins to generate location recommendations. We demonstrate the impact of two new modalities - tips and reviews, on location recommendation. We propose a graph based recommendation framework which reconciles the ‘tip’ and ‘review’ space in Yelp in a complementary fashion. In the process, we define novel intra-user and intra-location links leveraging tip and review information, leading to a 15% increase in precision over the existing approaches.

Saurabh Gupta, Sayan Pathak, Bivas Mitra
Coupling Multiple Views of Relations for Recommendation

Learning user/item relation is a key issue in recommender system, and existing methods mostly measure the user/item relation from one particular aspect, e.g., historical ratings, etc. However, the relations between users/items could be influenced by multifaceted factors, so any single type of measure could get only a partial view of them. Thus it is more advisable to integrate measures from different aspects to estimate the underlying user/item relation. Furthermore, the estimation of underlying user/item relation should be optimal for current task. To this end, we propose a novel model to couple multiple relations measured on different aspects, and determine the optimal user/item relations via learning the optimal way of integrating these relation measures. Specifically, matrix factorization model is extended in this paper by considering the relations between latent factors of different users/items. Experiments are conducted and our method shows good performance and outperforms other baseline methods.

Bin Fu, Guandong Xu, Longbing Cao, Zhihai Wang, Zhiang Wu
Pairwise One Class Recommendation Algorithm

We address the problem of one class recommendation for a special implicit feedback scenario, where training data only contain binary relevance data that indicate user’ selection or non-selection. A typical example is the followship in social network. In this context, the extreme sparseness raised by sparse positive examples and the ambiguity caused by the lack of negative examples are two main challenges to be tackled with. We dedicate to propose a new model which is tailored to cope with this two challenges and achieve a better topN performance. Our approach is a pairwise rank-oriented model, which is derived on basis of a rank-biased measure Mean Average Precision raised in Information Retrieval. First, we consider rank differences between item pairs and construct a measure function. Second, we integrate the function with a condition formula which is deduced via taking user-biased and item-biased factors into consideration. The two factors are determined by the number of items a user selected and the number of users an item is selected by respectively. Finally, to be tractable for larger dataset, we propose a fast leaning method based on a sampling schema. At the end, we demonstrate the efficiency of our approach by experiments performed on two public available databases of social network, and the topN performance turns out to outperform baselines significantly.

Huimin Qiu, Chunhong Zhang, Jiansong Miao
RIT: Enhancing Recommendation with Inferred Trust

Trust-based recommendation, which aims to incorporate trust relationships between users to improve recommendation performance, has drawn much attention recently. The focus of existing trust-based recommendation methods is on how to use the observed trust relationships. However, the observed trust relationships are usually very sparse in many real applications. In this paper, we propose to infer some unobserved trust relationships to tackle the sparseness problem. In particular, we first infer the unobserved trust relationships by propagating trust along the observed trust relationships; we then propose a novel trust-based recommendation model to combine observed trust and inferred trust where their relative weights are also learnt. Experimental evaluations on two real datasets show the superior of the proposed method in terms of recommendation accuracy.

Guo Yan, Yuan Yao, Feng Xu, Jian Lu
Backmatter
Metadaten
Titel
Advances in Knowledge Discovery and Data Mining
herausgegeben von
Tru Cao
Ee-Peng Lim
Zhi-Hua Zhou
Tu-Bao Ho
David Cheung
Hiroshi Motoda
Copyright-Jahr
2015
Electronic ISBN
978-3-319-18032-8
Print ISBN
978-3-319-18031-1
DOI
https://doi.org/10.1007/978-3-319-18032-8

Premium Partner