
About this Book

This two-volume set, LNAI 10234 and 10235, constitutes the thoroughly refereed proceedings of the 21st Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2017, held in Jeju, South Korea, in May 2017.

The 129 full papers were carefully reviewed and selected from 458 submissions. They are organized in topical sections named: classification and deep learning; social network and graph mining; privacy-preserving mining and security/risk applications; spatio-temporal and sequential data mining; clustering and anomaly detection; recommender system; feature selection; text and opinion mining; clustering and matrix factorization; dynamic, stream data mining; novel models and algorithms; behavioral data mining; graph clustering and community detection; dimensionality reduction.

Table of Contents

Frontmatter

Clustering and Anomaly Detection

Frontmatter

A Targeted Retraining Scheme of Unsupervised Word Embeddings for Specific Supervised Tasks

This paper proposes a simple retraining scheme to purposefully adjust unsupervised word embeddings for specific supervised tasks, such as sentence classification. Unlike current methods, which fine-tune word embeddings on the training set through the supervised learning procedure, our method treats task labels as implicit context information to retrain word embeddings, so that every word required for the intended task obtains a task-specific representation. Moreover, because our method is independent of the supervised learning process, it carries less risk of over-fitting. We have validated the rationality of our method on various sentence classification tasks. The improvements in accuracy are remarkable when only scarce training data are available.

Pengda Qin, Weiran Xu, Jun Guo

A Neural Joint Model for Extracting Bacteria and Their Locations

Extracting Lives_In relations between bacteria and their locations involves two steps, namely bacteria/location entity recognition and Lives_In relation classification. Previous work solved this task with pipeline models, which may suffer from error propagation and cannot exploit the interactions between the two steps. We follow the line of work using joint models, which perform the two subtasks simultaneously to obtain better performance. A state-of-the-art neural joint model for relation extraction in the Automatic Content Extraction (ACE) task is adapted to our task. Furthermore, we propose two strategies to improve this model. First, a novel relation is introduced in the second step to detect, and thus correct, some errors made in the first step. Second, we replace the original greedy-search decoding with beam search and train the model with early-update techniques. Experimental results on a standard dataset for this task show that our adapted model achieves better precision than other systems. After adding the novel relation, we gain a nearly 2% improvement in F1 for Lives_In relation extraction. When beam search is used, the F1 is further improved by 6%. These results demonstrate that our proposed strategies are effective for this task. However, additional experiments show that the performance improvement on another dataset of bacteria and location extraction is not significant. Therefore, whether our methods are effective for other relation extraction tasks needs further investigation.

Fei Li, Meishan Zhang, Guohong Fu, Donghong Ji

Advanced Computation of Sparse Precision Matrices for Big Data

The precision matrix is the inverse of the covariance matrix. Estimating large sparse precision matrices is an interesting and challenging problem in many fields of science, engineering, the humanities, and machine learning in general. Recent applications often encounter high dimensionality with a limited number of data points, leading to a number of covariance parameters that greatly exceeds the number of observations and hence to a singular covariance matrix. Several methods have been proposed to deal with this challenging problem, but there is no guarantee that the obtained estimator is positive definite. Furthermore, in many cases, one needs to capture some additional information on the setting of the problem. In this paper, we introduce a criterion that ensures the positive definiteness of the precision matrix and propose the inner-outer alternating direction method of multipliers as an efficient method for estimating it. We show that convergence of the algorithm is ensured with a sufficiently relaxed stopping criterion in the inner iteration. We also show that the proposed method converges and is robust, accurate, and scalable, as it lends itself to an efficient implementation on parallel computers.

Abdelkader Baggag, Halima Bensmail, Jaideep Srivastava
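As background for the alternating-direction approach described above, the following is a minimal sketch of ADMM for sparse precision-matrix estimation using the classical graphical-lasso splitting; it is not the authors' inner-outer variant, and the sample covariance S, penalty lam, and step rho are illustrative assumptions.

```python
import numpy as np

def sparse_precision_admm(S, lam=0.1, rho=1.0, n_iter=100):
    """Estimate a sparse precision matrix from a sample covariance S via a basic
    ADMM splitting (graphical-lasso style); X stays positive definite by construction."""
    p = S.shape[0]
    Z = np.eye(p)          # sparse copy of the precision estimate
    U = np.zeros((p, p))   # scaled dual variable
    for _ in range(n_iter):
        # X-update: eigen-decompose rho*(Z - U) - S and solve the PD subproblem
        w, Q = np.linalg.eigh(rho * (Z - U) - S)
        x = (w + np.sqrt(w ** 2 + 4.0 * rho)) / (2.0 * rho)
        X = Q @ np.diag(x) @ Q.T
        # Z-update: element-wise soft-thresholding enforces sparsity
        A = X + U
        Z = np.sign(A) * np.maximum(np.abs(A) - lam / rho, 0.0)
        # dual update
        U += X - Z
    return Z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(50, 10))      # few observations relative to dimension
    S = np.cov(data, rowvar=False)
    print(np.round(sparse_precision_admm(S, lam=0.2), 2))
```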

Accurate Recognition of the Current Activity in the Presence of Multiple Activities

Sensor-based activity recognition (AR) has gained extensive attention in recent years due to the ubiquitous presence of smart devices, such as smartphones and smartwatches. One of the major challenges posed by AR is to reliably recognize the current activity when a given window of time series data contains several activities. Most traditional AR methods assume the entire window corresponds to a single activity, which may cause a high error rate in activity recognition. To overcome this challenge, we propose a Weighted Min-max Activity Recognition Model (WMARM), which reliably predicts the current activity by finding an optimal partition of the time series matching the activities that occurred. WMARM can handle time series containing an arbitrary number of activities, without any prior knowledge about the number of activities. We devise an efficient dynamic programming algorithm that solves WMARM in $$\mathcal{O}(n^2)$$ time, where n is the length of the window. Extensive experiments conducted on 5 real datasets demonstrate about a 10%–30% improvement in accuracy of WMARM compared to state-of-the-art methods.

Weihao Cheng, Sarah Erfani, Rui Zhang, Ramamohanarao Kotagiri
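The optimal-partition idea above can be illustrated with a generic min-max segmentation dynamic program. This is only a sketch: the cost function, the naive O(K·n^2) loop structure, and the toy window are assumptions, not the WMARM objective or its O(n^2) algorithm.

```python
import statistics

def optimal_partition(values, segment_cost, max_segments):
    """Split `values` into at most `max_segments` contiguous segments so that the
    worst (maximum) per-segment cost is minimal.  `segment_cost(i, j)` is a
    caller-supplied stand-in for a per-activity model score on values[i:j]."""
    n = len(values)
    INF = float("inf")
    # best[k][j]: minimal worst-segment cost covering values[:j] with k segments
    best = [[INF] * (n + 1) for _ in range(max_segments + 1)]
    cut = [[0] * (n + 1) for _ in range(max_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, max_segments + 1):
        for j in range(1, n + 1):
            for i in range(j):
                cand = max(best[k - 1][i], segment_cost(i, j))
                if cand < best[k][j]:
                    best[k][j], cut[k][j] = cand, i
    k = min(range(1, max_segments + 1), key=lambda s: best[s][n])
    segments, j = [], n
    while k > 0:                 # backtrack the segment boundaries
        i = cut[k][j]
        segments.append((i, j))
        j, k = i, k - 1
    return segments[::-1]

# toy window: low readings, then high readings, then medium readings
window = [0.1, 0.2, 0.1, 5.0, 5.2, 4.9, 1.0, 1.1]
cost = lambda i, j: statistics.pvariance(window[i:j]) if j - i > 1 else 0.0
print(optimal_partition(window, cost, max_segments=3))   # [(0, 3), (3, 6), (6, 8)]
```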

Modeling Information Sharing Behavior on Q&A Forums

Q&A forums pool massive amounts of crowd expertise from a broad spectrum of geographical, cultural, and disciplinary knowledge toward specific, user-posed questions. Existing studies on these forums focus on how to route questions to the best answerers based on content or on predicting whether a question will be answered, but few investigate the inherent knowledge sharing relationships among users. We study knowledge sharing among users of StackOverflow, a popular Q&A forum, where the knowledge sharing process is related to the time elapsed since a question was posted, the reputation of the questioner, and the content of the posted text. Taking these factors into consideration, this paper proposes the time-based information sharing model (TISM), in which the likelihood that a user will share or provide knowledge to another is modeled as a continuous function of time, reputation, and post length. With the resulting knowledge sharing network learned by TISM, we are able to predict, for a given question, the number of responses over time, who will answer the question, and who will provide the accepted answer. Our experiments show that predictions using TISM outperform NetRate, query likelihood language, random forest, and linear regression models.

Biru Cui, Shanchieh Jay Yang, Christopher M. Homan

Effective Multiclass Transfer for Hypothesis Transfer Learning

In this paper, we investigate the visual domain adaptation problem under the setting of Hypothesis Transfer Learning (HTL), where we can only access the source model instead of the data. However, previous studies of HTL are limited to leveraging the knowledge from a certain type of source classifier or suffer from low transfer efficiency on a small training set. We aim at two important issues: effectiveness of the transfer on a small target training set and compatibility of the transfer model for real-world HTL problems. To address these two issues, we propose our method, Effective Multiclass Transfer Learning (EMTLe). We demonstrate that EMTLe, which uses the predictions of the source models as the transferable knowledge, can exploit the knowledge of different types of source classifiers. We use the transfer parameter to weigh the importance of the prediction of each source model as the auxiliary bias. Then we use bi-level optimization to estimate the transfer parameter and demonstrate that we can effectively obtain the optimal transfer parameter with our novel objective function. Empirical results show that EMTLe can effectively exploit the knowledge and outperform other HTL baselines when the size of the target training set is small.

Shuang Ao, Xiang Li, Charles X. Ling

Clustering Based on Dominant Set and Cluster Expansion

While numerous clustering algorithms can be found in the literature, existing algorithms are usually afflicted by two major problems. First, the majority of clustering algorithms require user-specified parameters as input, and their clustering results rely heavily on these parameters. Second, many algorithms generate clusters of only spherical shapes. In this paper we try to solve these two problems based on dominant sets and cluster expansion. We first use a modified dominant sets clustering algorithm to generate initial clusters, which are parameter-independent and usually smaller than the real clusters. Then we expand the initial clusters using two density-based clustering algorithms to generate clusters of arbitrary shapes. In experiments on various datasets our algorithm outperforms the original dominant sets algorithm and several other algorithms. It is also shown to be effective in image segmentation experiments.

Jian Hou, Weixue Liu

Recommender Systems

Frontmatter

Friend Recommendation Considering Preference Coverage in Location-Based Social Networks

Friend recommendation (FR) has become a valuable service in location-based social networks. Its essential purpose is to meet users' social needs and their demand for information. Most existing friend recommendation methods focus mainly on preference similarity and common friends between users to improve recommendation quality. Since similar users are likely to have similar preferences for points of interest (POIs), the information they provide is limited and redundant and cannot cover all of the target user's POI preferences. This paper aims to increase the amount of information on users' preferences obtained through FR. We define the friend recommendation considering preference coverage problem (FRPCP) and show that it is NP-hard. We propose a greedy algorithm to solve it. Experiments on large-scale LBSN datasets validate the recommendation quality and show a significant increase in the degree of preference coverage compared to typical existing recommendation approaches.

Fei Yu, Nan Che, Zhijun Li, Kai Li, Shouxu Jiang
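The NP-hard coverage problem above suggests the standard greedy max-coverage heuristic. Here is a minimal sketch of that generic heuristic; the candidate friends, their POI categories, and the budget k are hypothetical, and the paper's exact FRPCP objective is not reproduced.

```python
def greedy_friend_selection(candidate_prefs, target_prefs, k):
    """Greedy max-coverage: repeatedly pick the candidate friend covering the most
    still-uncovered POI preferences of the target user.
    candidate_prefs: dict friend -> set of POI categories that friend can inform about."""
    uncovered = set(target_prefs)
    chosen = []
    while uncovered and len(chosen) < k:
        friend = max(candidate_prefs, key=lambda f: len(candidate_prefs[f] & uncovered))
        gain = candidate_prefs[friend] & uncovered
        if not gain:              # no remaining candidate adds coverage
            break
        chosen.append(friend)
        uncovered -= gain
    return chosen, uncovered

# toy example with hypothetical preference categories
candidates = {"u1": {"cafe", "museum"}, "u2": {"museum", "park"}, "u3": {"cafe", "bar", "park"}}
print(greedy_friend_selection(candidates, {"cafe", "museum", "park", "bar"}, k=2))
```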

Contrast Pattern Based Collaborative Behavior Recommendation for Life Improvement

Positive attitudes and happiness have major impacts on human health and, in particular, recovery from illness. While the contributing factors that lead human beings to positive emotional states are studied in psychology, the effects of these factors vary from one person to another. We propose a behaviour recommendation system that recommends the most effective behaviours leading users with a negative mental state (i.e., unhappiness) to a positive emotional state (i.e., happiness). By leveraging the contrast pattern mining framework, we extract the common contrasting behaviours between happy and unhappy users. These contrast patterns are aligned with user behaviours and habits. We find personalized behaviour recommendations for those with negative emotional states by placing the problem within the nearest-neighborhood collaborative filtering framework. A real dataset of people with heart disease or diabetes is used in our recommendation system. The experiments conducted show that the proposed method can be effective in the health-care domain.

Yan Chen, Margot Lisa-Jing Yann, Heidar Davoudi, Joy Choi, Aijun An, Zhen Mei

Exploiting Location Significance and User Authority for Point-of-Interest Recommendation

With the rapid growth of location-based social networks (LBSNs), point-of-interest (POI) recommendation has become indispensable. Several approaches have been proposed to support personalized POI recommendation in LBSNs. However, most existing matrix factorization based methods treat users' check-in frequencies as ratings in traditional recommender systems and model users' check-in behaviors using the Gaussian distribution, which is unsuitable for modeling the heavily skewed frequency data. In addition, few methods systematically consider the effects of location significance and user authority on users' final check-in decision processes. In this paper, we integrate a probabilistic factor model and location significance to model users' check-in behaviors, and propose a location significance and user authority enhanced probabilistic factor model. Specifically, a hybrid of HITS and PageRank is adapted to compute user authority and location significance. Moreover, user authorities are used to weight users' implicit feedback. Experimental results on two real world data sets show that our proposed approach outperforms state-of-the-art POI recommendation algorithms.

Yonghong Yu, Hao Wang, Shuanzhu Sun, Yang Gao

Personalized Ranking Recommendation via Integrating Multiple Feedbacks

Recently, recommender systems have attracted a lot of attention; they help users find items of interest by utilizing user-item interaction information and/or content information associated with users and items. The interaction information (i.e., feedback) between users and items is widely exploited to build recommendation models. The feedback data in recommender systems usually comes in the form of both explicit feedback (e.g., ratings) and implicit feedback (e.g., browsing histories, click logs). Although existing works have begun to utilize either explicit or implicit feedback for better recommendation, they do not make the best use of these types of feedback together. In this paper, we first study the personalized ranking recommendation problem by integrating multiple feedbacks, i.e., one type of explicit feedback and multiple types of implicit feedback. We then propose a unified and flexible personalized ranking framework, MFPR, to integrate multiple feedbacks. Moreover, as there are no readily available training data, an explicit-feedback-based training data generation algorithm is designed to generate item pairs with a more accurate partial order consistent with the multiple feedbacks for the proposed ranking model. Extensive experiments on two real-world datasets validate the effectiveness of the MFPR model, and the integration of multiple feedbacks, which provide complementary information, significantly improves recommendation performance.

Jian Liu, Chuan Shi, Binbin Hu, Shenghua Liu, Philip S. Yu

Fairness Aware Recommendations on Behance

Traditionally, recommender systems strive to maximize user acceptance of the recommendations, while more recently, diversity and serendipity have also been addressed. In two-sided platforms, users can have two personas: consumers, who would like relevant and diverse recommendations, and creators, who would like to receive exposure for their creations. If new creators do not get adequate exposure, they tend to leave the platform; consequently, less content is generated, resulting in lower consumer satisfaction. We propose a re-ranking strategy that can be applied to scored recommendation lists to improve exposure distribution across creators (thereby improving fairness), without unduly affecting the relevance of recommendations provided to consumers. We also propose a different notion of diversity, which we call representative diversity, as opposed to dissimilarity-based diversity, that captures the consumer's level of interest in different categories. We show that our method results in recommendations that have a much higher level of fairness and representative diversity compared to state-of-the-art recommendation strategies, without compromising the relevance score too much. Interestingly, higher diversity and fairness lead to an increased user acceptance rate of the recommendations.

Natwar Modani, Deepali Jain, Ujjawal Soni, Gaurav Kumar Gupta, Palak Agarwal

A Performance Evaluation Model for Taxi Cruising Path Recommendation System

Recommending an appropriate route to reduce the mileage taxi drivers spend without a fare is a long-standing challenge. The usual solution is to find the route with optimal performance, where performance typically combines the conditional probability of picking up a passenger with the cruising distance. However, this formulation has limitations. To eliminate them, a novel model is proposed to evaluate candidate route performance, and a recommendation system based on this new model is tested. First, by mining historical taxi trajectories, we extract temporal probabilistic recommending points. Based on these, the evaluation model estimates the performance of each candidate route. Finally, a route recommendation algorithm selects the optimal route for taxi drivers. Experiments performed on a real-world taxi trajectory data set show the effectiveness of the proposed model for evaluating route performance.

Huimin Lv, Fang Fang, Yishi Zhao, Yuanyuan Liu, Zhongwen Luo

MaP2R: A Personalized Maximum Probability Route Recommendation Method Using GPS Trajectories

Personalized travel route recommendation refers to the planning of an optimal travel route between two geographical locations based on the road network and users' travel preferences. In this paper, we extract users' travel behaviours from their historical GPS trajectories and propose a personalized maximum probability route recommendation method called MaP2R. MaP2R utilizes the concepts of appearance behaviour and transition behaviour to describe users' travel behaviours and applies matrix factorization and Laplace smoothing to estimate users' travel behaviour probabilities. When making a recommendation, a route with the maximum probability of the user's travel behaviours is generated based on the Markov property and searched over a generated behaviour graph. Experimental results on a real GPS trajectory dataset show that the proposed MaP2R achieves better results for travel route recommendation than existing methods.

Ge Cui, Xin Wang
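One generic way to search for a maximum-probability route over a behaviour graph, as described above, is to turn products of transition probabilities into sums of negative logs and run Dijkstra's algorithm. The sketch below illustrates only that reduction; the behaviour graph and its probabilities are hypothetical and do not reproduce MaP2R's estimated behaviour probabilities.

```python
import heapq
import math

def max_probability_route(graph, start, goal):
    """graph: dict node -> list of (next_node, transition_probability).
    Maximizing the product of probabilities equals minimizing the sum of -log p,
    so an ordinary Dijkstra search finds the maximum-probability route."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist.get(node, math.inf):
            continue
        for nxt, p in graph.get(node, []):
            nd = d - math.log(p)
            if nd < dist.get(nxt, math.inf):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if goal not in dist:
        return None, 0.0
    path, node = [goal], goal
    while node != start:          # rebuild the route from the predecessor map
        node = prev[node]
        path.append(node)
    return path[::-1], math.exp(-dist[goal])

# toy behaviour graph with hypothetical transition probabilities
g = {"A": [("B", 0.9), ("C", 0.5)], "B": [("D", 0.6)], "C": [("D", 0.95)]}
print(max_probability_route(g, "A", "D"))   # A->B->D with probability 0.54
```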

Feature Selection

Frontmatter

SNE: Signed Network Embedding

Several network embedding models have been developed for unsigned networks. However, these models based on skip-gram cannot be applied to signed networks because they can only deal with one type of link. In this paper, we present our signed network embedding model called SNE. Our SNE adopts the log-bilinear model, uses node representations of all nodes along a given path, and further incorporates two signed-type vectors to capture the positive or negative relationship of each edge along the path. We conduct two experiments, node classification and link prediction, on both directed and undirected signed networks and compare with four baselines including a matrix factorization method and three state-of-the-art unsigned network embedding models. The experimental results demonstrate the effectiveness of our signed network embedding.

Shuhan Yuan, Xintao Wu, Yang Xiang

mHUIMiner: A Fast High Utility Itemset Mining Algorithm for Sparse Datasets

High utility itemset mining is the problem of finding sets of items whose utilities are higher than or equal to a specific threshold. We propose a novel technique called mHUIMiner, which utilises a tree structure to guide the itemset expansion process to avoid considering itemsets that are nonexistent in the database. Unlike current techniques, it does not have a complex pruning strategy that requires expensive computation overhead. Extensive experiments have been done to compare mHUIMiner to other state-of-the-art algorithms. The experimental results show that our technique outperforms the state-of-the-art algorithms in terms of running time for sparse datasets.

Alex Yuxuan Peng, Yun Sing Koh, Patricia Riddle

Partial Tree-Edit Distance: A Solution to the Default Class Problem in Pattern-Based Tree Classification

Pattern-based tree classifiers are capable of producing high quality results, however, they are prone to the problem of the default class overuse. In this paper, we propose a measure designed to address this issue, called partial tree-edit distance (PTED), which allows for assessing the degree of containment of one tree in another. Furthermore, we propose an algorithm which calculates the measure and perform an experiment involving pattern-based classification to illustrate its usefulness. The results show that incorporating PTED into the classification scheme allowed us to significantly improve the accuracy on the tested datasets.

Maciej Piernik, Tadeusz Morzy

A Domain-Agnostic Approach to Spam-URL Detection via Redirects

Web services like social networks and video streaming sites draw numerous viewers daily. This popularity makes them attractive targets for spammers to distribute hyperlinks to malicious content. In this work we propose a new approach for detecting spam URLs on the Web. Our key idea is to leverage the properties of URL redirections widely deployed by spammers. We combine the redirect chains into a redirection graph that reveals the underlying infrastructure in which the spammers operate, and design our method to build on key characteristics closely associated with the modus operandi of the spammers. Different from previous work, our approach exhibits three key characteristics: (1) domain-independence, which enables it to generalize across different Web services, (2) adversarial robustness, which makes evasion difficult, risky, or costly for spammers because it is tightly coupled with their operational behavior, and (3) semi-supervised detection, which uses only a few labeled examples to produce competitive results thanks to its effective use of the redundancy in spammers' operations. Evaluation on large Twitter datasets shows that we achieve above 0.96 recall and 0.70 precision with a false positive rate below 0.07, using only 1% of labeled data.

Heeyoung Kwon, Mirza Basim Baig, Leman Akoglu

Automatic and Effective Mining of Coevolving Online Activities

Given a large collection of time-evolving online user activities, such as Google Search queries for multiple keywords of various categories (celebrities, events, diseases, etc.), which consist of $$d$$ keywords/activities for $$l$$ countries/locations over a duration $$n$$, how can we find patterns and rules? How do we go about capturing non-linear evolutions of local activities and forecasting future patterns? We also aim to monitor the data sequences statistically and detect patterns immediately. In this paper, we present Δ-SPOT, a unifying analytical non-linear model for analysing large-scale web search data, which is sense-making, automatic, scalable, and free of parameters. Δ-SPOT can also forecast long-range future dynamics of the keywords/queries. In addition, we provide an efficient and effective fitting algorithm, which leads to novel discoveries and sense-making features and contributes to the need for monitoring multiple co-evolving data sequences.

Thinh Minh Do, Yasuko Matsubara, Yasushi Sakurai

Keeping Priors in Streaming Bayesian Learning

Exploiting prior knowledge in the Bayesian learning process is one way to improve the quality of a Bayesian model. To the best of our knowledge, however, there is no formal research on the influence of priors in streaming environments. In this paper, we address the problem of using prior knowledge in streaming Bayesian learning and develop a framework for keeping priors in streaming learning (KPS) that maintains knowledge from the prior through each minibatch of streaming data. We demonstrate the performance of our framework in two scenarios, streaming learning for latent Dirichlet allocation and streaming text classification, in comparison with methods that do not keep priors.

Anh Nguyen Duc, Ngo Van Linh, Anh Nguyen Kim, Khoat Than

Text and Opinion Mining

Frontmatter

Efficient Training of Adaptive Regularization of Weight Vectors for Semi-structured Text

We propose an efficient training method for Confidence Weighted Learning (CWL) algorithms on semi-structured text and its application to Adaptive Regularization of Weight Vectors (AROW), a CWL algorithm. CWL algorithms are online learning algorithms that combine large margin training and confidence weighting of features. Because CWL algorithms learn confidence weights of features, it is difficult to apply kernel methods that implicitly expand features, and expanding features in advance leads to increased memory usage. To solve this problem, we propose a training method that dynamically extracts features from semi-structured text. In addition, we propose a pruning method for improved training speed: the pruning skips training samples that have been classified correctly at least a certain number of times. We compared our method, using word strings as semi-structured texts, with AROW that expands all features in advance. Experimental results on text classification tasks on an Amazon data set show that our training method improves memory usage and yields two to three times faster training while maintaining accuracy when learning longer n-grams.

Tomoya Iwakura
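For readers unfamiliar with AROW, here is a minimal sketch of the standard AROW update for binary classification with explicit feature vectors. It does not include the paper's dynamic feature extraction or pruning; the regularization value r, dimensions, and toy data are assumptions.

```python
import numpy as np

class AROW:
    """Adaptive Regularization of Weight Vectors for labels y in {-1, +1}.
    Keeps a mean weight vector `w` and a covariance matrix `sigma` over weights."""
    def __init__(self, dim, r=1.0):
        self.w = np.zeros(dim)
        self.sigma = np.eye(dim)
        self.r = r

    def predict(self, x):
        return 1 if self.w @ x >= 0 else -1

    def update(self, x, y):
        margin = self.w @ x
        if y * margin >= 1.0:            # confidently correct: no update needed
            return
        v = x @ self.sigma @ x           # confidence (variance) for this instance
        beta = 1.0 / (v + self.r)
        alpha = max(0.0, 1.0 - y * margin) * beta
        sigma_x = self.sigma @ x
        self.w += alpha * y * sigma_x
        self.sigma -= beta * np.outer(sigma_x, sigma_x)

# toy usage on a linearly separable stream
rng = np.random.default_rng(1)
model = AROW(dim=2)
for _ in range(200):
    x = rng.normal(size=2)
    y = 1 if x[0] + x[1] > 0 else -1
    model.update(x, y)
print(model.predict(np.array([1.0, 1.0])), model.predict(np.array([-1.0, -1.0])))
```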

Behavior-Based Location Recommendation on Location-Based Social Networks

Location recommendation suggests nearby locations based on a user's locational preferences and spatial movement patterns. In this paper, we propose two novel location recommendation methods called Behavior Factorization (BF) and Latent Behavior Analysis (LBA). Both methods utilize behavioral and spatio-temporal patterns in user movements to make location recommendations. Experiments on a real-world dataset show that the proposed methods outperform existing location recommendation methods in terms of both precision and recall. Comparing LBA and BF, we observe that LBA achieves better results since it utilizes the number of times each pattern has occurred in the dataset.

Seyyed Mohammadreza Rahimi, Xin Wang, Behrouz Far

Integer Linear Programming for Pattern Set Mining; with an Application to Tiling

Pattern set mining is an important part of a number of data mining tasks such as classification, clustering, database tiling, or pattern summarization. Efficiently mining pattern sets is a highly challenging task and most approaches use heuristic strategies. In this paper, we formulate the pattern set mining problem as an optimization task, ensuring that the produced solution is the best one from the entire search space. We propose a method based on integer linear programming (ILP) that is exhaustive, declarative and optimal. ILP solvers can exploit different constraint types to restrict the search space, and can use any pattern set measure (or combination thereof) as an objective function, allowing the user to focus on the optimal result. We illustrate and show the efficiency of our method by applying it to the tiling problem.

Abdelkader Ouali, Albrecht Zimmermann, Samir Loudni, Yahia Lebbah, Bruno Cremilleux, Patrice Boizumault, Lakhdar Loukil
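As an illustration of the ILP framing above, the following sketch models a small maximum-k-tiling instance with the open-source PuLP modeller: choose k candidate tiles so that the number of covered cells of a 0/1 database is maximized. The candidate tiles and budget k are toy assumptions, and the paper's exact constraints and objective are not reproduced.

```python
import pulp

# candidate tiles produced by some generation step (not shown):
# each tile is (itemset, supporting row indices); a tile covers cell (row, item)
tiles = [
    ({"a", "b"}, [0, 1]),
    ({"b", "c"}, [0, 2]),
    ({"c", "d"}, [2, 3]),
    ({"a", "c"}, [0, 3]),
]
k = 2
cells = {(r, it) for items, rows in tiles for r in rows for it in items}

prob = pulp.LpProblem("max_k_tiling", pulp.LpMaximize)
t = [pulp.LpVariable(f"t{i}", cat="Binary") for i in range(len(tiles))]
c = {cell: pulp.LpVariable(f"c_{cell[0]}_{cell[1]}", cat="Binary") for cell in cells}

# objective: number of covered cells (the "area" of the tiling)
prob += pulp.lpSum(c.values())
# select at most k tiles
prob += pulp.lpSum(t) <= k
# a cell counts as covered only if some selected tile covers it
for cell, var in c.items():
    covering = [t[i] for i, (items, rows) in enumerate(tiles)
                if cell[0] in rows and cell[1] in items]
    prob += var <= pulp.lpSum(covering)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
chosen = [tiles[i][0] for i in range(len(tiles)) if t[i].value() == 1]
print("chosen tiles:", chosen, "covered cells:", int(pulp.value(prob.objective)))
```

The declarative appeal is visible even at this scale: swapping the objective for another pattern-set measure only changes the first `prob +=` line.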

Secured Privacy Preserving Data Aggregation with Semi-honest Servers

With the large-scale deployment of smart devices, the collection and analysis of user data significantly benefit both industry and people's daily lives. However, this poses a serious risk to people's privacy. Recently, combining multiparty computation and differential privacy has been a popular strategy to guarantee both computational security and output privacy in distributed data aggregation. To decrease the communication cost of the traditional multiparty computation paradigm, existing work introduces several trusted servers to undertake the main computing tasks. However, the guarantees on both security and privacy are lost when these trusted servers are vulnerable to adversaries. To address the privacy disclosure problem caused by vulnerable servers, we provide a two-layer randomisation privacy-preserving data aggregation framework with semi-honest servers (we use only their computation ability but do not trust them). Unlike the existing approach, which introduces differential privacy noise globally, our framework adds noise randomly but maintains the same differential privacy guarantee. Theoretical and experimental analysis shows that, to achieve the same security and privacy assurance, our framework provides better data utility than the existing approach.

Zhigang Lu, Hong Shen

Efficient Pedestrian Detection in the Low Resolution via Sparse Representation with Sparse Support Regression

We propose a novel pedestrian detection approach for extreme Low-Resolution (LR) images via sparse representation. Pedestrian detection in extreme LR images is very important for specific applications such as abnormal event detection and video forensics from surveillance videos. Although pedestrian detection in High-Resolution (HR) images has achieved remarkable progress, it is still a challenging task in LR images, because the discriminative information present in HR images usually disappears in LR ones. This causes the precision of detectors on LR images to decrease by a large margin. Most traditional methods enlarge the LR image with linear interpolation, which cannot preserve the high-frequency information that is very important for detectors. To solve this problem, we reconstruct the LR image at high resolution via sparse representation. In our model, LR and HR dictionaries are established in the training stage, and the mapping relations between their representative coefficients are determined. Moreover, to improve the speed of feature extraction, the feature reconstruction for LR images is converted to a sparse linear combination of the coefficients and the responses of the atoms in the HR dictionary via the LR-HR mapping, no matter how complex the feature extraction is. Experiments on four challenging datasets, Caltech, INRIA, ETH, and TUD-Brussels, demonstrate that our proposed method outperforms state-of-the-art approaches and is much more efficient, with more than a 10 times speedup.

Wenhua Fang, Jun Chen, Ruimin Hu

Multi-task Representation Learning for Enhanced Emotion Categorization in Short Text

Embedding-based dense contextual representations of data have proven to be efficient in various NLP tasks, as they alleviate the burden of heavy feature engineering. However, generalized representation learning approaches do not capture task-specific subtleties. In addition, the computational model for each task is often developed in isolation, overlooking the interrelation among certain NLP tasks. Given that representation learning typically requires a good amount of labeled annotated data, which is scarce, it is essential to explore learning embeddings jointly under the supervision of multiple related tasks while also incorporating task-specific attributes. Inspired by the basic premise of multi-task learning, which supposes that correlation between related tasks can be used to improve classification, we propose a novel technique for building jointly learnt task-specific embeddings for emotion and sentiment prediction tasks. Here, a sentiment prediction task acts as an auxiliary input to enhance the primary emotion prediction task. Our experimental results demonstrate that embeddings learnt under the supervised signals of two related tasks outperform embeddings learnt in a uni-tasked setup for the downstream task of emotion prediction.

Anirban Sen, Manjira Sinha, Sandya Mannarswamy, Shourya Roy

Fine-Grained Emotion Detection in Contact Center Chat Utterances

Contact center chats are textual conversations involving customers and agents on queries, issues, grievances, etc. about products and services. Contact centers conduct periodic analysis of these chats to measure customer satisfaction, of which chat emotion forms one crucial component. Typically, these measures are performed at the chat level. However, retrospective chat-level analysis is not sufficiently actionable for agents, as it does not capture the variation in the emotion distribution across a chat. To address this, we propose two novel weakly supervised approaches for detecting fine-grained emotions in contact center chat utterances in real time. In our first approach, we identify novel contextual and meta features and treat the task of emotion prediction as a sequence labeling problem. In the second approach, we propose a neural-network-based method for emotion prediction in call center chats that does not require extensive feature engineering. We establish the effectiveness of the proposed methods by empirically evaluating them on a real-life contact center chat dataset, achieving average accuracies of 72.6% with our first approach and 74.38% with our second approach.

Shreshtha Mundra, Anirban Sen, Manjira Sinha, Sandya Mannarswamy, Sandipan Dandapat, Shourya Roy

Dependency-Tree Based Convolutional Neural Networks for Aspect Term Extraction

Aspect term extraction is one of the fundamental subtasks in aspect-based sentiment analysis. Previous work has shown that sentences' dependency information is critical and has been widely used for opinion mining. With the recent success of deep learning in natural language processing (NLP), recurrent neural networks (RNNs) have been proposed for aspect term extraction and show superiority over feature-rich CRF-based models. However, because an RNN is a sequential model, it cannot effectively capture the tree-based dependency information of sentences, thus limiting its practicability. In order to effectively exploit sentences' dependency information and leverage the effectiveness of deep learning, we propose a novel dependency-tree based convolutional stacked neural network (DTBCSNN) for aspect term extraction, in which tree-based convolution is introduced over sentences' dependency parse trees to capture syntactic features. Our model is an end-to-end deep learning based model and does not need any human-crafted features. Furthermore, it is flexible enough to incorporate extra linguistic features to further boost performance. Results from experiments on the SemEval 2014 Task 4 datasets (reviews in the restaurant and laptop domains) show that our model achieves outstanding performance and outperforms the RNN and CRF baselines.

Hai Ye, Zichao Yan, Zhunchen Luo, Wenhan Chao

Topic Modeling over Short Texts by Incorporating Word Embeddings

Inferring topics from the overwhelming amount of short texts has become a critical but challenging task for many content analysis applications. Existing methods such as probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA) cannot solve this problem very well, since only very limited word co-occurrence information is available in short texts. This paper studies how to incorporate external word correlation knowledge into short texts to improve the coherence of topic modeling. Building on recent results in word embeddings, which learn semantic representations of words from a large corpus, we introduce a novel method, the Embedding-based Topic Model (ETM), to learn latent topics from short texts. ETM not only addresses the problem of very limited word co-occurrence information by aggregating short texts into long pseudo-texts, but also utilizes a Markov Random Field regularized model that gives correlated words a better chance of being put into the same topic. Experiments on real-world datasets validate the effectiveness of our model compared with state-of-the-art models.

Jipeng Qiang, Ping Chen, Tong Wang, Xindong Wu

Mining Drug Properties for Decision Support in Dental Clinics

The rise of polypharmacy requires health providers to be aware of a patient's drug profile before prescribing. Existing methods for extracting information on drug interactions do not integrate with the patient's medical history. This paper describes state-of-the-art approaches for extracting the term frequencies of drug properties and combining this knowledge with the patient's drug allergies and current medications to decide whether a drug is suitable for prescription. Experimental evaluation of our model, which associates the similarity ratio between two drugs (based on each drug's term frequencies) with the similarity between them, yields a superior accuracy of 79%. Similarity to a drug the patient is allergic to, or is currently taking, is an important consideration as to the suitability of a drug for prescription. Hence, such an approach, when integrated within the clinical workflow, will reduce prescription errors and thereby improve patient health outcomes.

Wee Pheng Goh, Xiaohui Tao, Ji Zhang, Jianming Yong

PURE: A Novel Tripartite Model for Review Sentiment Analysis and Recommendation

Nowadays, more and more users like to leave online reviews. These reviews, based on their experiences with services or products, often express different opinions and sentiments. The correlated topic model (CTM), an effective text mining model, can reduce dimensionality without losing important information. However, traditional analyses based on CTM still have some problems. In this paper, we propose the Product-User-Review tripartite sEntiment model (PURE), which uses content-based clustering to optimize CTM, select the topic number, extract features, and estimate review utility. Moreover, our model analyzes reviews along three dimensions: user preferences, review content, and product properties. Based on five indexes, such as informative attributes and sentiment attributes, the feature vector of the review data is constructed. We find that, after adding the user preference feature to sentiment analysis and utility estimation, PURE achieves high accuracy and classification speed on reviews that mix Chinese and English, and the quality of selection is improved significantly, by 21%. To the best of our knowledge, this is the first work to incorporate users' preference features into an optimized CTM for sentiment analysis, review selection, and recommendation.

Yue Xue, Liutong Xu, Hai Huang, Yao Cheng

Clustering and Matrix Factorization

Frontmatter

Multi-View Matrix Completion for Clustering with Side Information

In many clustering applications, real-world data are collected from multiple sources or with features from multiple channels. Thus, multi-view clustering has attracted much attention during the past few years. It is noteworthy that in many situations, in addition to the data samples, there is side information describing the relations between instances, such as must-links and cannot-links. Though side information has been well exploited in single-view clustering, it has rarely been studied in the multi-view scenario. Considering that matrix completion has sound theoretical properties and demonstrates excellent performance in single-view clustering, in this paper we propose the first matrix completion based approach for multi-view clustering with side information. Instead of concatenating multiple views into a single one, we enforce the consistency of clustering results on different views as constraints for alternating optimization, and the global optimal solution is obtained since the objective function is jointly convex. The proposed Multi-View Matrix Completion (MVMC) approach exhibits impressive performance in experiments.

Peng Zhao, Yuan Jiang, Zhi-Hua Zhou

Weighted NMF-Based Multiple Sparse Views Clustering for Web Items

Many web items contain different types of information resources, such as user profiles, comments, and user preferences. All these aspects can be seen as different views of real-world datasets and often admit the same underlying clustering of the data. However, each view of the dataset forms a huge sparse matrix, which leads to non-robust behaviour during the matrix decomposition process and further affects the accuracy of clustering results. In this paper, we use the rating values given by users as latent semantic information to handle features that are unobserved in each data point, so as to resolve the sparseness problem in all view matrices. To combine multiple views in our constructed corpus Doucom, we present WScoNMF (Weighted similarity co-regularized Non-negative Matrix Factorization), an efficient weighted matrix factorization framework that further addresses the sparseness problem in the semantic space of the data. The overall objective is to minimize the weighted NMF loss under the $$l_{2,1}$$-norm together with the co-regularized constraint under the Frobenius norm. Experimental results on all datasets demonstrate the effectiveness of the proposed method.

Xiaolong Gong, Fuwei Wang, Linpeng Huang
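The weighted-NMF building block mentioned above can be sketched with the standard multiplicative update rules, where a weight matrix masks unobserved entries. This is only the generic ingredient, not the full WScoNMF co-regularized objective; the toy matrix, rank, and iteration count are assumptions.

```python
import numpy as np

def weighted_nmf(X, W, rank, n_iter=200, eps=1e-9):
    """Factorize X ~ U @ V with non-negative U, V, minimizing ||W * (X - U V)||_F^2,
    where W weights each entry (0 for unobserved).  Standard multiplicative updates."""
    rng = np.random.default_rng(0)
    n, m = X.shape
    U = rng.random((n, rank))
    V = rng.random((rank, m))
    for _ in range(n_iter):
        WX = W * X
        U *= (WX @ V.T) / ((W * (U @ V)) @ V.T + eps)
        V *= (U.T @ WX) / (U.T @ (W * (U @ V)) + eps)
    return U, V

# toy view matrix with missing entries weighted 0
X = np.array([[5.0, 3.0, 0.0], [4.0, 0.0, 1.0], [0.0, 2.0, 5.0]])
W = (X > 0).astype(float)
U, V = weighted_nmf(X, W, rank=2)
print(np.round(U @ V, 2))   # reconstruction of the observed entries
```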

Parallel Visual Assessment of Cluster Tendency on GPU

Determining the number of clusters in a data set is a critical issue in cluster analysis. The Visual Assessment of (cluster) Tendency (VAT) algorithm is an effective tool for investigating cluster tendency, producing an intuitive matrix image as a representation of complex data sets. However, VAT can be computationally expensive for large data sets due to its $$O(N^2)$$ time complexity. In this paper, we propose an efficient parallel scheme to accelerate the original VAT using NVIDIA GPUs and the CUDA architecture. We show that, on a range of data sets, the GPU-based VAT features good scalability and can achieve significant speedups compared to the original algorithm.

Tao Meng, Bo Yuan
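For context, the serial VAT reordering that such a GPU scheme would accelerate can be sketched as follows; the toy two-group data is an assumption and the sketch says nothing about the paper's CUDA parallelization.

```python
import numpy as np

def vat(D):
    """Visual Assessment of cluster Tendency: reorder a pairwise dissimilarity
    matrix D so that potential clusters appear as dark blocks on the diagonal."""
    n = D.shape[0]
    # start from one endpoint of the largest dissimilarity
    i, _ = np.unravel_index(np.argmax(D), D.shape)
    order = [i]
    remaining = set(range(n)) - {i}
    while remaining:
        rem = sorted(remaining)
        sub = D[np.ix_(order, rem)]          # distances from chosen to remaining
        _, col = np.unravel_index(np.argmin(sub), sub.shape)
        j = rem[col]                          # closest remaining object
        order.append(j)
        remaining.remove(j)
    order = np.array(order)
    return D[np.ix_(order, order)], order

# toy data with two obvious groups
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(3, 0.1, (5, 2))])
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
reordered, order = vat(D)
print(order)   # objects from the two groups end up contiguous in the ordering
```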

Clustering Complex Data Represented as Propositional Formulas

Clustering has been extensively studied for dealing with different kinds of data. Usually, datasets are represented as n-dimensional vectors of attributes described by numerical or nominal categorical values. Symbolic data is another concept where the objects are more complex, such as intervals, multi-categorical, or modal values. However, new applications may give rise to even more complex data describing, for example, customer desires, constraints, and preferences. Such data can be expressed more compactly using logic-based representations. In this paper, we introduce a new clustering framework where complex objects are described by propositional formulas. First, we extend the two well-known k-means and hierarchical agglomerative clustering techniques. Second, we introduce a new divisive algorithm for clustering objects represented explicitly by sets of models. Finally, we propose a propositional satisfiability based encoding of the problem of clustering propositional formulas without the need for an explicit representation of their models. Preliminary experimental results validating our proposed framework are provided.

Abdelhamid Boudane, Said Jabbour, Lakhdar Sais, Yakoub Salhi

Deep Bayesian Matrix Factorization

Matrix factorization is a popular collaborative filtering technique, assuming that the matrix of ratings can be written as the inner product of two low-rank matrices comprising latent features assigned to each user/item. Recently, several researchers have developed Bayesian treatments of matrix factorization that infer posterior distributions over the postulated user and item latent features. As has been shown, by taking uncertainty into account, such Bayesian inference approaches can better model sparse data, which are prevalent in real-world applications. In this paper, we consider replacing the inner product in the likelihood function of Bayesian matrix factorization with an arbitrary function that we learn from the data at the same time as we learn the latent feature posteriors; specifically, we parameterize the likelihood function using dense-layer (DL) deep networks. In addition, to address the cold-start problem, we also devise a model extension that takes into account item content, treated as side information. We provide extensive experimental evaluations on several real-world datasets and show that our method completely outperforms state-of-the-art alternatives, without compromising computational efficiency.

Sotirios P. Chatzis

Dynamic, Stream Data Mining

Frontmatter

Mining Competitive Pairs Hidden in Co-location Patterns from Dynamic Spatial Databases

Co-location pattern discovery is an important branch of spatial data mining. A spatial co-location pattern represents a subset of spatial features that are frequently located together in geographic space. However, some features in a co-location may benefit from the others, may merely happen to share a similar environment, or may live competitively in the same environment. In fact, much interesting knowledge has not yet been discovered; competitive pairs are one example. Competitive relationships exist widely in nature and society and are worth studying. In this paper, competitive pairs hidden in co-locations are discovered from dynamic spatial databases. First, the competitive participation index, a measure of competitive strength, is calculated. Then, the concept of a competitive pair is defined. To improve the mining of competitive pairs, a series of pruning strategies is given. These methods make it possible to discover both competitive pairs and prevalent co-location patterns efficiently. Extensive experiments evaluate the proposed methods on real and synthetic data sets, and the results show that competitive pairs are interesting and different from prevalent co-locations.

Junli Lu, Lizhen Wang, Yuan Fang, Momo Li

Utility Aware Clustering for Publishing Transactional Data

This work aims to maximise the utility of published data in the partition-based anonymisation of transactional data. We observe that, by optimising the clustering, i.e. the horizontal partitioning, the utility of published data can be significantly improved without affecting the privacy guarantees. We present a new clustering method with a specially designed distance function that considers the effect of sensitive terms in the privacy goal as part of the clustering process. In this way, when the clustering minimises the total intra-cluster distance of the partition, the utility loss is also minimised. We present two algorithms, DocClust and DetK, for clustering transactions and for determining the best number of clusters, respectively.

Michael Bewong, Jixue Liu, Lin Liu, Jiuyong Li

Self-tuning Filers — Overload Prediction and Preventive Tuning Using Pruned Random Forest

The holy grail of large, complex storage systems in enterprises today is for these systems to be self-governing. We propose a self-tuning scheme for large storage filers, on which very little work has been done in the past. Our system uses the performance counters generated by a filer to assess its health in real time and modify the workload and/or tune the system parameters to optimize the operational metrics. We use a pruned random forest to predict overload in real time; the model is run on every snapshot of counter values. A large number of trees in a random forest model has an immediate adverse effect on decision time, so a large random forest is not viable in a real-time scenario. Our solution uses a pruned random forest that performs as well as the original forest. A saliency analysis is carried out to identify components of the system that require tuning in case an overload situation is predicted. This allows us to initiate some 'action' on the bottleneck components. The 'action' we have explored in our experiments is 'throttling' the bottleneck component to prevent overload situations.

Kumar Dheenadayalan, Gopalakrishnan Srinivasaraghavan, V. N. Muralidhara
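One simple way to realize the pruning idea above is to rank the trees of a fitted scikit-learn random forest on a validation set and keep only the top-scoring trees for prediction. This is a minimal sketch under that assumption, using synthetic data in place of filer performance counters; it does not reproduce the paper's pruning criterion or saliency analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# rank individual trees by their validation accuracy and keep the top 20
scores = [accuracy_score(y_val, tree.predict(X_val)) for tree in forest.estimators_]
top = np.argsort(scores)[::-1][:20]
pruned = [forest.estimators_[i] for i in top]

def pruned_predict(trees, X):
    """Majority vote over the retained trees only (faster at decision time)."""
    votes = np.stack([t.predict(X) for t in trees])      # (n_trees, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)

print("full forest :", accuracy_score(y_val, forest.predict(X_val)))
print("pruned (20) :", accuracy_score(y_val, pruned_predict(pruned, X_val)))
```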

A Centrality-Based Local-First Approach for Analyzing Overlapping Communities in Dynamic Networks

With the increasing demand for dynamic graph data analysis, mining communities in time-evolving data has become a research hotspot. However, traditional community detection methods have efficiency issues on huge dynamic network data and rarely consider overlapping communities. In this paper, we first propose a centrality-based local-first approach for overlapping community discovery in static networks, called CBLF. Different from the traditional top-down approach, CBLF detects communities from central nodes and their neighbors, which conforms to reality better. We then present a novel evolutionary community detection approach called CBLFD based on this approach and a sequence smoothing mechanism. Experimental results on real-world and synthetic datasets demonstrate that these algorithms achieve higher accuracy and efficiency compared with state-of-the-art algorithms.

Ximan Chen, Heli Sun, Hongxia Du, Jianbin Huang, Ke Liu

Web-Scale Personalized Real-Time Recommender System on Suumo

In this paper we investigate the performance of a machine learning based recommender system with real-time log streaming on a large real-estate site, in terms of system robustness, business productivity, and algorithm performance. Our proposed recommender system, providing personalized content as opposed to item/query based recommendation, consists of a real-time log processor, an auto-scaling recommender API, and machine learning modules. The system is carefully designed to let data scientists focus on improving core algorithms and features (instead of managing distributed systems) and achieves a weekly release cycle in a production environment. On Suumo, the largest real-estate portal site in Japan, the system returns more than 99.9% of API calls successfully in real time and ultimately shows a 250% improvement in conversion rate compared to the existing recommendation. Given its flexible nature, we expect the system to be applied to various kinds of real-time recommendation in the near future.

Shiyingxue Li, Shimpei Nomura, Yohei Kikuta, Kazuma Arino

Modeling Contextual Changes in User Behaviour in Fashion e-Commerce

Impulse purchases are quite frequent in fashion e-commerce; browse patterns indicate fluid context changes across diverse product types, probably due to the lack of a well-defined need at the consumer's end. Data from our fashion e-commerce portal indicate that the final product a person ends up purchasing is often very different from the initial product he/she started the session with. We refer to this characteristic as a 'context change'. This feature of fashion e-commerce makes understanding and predicting user behaviour quite challenging. Our work attempts to model this characteristic so as to both detect and preempt context changes. Our approach employs a deep Gated Recurrent Unit (GRU) over clickstream data. We show that this model captures context changes better than other non-sequential baseline models.

Ashay Tamhane, Sagar Arora, Deepak Warrier
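To make the "GRU over clickstream data" idea concrete, here is a minimal PyTorch sketch that embeds a sequence of clicked category IDs and predicts a binary context-change label. The vocabulary size, dimensions, labels, and random data are hypothetical and are not the authors' architecture or features.

```python
import torch
import torch.nn as nn

class ClickstreamGRU(nn.Module):
    """Embeds a sequence of clicked product-category IDs and runs a GRU;
    the final hidden state predicts a binary 'context change' label."""
    def __init__(self, n_categories, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(n_categories, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, seqs):                 # seqs: (batch, seq_len) int64
        x = self.embed(seqs)                 # (batch, seq_len, emb_dim)
        _, h = self.gru(x)                   # h: (1, batch, hidden_dim)
        return self.out(h.squeeze(0))        # (batch, 1) logits

# toy training step on random data (labels are hypothetical)
model = ClickstreamGRU(n_categories=100)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
seqs = torch.randint(0, 100, (16, 10))
labels = torch.randint(0, 2, (16, 1)).float()
optim.zero_grad()
loss = loss_fn(model(seqs), labels)
loss.backward()
optim.step()
print(float(loss))
```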

Weighted Ensemble Classification of Multi-label Data Streams

Many real-world applications involve classification of multi-label data streams. However, most existing classification models focus on classifying single-label data streams. Learning in multi-label data stream scenarios is more challenging, as the classification system should be able to handle several properties, such as large data volumes, label correlations, and concept drifts. In this paper, we propose an efficient and effective ensemble model for multi-label stream classification based on ML-KNN (Multi-Label KNN) [31] and a balanced AdjustWeight function to combine the predictions, which can efficiently process high-speed multi-label stream data with concept drifts. The empirical results indicate that our approach achieves high accuracy and low storage cost, and outperforms the existing methods ML-KNN and SMART [14].

Lulu Wang, Hong Shen, Hui Tian

Novel Models and Algorithms

Frontmatter

Improving One-Class Collaborative Filtering with Manifold Regularization by Data-driven Feature Representation

When considering additional features of users or items in a recommendation system, previous work focuses mainly on manually incorporating these features into the original models. In this paper, manifold regularization is introduced to the well-known one-class collaborative filtering problem. To fully benefit from large unlabeled data, we design a data-driven framework that learns a representation function by not only transferring raw features of users or items into latent ones but also directly linking the latent features to user behaviors. The framework is expected to bring the cluster hypothesis from machine learning to recommendation; that is, more similar transferred features should imply more similar user behaviors. Experiments have been conducted on two real datasets. The results demonstrate that the representation learned through our framework can boost prediction performance significantly.

Yen-Chieh Lien, Pu-Jen Cheng

Stable Bayesian Optimization

Tuning the hyperparameters of machine learning models is important for their performance. Bayesian optimization has recently emerged as a de facto method for this task. Hyperparameter tuning is usually performed by looking at model performance on a validation set, and Bayesian optimization is used to find the hyperparameter setting corresponding to the best validation performance. However, in many cases where the training or validation set has a limited number of data points, the function representing model performance on the validation set contains several spurious sharp peaks. Bayesian optimization, in such cases, tends to converge to sharp peaks instead of more stable ones. When a model trained using these hyperparameters is deployed in the real world, its performance suffers dramatically. We address this problem through a novel stable Bayesian optimization framework. We construct a new acquisition function that helps Bayesian optimization avoid converging to sharp peaks. We conduct a theoretical analysis and guarantee that Bayesian optimization using the proposed acquisition function prefers stable peaks over unstable ones. Experiments with synthetic function optimization and hyperparameter tuning for Support Vector Machines show the effectiveness of our proposed framework.

Thanh Dai Nguyen, Sunil Gupta, Santu Rana, Svetha Venkatesh

An Exponential Time-Aware Recommendation Model for Mobile Notification Services

Mobile notifications attract users' attention with minimum interruption, and it is intriguing to study how to utilize such notifications for personal content recommendation. Recommendation for mobile notification services is nontrivial due to the following challenges: (1) a user may be bothered when receiving many irrelevant or uninteresting notifications; (2) notifications are newly produced and have no feedback before being pushed out; (3) notifications are time-sensitive and are significantly affected by the time when users receive them. To address these challenges, we propose an exponential time-aware recommendation model. First, based on traces covering 155,141 users receiving 1,464 notification messages provided by NextMedia (http://www.nextmedia.com/), we build an exponential-decay model to reflect the timeliness of notifications. Second, we design a temporal preference model to capture users' willingness to open notifications over time. Finally, we use LDA to obtain users' content preferences and incorporate the two models to provide time-varying mobile notification services. Our experimental results show that our model achieves a 15% improvement in precision over the vanilla LDA method.

Chenglin Zeng, Laizhong Cui, Zhi Wang
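The exponential-decay timeliness idea above can be illustrated with a tiny sketch; the half-life, the multiplicative combination, and the function names are illustrative assumptions, not the fitted model from the NextMedia traces.

```python
import math

def timeliness(age_hours, half_life_hours=6.0):
    """Exponential decay of a notification's value with its age.
    half_life_hours is a hypothetical parameter: after that long, the score halves."""
    rate = math.log(2) / half_life_hours
    return math.exp(-rate * age_hours)

def notification_score(content_relevance, age_hours, open_propensity):
    """Combine content relevance (e.g. an LDA topic match), timeliness, and the
    user's temporal willingness to open notifications into one ranking score."""
    return content_relevance * timeliness(age_hours) * open_propensity

print(round(timeliness(0), 3), round(timeliness(6), 3), round(timeliness(12), 3))
# 1.0, 0.5, 0.25 -- fresher notifications are worth more
```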

Discovering Periodic Patterns in Non-uniform Temporal Databases

A temporal database is a collection of transactions ordered by their timestamps. Discovering periodic patterns in temporal databases has numerous applications. However, to the best of our knowledge, no work has considered mining periodic patterns in temporal databases where items have dissimilar support and periodicity, despite the fact that this type of data is very common in real life. Discovering periodic patterns in such non-uniform temporal databases is challenging. It requires defining (i) an appropriate measure to assess the periodic interestingness of patterns, and (ii) a method to efficiently find all periodic patterns. While a pattern-growth approach can be employed for the second sub-task, the first sub-task has, to the best of our knowledge, not been addressed. Moreover, how these two tasks are combined has significant implications. In this paper, we address this challenge. We introduce a model to assess the periodic interestingness of patterns in databases having a non-uniform item distribution, which considers that periodic patterns may have different periods and minimum numbers of cyclic repetitions. Moreover, the paper introduces a pattern-growth algorithm to efficiently discover all periodic patterns. Experimental results demonstrate that the proposed algorithm is efficient and that the proposed model may be utilized to find prior knowledge about event keywords and their associations in Twitter data.

R. Uday Kiran, J. N. Venkatesh, Philippe Fournier-Viger, Masashi Toyoda, P. Krishna Reddy, Masaru Kitsuregawa
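
As a toy illustration of the kind of periodicity check involved, the hedged sketch below derives an item's inter-occurrence gaps from its timestamps and tests them against a maximum period and a minimum number of cyclic repetitions; the thresholds and the check itself are illustrative, not the interestingness measure defined in the paper.

```python
# Hedged sketch: a simple per-item periodicity check from timestamps.
def periods(timestamps, ts_min, ts_max):
    """Inter-occurrence gaps of an item, including the borders of the database."""
    pts = [ts_min] + sorted(timestamps) + [ts_max]
    return [b - a for a, b in zip(pts, pts[1:])]

def is_periodic(timestamps, ts_min, ts_max, max_period, min_repetitions):
    gaps = periods(timestamps, ts_min, ts_max)
    # Count the longest run of consecutive gaps that respect max_period
    # ("cyclic repetitions") and require at least min_repetitions of them.
    run, best = 0, 0
    for g in gaps:
        run = run + 1 if g <= max_period else 0
        best = max(best, run)
    return best >= min_repetitions

# Example: an item observed at these timestamps in a database spanning [0, 100].
print(is_periodic([10, 20, 31, 39, 50], 0, 100, max_period=12, min_repetitions=4))
```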

Discovering Both Explicit and Implicit Similarities for Cross-Domain Recommendation

Recommender systems have become one of the most important techniques for businesses today. Improving their performance requires a thorough understanding of the latent similarities among users and items. This issue is addressable given the recent abundance of datasets across domains. However, the question of how to utilize this rich cross-domain information to improve recommendation performance is still an open problem. In this paper, we propose a cross-domain recommender as the first algorithm utilizing both explicit and implicit similarities between datasets across sources for performance improvement. Validated on real-world datasets, our proposed idea outperforms current cross-domain recommendation methods by more than a factor of two. Yet the more interesting observation is that both explicit and implicit similarities between datasets help to better suggest unknown information from cross-domain sources.

Quan Do, Wei Liu, Fang Chen

Mining Recurrent Patterns in a Dynamic Attributed Graph

A great number of applications require analyzing a single attributed graph that changes over time. This task is particularly complex because both the graph structure and the attributes associated with each node can change. In the present work, we focus on the discovery of recurrent patterns in such a graph. These patterns are sequences of subgraphs which represent recurring evolutions of nodes with respect to their attributes. Various constraints have been defined and an original algorithm has been developed. Experiments performed on synthetic and real-world datasets demonstrate the usefulness of our approach and its scalability.

Zhi Cheng, Frédéric Flouvat, Nazha Selmaoui-Folcher

SS-FIM: Single Scan for Frequent Itemsets Mining in Transactional Databases

The quest for frequent itemsets in a transactional database is explored in this paper, for the purpose of extracting hidden patterns from the database. Two major limitations of the Apriori algorithm are tackled: (i) the scan of the entire database at each pass to calculate the support of all generated itemsets, and (ii) its high sensitivity to variations of the minimum support threshold defined by the user. To deal with these limitations, a novel approach called Single Scan Frequent Itemsets Mining (SS-FIM) is proposed; it requires a single scan of the transactional database to extract the frequent itemsets. Its unique feature is that it generates a fixed number of candidate itemsets, independently of the minimum support threshold, which intuitively reduces the runtime cost for large databases. SS-FIM is compared with Apriori using several standard databases. The results confirm the scalability of SS-FIM and clearly show its superiority over Apriori for medium and large databases.

Youcef Djenouri, Marco Comuzzi, Djamel Djenouri
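
A hedged sketch of the single-scan idea in miniature: subsets of bounded size are enumerated from each transaction exactly once and counted in memory, so supports are available for any minimum support threshold without rescanning the database. This toy version is not the SS-FIM algorithm itself, and its exhaustive subset enumeration is only practical for small transactions.

```python
# Toy illustration of single-scan candidate counting (not the SS-FIM algorithm).
from collections import Counter
from itertools import combinations

def single_scan_counts(transactions, max_size=3):
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            counts.update(combinations(items, k))   # count every bounded-size subset once
    return counts

def frequent_itemsets(counts, min_support):
    # Filtering uses the in-memory counts only; no second pass over the database.
    return {itemset: c for itemset, c in counts.items() if c >= min_support}

transactions = [["a", "b", "c"], ["a", "c"], ["a", "b"], ["b", "c", "d"]]
counts = single_scan_counts(transactions)
print(frequent_itemsets(counts, min_support=2))
```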

Multi-view Regularized Gaussian Processes

Gaussian processes (GPs) have been proven to be powerful tools in various areas of machine learning. However, there are very few applications of GPs in the scenario of multi-view learning. In this paper, we present a new GP model for multi-view learning. Unlike existing methods, it combines multiple views by regularizing the marginal likelihood with the consistency among the posterior distributions of latent functions from different views. Moreover, we give a general point selection scheme for multi-view learning and use this criterion to improve the proposed model. Experimental results on multiple real-world data sets verify the effectiveness of the proposed model and demonstrate the performance improvement obtained by employing this novel point selection scheme.

Qiuyang Liu, Shiliang Sun

A Neural Network Model for Semi-supervised Review Aspect Identification

Aspect identification is an important problem in opinion mining. It is usually solved in an unsupervised manner, and topic models have been widely used for the task. In this work, we propose a neural network model to identify aspects from reviews by learning their distributional vectors. A key difference of our neural network model from topic models is that we do not use multinomial word distributions but instead use embedding vectors to generate words. Furthermore, to leverage review sentences labeled with aspect words, a sequence labeler based on Recurrent Neural Networks (RNNs) is incorporated into our neural network. The resulting model can therefore learn better aspect representations. Experimental results on two datasets from different domains show that our proposed model can outperform a few baselines in terms of aspect quality, perplexity and sentence clustering results.

Ying Ding, Changlong Yu, Jing Jiang

Behavioral Data Mining

Frontmatter

Unsupervised Embedding for Latent Similarity by Modeling Heterogeneous MOOC Data

Recent years have witnessed the prosperity of Massive Open Online Courses (MOOCs). One important characteristic of MOOCs is that video clips and discussion forums are integrated into a one-stop learning setting. However, discussion forums tend to be disordered and chaotic because of their massive scale and the lack of efficient management. A technical solution is to associate MOOC forum threads with the corresponding video clips, which can be regarded as a representation learning problem. Traditional textual representations, e.g. bag-of-words (BOW), do not consider latent semantics, while recent semantic word embeddings, e.g. Word2vec, do not capture the similarity between documents, i.e. latent similarity. Learning distinguishable textual representations is therefore the key to resolving the problem. In this paper, we propose an effective approach called No-label Sequence Embedding (NOSE) which captures not only the latent semantics within words and documents, but also the latent similarity. We model multiform MOOC data in a heterogeneous textual network and learn the low-dimensional embeddings without labels. Our proposed NOSE has several advantages, e.g. it is course-agnostic and has few parameters to tune. Experimental results suggest the learned textual representation can outperform state-of-the-art unsupervised counterparts in the task of associating forum threads with video clips.

Zhuoxuan Jiang, Shanshan Feng, Weizheng Chen, Guangtao Wang, Xiaoming Li

Matrix-Based Method for Inferring Variable Labels Using Outlines of Data in Data Jackets

Data Jacket (DJ) is a technique for sharing information about data and for considering the potential value of datasets, with the data itself hidden, by describing a summary of the data in natural language. In DJs, variables are described by variable labels (VLs), which are the names/meanings of variables, and the utility of data is estimated through discussion of combinations of VLs. However, DJs do not always contain VLs, because the description rule of DJs cannot force data owners to enter all the information about their data. Due to the lack of VLs in some DJs, even if DJs are related to each other, the connection cannot be made through string matching of VLs. In this paper, we propose a method for inferring VLs in DJs whose VLs are unknown, using the texts in the outlines of DJs. We specifically focus on the similarity of the outlines of DJs and create two models for inferring VLs, one based on the similarity of the outlines and the other on the co-occurrence of VLs. The results of experiments show that our method works significantly better than the method using only string matching of VLs.

Teruaki Hayashi, Yukio Ohsawa
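
A hedged sketch of the outline-similarity model: candidate VLs for a DJ with unknown VLs are scored by the TF-IDF cosine similarity of its outline to the outlines of DJs whose VLs are known, with similarity-weighted voting. The co-occurrence model mentioned above is not reproduced, and the example data is made up.

```python
# Hedged sketch: infer variable labels (VLs) from outline-text similarity.
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def infer_vls(target_outline, known_outlines, known_vls, top_k=5):
    vec = TfidfVectorizer()
    X = vec.fit_transform(known_outlines + [target_outline])
    sims = cosine_similarity(X[-1], X[:-1]).ravel()
    scores = defaultdict(float)
    for sim, vls in zip(sims, known_vls):
        for vl in vls:
            scores[vl] += sim  # similarity-weighted vote from each known DJ
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

known_outlines = ["daily temperature and rainfall records by city",
                  "retail sales per store with weather conditions"]
known_vls = [["date", "city", "temperature", "rainfall"],
             ["date", "store", "sales", "weather"]]
print(infer_vls("hourly weather observations at airports", known_outlines, known_vls))
```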

Integrating Reviews into Personalized Ranking for Cold Start Recommendation

The item recommendation task predicts a personalized ranking over a set of items for each individual user. One paradigm is rating-based methods, which concentrate on explicit feedback and hence face difficulties in collecting it. Meanwhile, ranking-based methods are presented with rated items and then rank the rated above the unrated. This paradigm uses widely available implicit feedback but usually ignores an important source of information: item reviews. Item reviews not only justify the preferences of users, but also help alleviate the cold-start problem that defeats collaborative filtering. In this paper, we propose two novel and simple models to integrate item reviews into matrix factorization based Bayesian personalized ranking (BPR-MF). In each model, we make use of text features extracted from item reviews via word embeddings. On top of the text features we uncover the review dimensions that explain the variation in users’ feedback, and these review factors represent a prior preference of a user. Experiments on real-world data sets show the benefits of leveraging item reviews for ranking prediction. We also conduct analyses to understand the proposed models.

Guang-Neng Hu, Xin-Yu Dai
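
A hedged sketch of a plain BPR-MF stochastic gradient step on one (user, positive item, negative item) triple, i.e., the base model the paper extends; the proposed models additionally derive prior user/item preferences from review-text features, which is not shown here.

```python
# Plain BPR-MF update on one (user u, positive item i, negative item j) triple.
import numpy as np

def bpr_step(U, V, u, i, j, lr=0.05, reg=0.01):
    """U: user factors (n_users x d), V: item factors (n_items x d)."""
    wu, hi, hj = U[u].copy(), V[i].copy(), V[j].copy()
    x_uij = wu @ (hi - hj)                 # preference of i over j
    g = 1.0 / (1.0 + np.exp(x_uij))        # sigmoid(-x_uij), the gradient weight
    U[u] += lr * (g * (hi - hj) - reg * wu)
    V[i] += lr * (g * wu - reg * hi)
    V[j] += lr * (-g * wu - reg * hj)

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(100, 16))   # 100 users, 16 latent factors
V = rng.normal(scale=0.1, size=(500, 16))   # 500 items
bpr_step(U, V, u=3, i=42, j=7)              # user 3 interacted with item 42, not with 7
```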

Taste or Addiction?: Using Play Logs to Infer Song Selection Motivation

Online music services are increasing in popularity. They enable us to analyze people’s music listening behavior based on play logs. Although it is known that people listen to music based on topic (e.g., rock or jazz), we assume that when a user is addicted to an artist, s/he chooses the artist’s songs regardless of topic. Based on this assumption, in this paper, we propose a probabilistic model to analyze people’s music listening behavior. Our main contributions are three-fold. First, to the best of our knowledge, this is the first study modeling music listening behavior by taking into account the influence of addiction to artists. Second, by using real-world datasets of play logs, we showed the effectiveness of our proposed model. Third, we carried out qualitative experiments and showed that taking addiction into account enables us to analyze music listening behavior from a new viewpoint in terms of how people listen to music according to the time of day, how an artist’s songs are listened to by people, etc. We also discuss the possibility of applying the analysis results to applications such as artist similarity computation and song recommendation.

Kosetsu Tsukuda, Masataka Goto

Understanding Drivers’ Safety by Fusing Large Scale Vehicle Recorder Dataset and Heterogeneous Circumstantial Data

We present a method of analyzing the relationships between driver characteristics and driving behaviors based on fusing heterogeneous data sources with large-scale vehicle recorder data. It can be used, for example, by fleet managers to classify drivers by their skill level, safety, physical/mental fatigue, aggressiveness, and so on. Previous studies relied on precise data obtained only in critical driving situations and did not consider their circumstances, such as road width and weather. In contrast, our approach takes into account not only large-scale (over 100 fleet drivers) and long-term (one year’s worth) records of driving operations, but also their circumstances. In this study, we focused on classifying drivers by their accident history and examined the correlation between having an accident and driving behavior. Our method was able to reliably predict whether a driver had recently experienced an accident (F-measure = 72%) by taking into account both circumstantial information and velocity at the same time. This level of performance cannot be achieved using only the drivers’ demographic information or the kinematic variables of operation records.

Daisaku Yokoyama, Masashi Toyoda, Masaru Kitsuregawa

Graph Clustering and Community Detection

Frontmatter

Query-oriented Graph Clustering

Many tasks, including diversified ranking and social circle discovery, focus on the relationships between data items as well as their relevance to a query. These applications are in fact instances of query-oriented clustering. In this paper, we first formulate the query-oriented clustering problem in a general form and propose two measures, query-oriented normalized cut (QNCut) and cluster balance, to evaluate the results of query-oriented clustering. We develop a model, query-oriented graph clustering (QGC), that combines QNCut and the balance constraint based on cluster balance in a quadratic form. In the experiments, we show that QGC achieves promising improvements in query-oriented clustering and social circle discovery.

Li-Yen Kuo, Chung-Kuang Chou, Ming-Syan Chen

CCCG: Clique Conversion Ratio Driven Clustering of Graphs

Networks have become ubiquitous in many real-world applications, and clustering similar networks is an important problem. There are various properties of graphs such as clustering coefficient (CC), density, arboricity, etc. We introduce a measure, Clique Conversion Coefficient (CCC), which captures the clique-forming tendency of nodes in an undirected graph. CCC can be used either as a weighted average of the values in a vector or as the vector itself. Our experiments show that CCC provides additional information about a graph compared to related measures like CC and density. We cluster real-world graphs using a combination of the features CCC, CC, and density and show that without CCC as one of the features, graphs with similar clique-forming tendencies are not clustered together. Clustering with CCC has applications in areas such as social network analysis and protein-protein interaction analysis, where cliques play an important role. We cluster the ego networks of the YOUTUBE network using the values in the CCC vector as features. The quality of the clustering is analyzed by contrasting the frequent subgraphs in each cluster. The results highlight the utility of CCC in clustering subgraphs of a large graph.

Prathyush Sambaturu, Kamalakar Karlapalem

Mining Cohesive Clusters with Interpretations in Labeled Graphs

In recent years, community detection on plain graphs has been widely studied. With the proliferation of available data, each user in a network is usually associated with additional attributes that describe it in more detail. However, many existing methods focus only on the topological structure and fail to deal with node-attributed networks. These approaches cannot extract clear semantic meanings for the detected communities. In this paper, we combine the topological structure and attribute information into a unified process and propose a novel algorithm to detect overlapping semantic communities. The proposed algorithm is divided into three phases. Firstly, we detect local semantic subcommunities from each node’s perspective using a greedy strategy. Then, a supergraph consisting of all these subcommunities is created. Finally, we find global semantic communities on the supergraph. Experimental results on real-world datasets show the efficiency and effectiveness of our approach against other state-of-the-art methods.

Hongxia Du, Heli Sun, Jianbin Huang, Zhongbin Sun, Liang He, Hong Cheng

A SAT-Based Framework for Overlapping Community Detection in Networks

In this paper, we propose a new approach to detect overlapping communities in large complex networks. We first introduce a parametrized notion of a community, called a k-linked community, allowing us to characterize node/edge-centered k-linked communities with bounded diameter. Such a community contains a node or an edge at distance at most k/2 from every other node of that community. Next, we show how the problem of detecting node/edge-centered k-linked overlapping communities can be expressed as a Partial Max-SAT optimization problem. Then, we propose a post-processing strategy to limit the overlaps between communities. An extensive experimental evaluation on real-world networks shows that our approach outperforms several popular algorithms in detecting relevant communities.

Said Jabbour, Nizar Mhadhbi, Badran Raddaoui, Lakhdar Sais

Dimensionality Reduction

Frontmatter

Denoising Autoencoder as an Effective Dimensionality Reduction and Clustering of Text Data

Deep learning methods are widely used in vision and face recognition; however, such methods are rarely applied to text data. In this context, the data is often represented by a sparse, high-dimensional document-term matrix. To deal with such data matrices, we present in this paper a new denoising auto-encoder for dimensionality reduction, where each document is affected not only by its own information, but also by the information from its neighbors according to the cosine similarity measure. It turns out that the proposed auto-encoder can discover low-dimensional embeddings, and as a result reveal the underlying effective manifold structure. The visual representation of these embeddings suggests the suitability of performing clustering on the set of documents using the Expectation-Maximization algorithm for Gaussian mixture models. On real-world datasets, the relevance of the presented auto-encoder for visualisation and document clustering is shown by a comparison with five widely used unsupervised dimensionality reduction methods, including the classic auto-encoder.

Milad Leyli-Abadi, Lazhar Labiod, Mohamed Nadif
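
A hedged sketch of the overall pipeline: a vanilla denoising auto-encoder trained on a TF-IDF document-term matrix, followed by Gaussian mixture clustering of the learned embeddings. Note that the paper corrupts each document with information from its cosine-similar neighbors; the sketch substitutes simple random masking noise, and all sizes and documents below are toy values.

```python
# Hedged sketch: vanilla denoising auto-encoder on TF-IDF, then GMM clustering.
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture

docs = ["stock markets fell sharply", "the team won the final match",
        "bond yields rose on inflation", "the striker scored twice"]
X = torch.tensor(TfidfVectorizer().fit_transform(docs).toarray(), dtype=torch.float32)

d_in, d_hid = X.shape[1], 2
model = nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU(), nn.Linear(d_hid, d_in))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for _ in range(500):
    noisy = X * (torch.rand_like(X) > 0.2)      # masking-noise corruption (a simplification)
    loss = nn.functional.mse_loss(model(noisy), X)
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    emb = model[0](X).numpy()                    # low-dimensional document embeddings
labels = GaussianMixture(n_components=2, random_state=0).fit_predict(emb)
print(labels)
```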

Gradable Adjective Embedding for Commonsense Knowledge

Adjective understanding is crucial for answering qualitative or subjective questions, such as “is New York a big city”, yet it has not been studied as thoroughly as answering factoid questions. Our goal is to project adjectives into a continuous distributional space, which enables answering not only the qualitative question above, but also comparative ones, such as “is New York bigger than San Francisco?”. As a basis, we build on the probabilities P(New York | big city) and P(Boston | big city) observed in Hearst patterns from a large Web corpus (as captured in a probabilistic knowledge base such as Probase). From this base model, we observe that this probability predicts the graded score of an adjective well, but only for “head entities” with sufficient observations. However, the observations of a city are scattered across many adjectives: cities are described with 194 adjectives in Probase, and, on average, only 2% of cities are sufficiently observed in adjective-modified concepts. Our goal is to train a distributional model such that any entity can be associated with any adjective by its distance from the vector of the ‘big city’ concept. To overcome sparsity, we learn highly synonymous adjectives, such as big and huge cities, to improve prediction accuracy. We validate our findings with real-world knowledge bases.

Kyungjae Lee, Hyunsouk Cho, Seung-won Hwang
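
A hedged sketch of the distributional scoring idea: entities are ranked against an adjective-modified concept (e.g., ‘big city’) by the cosine similarity of their vectors to the concept vector. The toy vectors below are fabricated for illustration; the paper learns its embeddings from Probase/Hearst-pattern statistics rather than using off-the-shelf word vectors.

```python
# Hedged sketch: rank entities against an adjective-modified concept vector.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical word vectors (in practice these would come from a trained model).
vectors = {
    "big":      np.array([0.9, 0.1, 0.0]),
    "city":     np.array([0.1, 0.9, 0.2]),
    "new_york": np.array([0.8, 0.8, 0.1]),
    "boston":   np.array([0.5, 0.7, 0.3]),
}

concept = (vectors["big"] + vectors["city"]) / 2       # 'big city' concept vector
for entity in ("new_york", "boston"):
    print(entity, round(cosine(vectors[entity], concept), 3))
```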

Combining Dimensionality Reduction with Random Forests for Multi-label Classification Under Interactivity Constraints

Learning from multi-label data in an interactive framework is a challenging problem, as algorithms must satisfy additional constraints: in particular, learning from few training examples in a limited time. A recent study of multi-label classifier behaviors in this context identified the potential of the ensemble method “Random Forest of Predictive Clustering Trees” (RF-PCT). However, RF-PCT shows degraded computation time for large feature spaces. To overcome this limit, this paper proposes a new hybrid multi-label learning approach, IDSR-RF (Independent Dual Space Reduction with RF-PCT), which first reduces the data dimension and then learns a predictive regression model in the reduced spaces with RF-PCT. The feature and label spaces are independently reduced using the fast matrix factorization algorithm Gravity. Experimental results on nine high-dimensional datasets show that IDSR-RF significantly reduces the computation time without deteriorating learning performance. To the best of our knowledge, it is currently the most promising learning approach for an interactive multi-label learning system.

Noureddine-Yassine Nair-Benrekia, Pascale Kuntz, Frank Meyer
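
A hedged sketch of the pipeline shape described above: the feature and label spaces are reduced independently, a regressor is trained in the reduced spaces, and predictions are mapped back to the label space and thresholded. TruncatedSVD stands in for the Gravity factorization and scikit-learn's RandomForestRegressor for RF-PCT; both substitutions and the 0.5 threshold are assumptions.

```python
# Hedged sketch of a dual-space-reduction multi-label pipeline (with substitutions).
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestRegressor

X, Y = make_multilabel_classification(n_samples=300, n_features=100,
                                      n_classes=20, random_state=0)

feat_red = TruncatedSVD(n_components=20, random_state=0)   # stand-in for Gravity
lab_red = TruncatedSVD(n_components=5, random_state=0)     # stand-in for Gravity
Xr, Yr = feat_red.fit_transform(X), lab_red.fit_transform(Y)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xr, Yr)  # stand-in for RF-PCT

# Predict in the reduced label space, reconstruct label scores, then threshold.
Y_scores = lab_red.inverse_transform(rf.predict(feat_red.transform(X)))
Y_pred = (Y_scores >= 0.5).astype(int)
print(Y_pred.shape)
```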

A Generalized Model for Multidimensional Intransitivity

Intransitivity is a critical issue in pairwise preference modeling. It refers to intransitive pairwise preferences between a group of players or objects that potentially form a cyclic preference chain, and it has long been discussed in social choice theory in the context of the dominance relationship. However, such multifaceted intransitivity between players, and the corresponding high-dimensional player representations, are difficult to capture. In this paper, we propose a probabilistic model that jointly learns a d-dimensional representation (d > 1) for each player and a dataset-specific metric space that systematically captures the distance metric in R^d over the embedding space. Interestingly, by imposing additional constraints in the metric space, our proposed model degenerates to earlier models used in intransitive representation learning. Moreover, we present an extensive quantitative investigation of the wide existence of intransitive relationships between objects in various real-world benchmark datasets. To the best of our knowledge, this investigation is the first of its kind. The predictive performance of our proposed method on various real-world datasets, including social choice, election, and online game datasets, shows that it outperforms several competing methods in terms of prediction accuracy.

Jiuding Duan, Jiyi Li, Yukino Baba, Hisashi Kashima
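
A hedged sketch of a generic intransitive pairwise-preference model: each player gets a scalar rating plus two d-dimensional vectors, and the antisymmetric interaction term lets the model express cyclic (rock-paper-scissors) preferences that a rating alone cannot. This is a generic illustration of multidimensional intransitive modeling, not the authors' metric-space formulation.

```python
# Generic intransitive pairwise model: rating + antisymmetric vector interaction,
# trained with SGD on the logistic loss over match outcomes.
import numpy as np

rng = np.random.default_rng(0)
n, d, lr, reg = 3, 2, 0.1, 0.01
r = np.zeros(n)                          # scalar ratings
U = rng.normal(scale=0.1, size=(n, d))   # "attack" vectors
V = rng.normal(scale=0.1, size=(n, d))   # "defense" vectors

# Cyclic outcomes: 0 beats 1, 1 beats 2, 2 beats 0 (intransitive by design).
matches = [(0, 1), (1, 2), (2, 0)] * 200

for i, j in matches:                     # player i beat player j
    s = r[i] - r[j] + U[i] @ V[j] - U[j] @ V[i]
    g = 1.0 / (1.0 + np.exp(s))          # gradient weight of -log sigmoid(s)
    r[i] += lr * (g - reg * r[i]); r[j] -= lr * (g + reg * r[j])
    Ui, Uj, Vi, Vj = U[i].copy(), U[j].copy(), V[i].copy(), V[j].copy()
    U[i] += lr * (g * Vj - reg * Ui); V[j] += lr * (g * Ui - reg * Vj)
    U[j] -= lr * (g * Vi + reg * Uj); V[i] -= lr * (g * Uj + reg * Vi)

prob = lambda i, j: 1 / (1 + np.exp(-(r[i] - r[j] + U[i] @ V[j] - U[j] @ V[i])))
print(round(prob(0, 1), 2), round(prob(1, 2), 2), round(prob(2, 0), 2))
```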

Backmatter
