About this book

This book constitutes the thoroughly refereed post-workshop proceedings of the workshops that were held in conjunction with the 23rd Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2019, in Macau, China, in April 2019.

The 31 revised papers presented were carefully reviewed and selected from a total of 52 submissions. They stem from the following workshops:

· PAISI 2019: 14th Pacific Asia Workshop on Intelligence and Security Informatics

· WeL 2019: PAKDD 2019 Workshop on Weakly Supervised Learning: Progress and Future

· LDRC 2019: PAKDD 2019 Workshop on Learning Data Representation for Clustering

· BDM 2019: 8th Workshop on Biologically-inspired Techniques for Knowledge Discovery and Data Mining

· DLKT 2019: 1st Pacific Asia Workshop on Deep Learning for Knowledge Transfer

Table of Contents


The 14th Pacific Asia Workshop on Intelligence and Security Informatics (PAISI 2019)


A Supporting Tool for IT System Security Specification Evaluation Based on ISO/IEC 15408 and ISO/IEC 18045

In the evaluation and certification framework based on ISO/IEC 15408 and ISO/IEC 18045, a Security Target, which contains the specifications of all security functions of the target system, is the most important document. Evaluation of Security Targets must be performed as the first step of the whole evaluation process. However, evaluating Security Targets based on ISO/IEC 15408 and ISO/IEC 18045 is very complex: the process involves many tasks and costs a lot of time when performed by humans, and it is difficult to ensure that the evaluation is fair and free of subjective mistakes. These issues not only consume a lot of time but may also affect the correctness, accuracy, and fairness of the evaluation results. It is therefore necessary to provide a supporting tool that automates all tasks related to the evaluation process, improving the quality of the results while reducing the complexity of the evaluators’ and certifiers’ work. However, no such supporting tool has existed until now. This paper proposes a supporting tool, called Security Target Evaluator, that provides comprehensive facilities to support the whole process of evaluating Security Targets based on ISO/IEC 15408 and ISO/IEC 18045.
Da Bao, Yuichi Goto, Jingde Cheng

An Investigation on Multi View Based User Behavior Towards Spam Detection in Social Networks

Online social networks have become immensely vulnerable to spammers, who spread malicious content and links. Understanding user behavior across multiple features is essential for successful spam detection. The majority of existing methods rely on a single view of information, which diversified spam behaviors can evade; as a result, multi-view solutions are emerging. Based on homophily theory, we define the hypothesis that a spammer’s behavior should be inconsistent across multiple views compared to a legitimate user’s. We investigated the consistency of users’ content interest and popularity over multiple topics across multiple views. The results confirm a notable difference in average similarity between legitimate and spam users, showing that legitimate user behavior is consistent across multiple views while spammer behavior is not. This indicates that consistency of user behavior across multiple views can be used for spam detection.
Darshika Koggalahewa, Yue Xu

A Cluster Ensemble Strategy for Asian Handicap Betting

Football betting has grown rapidly in the past two decades, with fixed odds betting and Asian handicap betting being the most popular mechanisms. Most previous research focuses on fixed odds betting, while Asian handicap betting remains under-studied. In this paper, we focus on Asian handicap betting and aim to propose an intelligent decision system that can form a betting strategy. To achieve this, a cluster ensemble model is presented, based on the observation that matches with similar patterns in their expected-goal trend series may have the same actual outcome. First, we set up component clusterers that group matches by the pattern of their expected-goal trend series and then make the same betting decision for all matches in the same group. We then adopt a plurality voting approach to integrate the component clusterers and determine the final betting strategy. Applied to data from the big five European football leagues, this strategy yields a positive return.
Yue Chen, Jian Shi
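The abstract's final step, combining the component clusterers' betting decisions by plurality voting, can be sketched as below. This is an illustrative sketch only, not the authors' implementation; the decision labels ("home", "away", "abstain") are hypothetical placeholders.

```python
from collections import Counter

def plurality_vote(component_decisions):
    """Combine betting decisions from several component clusterers.

    component_decisions: the per-component decisions for one match,
    e.g. ["home", "home", "away"]. Returns the most common decision.
    """
    counts = Counter(component_decisions)
    decision, _ = counts.most_common(1)[0]
    return decision

# Each component clusterer groups a match by its expected-goal trend
# pattern and proposes a bet; the ensemble takes the plurality vote.
votes = ["home", "away", "home", "home", "abstain"]
print(plurality_vote(votes))  # -> home
```

Plurality voting needs no calibration of the individual components, which is why it is a common default for cluster ensembles.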

Designing an Integrated Intelligence Center: New Taipei City Police Department as an Example

The rapid advancement and prevalence of Internet technology was the biggest development of the 20th century. Because criminals use the Internet to commit crimes, police work has needed to shift from traditional methods to modern technology. As investigation strategies have evolved, sharable databases and integrated intelligence have become increasingly important. By maintaining data, information professionals play a crucial role in police departments, and an integrated, sharable system is essential for supporting police work.
New Taipei City is one of six main cities in Taiwan and has the largest population. Social order and traffic are the city’s greatest issues. The New Taipei City Police Department (NTPD) uses integrated resources and new technology to help first-line police and investigators focus on these issues. To solve complicated problems, an integrated intelligence center (IIC) was designed to provide needed data, help team members analyze information, and guide users in searching information systems.
The IIC’s services have successfully supported social maintenance tasks such as covering celebration events and city elections. The support of IIC team members has been recognized with Outstanding Government Employee and Best Contribution awards from the Taiwanese government. This paper describes the IIC’s structure, the innovative approach behind its design, and how it functions.
Chun-Young Chang, Lin-Chien Chien, Yuh-Shyan Hwang

Early Churn User Classification in Social Networking Service Using Attention-Based Long Short-Term Memory

Social networking services (SNSs) see much early churn of new users. SNSs can provide effective interventions by identifying potential early churn users and the important factors leading to early churn. A long short-term memory (LSTM) model, whose input is the user behavior event sequence binned at constant intervals, has been proposed for this purpose and classifies early churn users better than previous machine learning models. We hypothesized that the importance of each temporal part of the event sequence differs for classifying early churn users, because user behavior is known to consist of coarse and dense parts and initial behavior influences long-term behavior. To address this, we propose an attention-based LSTM for classifying early churn users. In an experiment conducted on RoomClip, a general SNS, the proposed model achieved higher classification performance than baseline models, confirming its effectiveness. We also analyzed the importance of each temporal part and each event, revealing that the initial temporal part and users’ actions have high importance for classifying early churn users. These results should contribute to providing effective interventions for preventing early churn.
Koya Sato, Mizuki Oka, Kazuhiko Kato
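The attention mechanism described above, weighting each time bin's hidden state so that the classifier can focus on the informative (e.g. initial) parts of the sequence, reduces to a softmax pooling step. The following is a minimal numerical sketch of that step only, with made-up hidden states; the actual model learns the scores end-to-end.

```python
import math

def attention_pool(hidden_states, scores):
    """Weight per-time-bin hidden states by attention scores.

    hidden_states: list of equal-length vectors, one per time bin.
    scores: one raw (unnormalized) relevance score per time bin.
    Returns (weights, context): weights = softmax(scores), and
    context is the attention-weighted sum of the hidden states.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(hidden_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, hidden_states))
               for i in range(dim)]
    return weights, context

# Three time bins; the first (initial behavior) scores highest,
# so it dominates the pooled representation fed to the classifier.
h = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w, c = attention_pool(h, [2.0, 0.5, 0.5])
```

The learned weights double as an importance measure per temporal part, which is how the abstract's importance analysis can be read off the model.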

PAKDD 2019 Workshop on Weakly Supervised Learning: Progress and Future (WeL 2019)


Weakly Supervised Learning by a Confusion Matrix of Contexts

Considering context can provide more background and related information for weakly supervised learning. Including less-documented historical and environmental context in research on diabetes amongst Pima Indians uncovered explanations more plausible than those of many medical studies: some Pima Indians had much higher rates of diabetes than Caucasians primarily because of historical, environmental and social causes rather than specific genetic patterns or ethnicity.
If historical and environmental factors act as external contexts when not included in a research dataset, some forms of internal context may also exist inside the dataset without being declared. This paper discusses a context construction model that transforms a confusion matrix into a matrix of categorical, incremental and correlational context, emulating a kind of internal context in which to search for more informative patterns, in order to improve weakly supervised learning from limited labeled samples for unlabeled data.
When the negative and positive labeled samples and the misclassification errors are compared to “happy families” and “unhappy families”, the contexts constructed by this model in our classification experiments reflected the Anna Karenina principle well: “Happy families are all alike; every unhappy family is unhappy in its own way.” This is an encouraging sign for further exploring contexts associated with harmonizing patterns and divisive causes for knowledge discovery in a world of uncertainty.
William Wu

Learning a Semantic Space for Modeling Images, Tags and Feelings in Cross-Media Search

This paper contributes FB5K, a new real-world web image dataset for cross-media retrieval. FB5K has the following attributes: (1) 5130 images crawled from Facebook; (2) images categorized according to users’ feelings; (3) search based on feelings rather than on text, making it independent of language. Furthermore, we propose a novel approach that uses Optical Character Recognition (OCR) and explicitly incorporates high-level semantic information. We comprehensively compare the performance of four different subspace-learning methods and three modified versions of the Correspondence Auto Encoder (Corr-AE), alongside numerous text features and similarity measurements, across Wikipedia, Flickr30k and FB5K. To examine the characteristics of FB5K, we propose a semantic-based cross-media retrieval method, introducing a new similarity measurement in the embedded space that significantly improves system performance compared with the conventional Euclidean distance. Our experimental results demonstrate the efficiency of the proposed retrieval method on three different public datasets.
Sadaqat ur Rehman, Yongfeng Huang, Shanshan Tu, Basharat Ahmad

Adversarial Active Learning in the Presence of Weak and Malicious Oracles

We present a robust active learning technique for situations where there are weak and adversarial oracles. Our work falls under the general umbrella of active learning, in which training data is insufficient and oracles are queried to supply labels for the most informative samples to expand the training set. On top of that, we consider problems where a large percentage of oracles may be strategically lying, as in adversarial settings. We present an adversarial active learning technique that explores the duality between oracle modeling and data modeling. We demonstrate on real datasets that our technique is superior not only to the heuristic majority-voting technique but also to one of the state-of-the-art adversarial crowdsourcing techniques, the Generative model of Labels, Abilities, and Difficulties (GLAD), when genuine oracles are outnumbered by weak and malicious oracles, and even in the extreme cases where all the oracles are either weak or malicious. To put our technique under more rigorous tests, we compare our adversarial active learner to the ideal active learner that always receives correct labels, and demonstrate that our technique is as effective as the ideal active learner when only one third of the oracles are genuine.
Yan Zhou, Murat Kantarcioglu, Bowei Xi

The Most Related Knowledge First: A Progressive Domain Adaptation Method

In domain adaptation, how to select and transfer related knowledge is critical for learning. Inspired by the fact that humans usually transfer from more related experience to less related experience, in this paper we propose a novel progressive domain adaptation (PDA) model, which transfers source knowledge in an order based on relevance. Specifically, PDA transfers source instances iteratively, from the most related to the least related, until all related source instances have been adopted. In this iterative learning process, the source instances adopted in each iteration are determined by a gradually annealed weight, so that later iterations introduce more source instances. Further, reverse classification performance is used to set the termination of the iteration. Experiments on real datasets demonstrate the competitiveness of PDA compared with the state of the art.
Yunyun Wang, Dan Zhao, Yun Li, Kejia Chen, Hui Xue

Learning Data Representation for Clustering (LDRC 2019)


Deep Architectures for Joint Clustering and Visualization with Self-organizing Maps

Recent research has demonstrated how deep neural networks are able to learn representations to improve data clustering. By considering representation learning and clustering as a joint task, models learn clustering-friendly spaces and achieve superior performance, compared with standard two-stage approaches where dimensionality reduction and clustering are performed separately. We extend this idea to topology-preserving clustering models, known as self-organizing maps (SOM). First, we present the Deep Embedded Self-Organizing Map (DESOM), a model composed of a fully-connected autoencoder and a custom SOM layer, where the SOM code vectors are learnt jointly with the autoencoder weights. Then, we show that this generic architecture can be extended to image and sequence data by using convolutional and recurrent architectures, and present variants of these models. First results demonstrate advantages of the DESOM architecture in terms of clustering performance, visualization and training time.
Florent Forest, Mustapha Lebbah, Hanane Azzag, Jérôme Lacaille

Deep Cascade of Extra Trees

Deep neural networks have recently become popular because of their success in domains such as image and speech recognition, which has led many to wonder whether other learners could benefit from deep, layered architectures. In this paper, we propose the Deep Cascade of Extra Trees (DCET) model. Representation learning in deep neural networks mostly relies on the layer-by-layer processing of raw features. Inspired by this, DCET uses a deep cascade of decision forests, where the cascade at each level receives the best feature information produced by the cascade of forests at the preceding level. Experiments show that its performance is quite robust to hyper-parameter settings; in most cases, even across datasets from different domains, it achieves excellent performance using the same default setting.
Abdelkader Berrouachedi, Rakia Jaziri, Gilles Bernard

Algorithms for an Efficient Tensor Biclustering

Consider a data set collected as (individual, feature) pairs at different times. It can be represented as a three-dimensional tensor (individuals, features and times). The tensor biclustering problem computes a subset of individuals and a subset of features whose signal trajectories over time lie in a low-dimensional subspace, modeling similarity among the signal trajectories while allowing different scalings across individuals or features. Our algorithms are based on spectral decomposition to build the desired biclusters. We evaluate the quality of the results of each algorithm on both synthetic and real data sets.
Dina Faneva Andriantsiory, Mustapha Lebbah, Hanane Azzag, Gael Beck

Change Point Detection in Periodic Panel Data Using a Mixture-Model-Based Approach

This paper describes a novel method for common change detection in panel data emanating from smart electricity and water networks. The proposed method relies on a representation of the data by classes whose probabilities of occurrence evolve over time. These dynamics are assumed to be piecewise periodic due to the cyclic nature of the studied data, which allows the detection of change points. Our strategy is based on a hierarchical mixture of t-distributions, which provides some robustness properties. Parameter estimation is performed using an incremental strategy, which has the advantage of allowing the processing of large datasets. The experiments carried out on realistic data show the full relevance of the proposed method.
Allou Samé, Milad Leyli-Abadi

The 8th Workshop on Biologically-Inspired Techniques for Knowledge Discovery and Data Mining (BDM 2019)


Neural Network-Based Deep Encoding for Mixed-Attribute Data Classification

This paper proposes a neural network-based deep encoding (DE) method for mixed-attribute data classification. The DE method first uses the existing one-hot encoding (OE) method to encode the discrete-attribute data. Second, it trains an improved neural network to classify the OE-attribute data corresponding to the discrete-attribute data. The loss function of the improved neural network not only includes the training error but also considers the uncertainty of the hidden-layer output matrix (i.e., the DE-attribute data), where the uncertainty is calculated with the re-substitution entropy. Third, the classification task is conducted on the combination of the original continuous-attribute data and the transformed DE-attribute data. Finally, we compare the DE method with the OE method by training a support vector machine (SVM) and a deep neural network (DNN) on 4 KEEL mixed-attribute data sets. The experimental results demonstrate the feasibility and effectiveness of the DE method and show that it helps SVM and DNN obtain better classification accuracies than the traditional OE method.
Tinglin Huang, Yulin He, Dexin Dai, Wenting Wang, Joshua Zhexue Huang
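The first step of the pipeline above, one-hot encoding the discrete attributes of a mixed-attribute record while passing continuous attributes through, can be sketched as follows. This is a generic illustration of the OE baseline step, not the paper's DE network; the attribute names and categories are invented for the example.

```python
def one_hot_encode(value, categories):
    """One-hot encode a single discrete attribute value."""
    return [1.0 if value == c else 0.0 for c in categories]

def encode_mixed(record, discrete_schema):
    """Encode a mixed-attribute record: continuous values pass
    through unchanged, discrete values expand to one-hot vectors.

    record: dict mapping attribute name -> value (insertion-ordered).
    discrete_schema: dict mapping each discrete attribute to its
    ordered list of possible categories.
    """
    out = []
    for attr, value in record.items():
        if attr in discrete_schema:
            out.extend(one_hot_encode(value, discrete_schema[attr]))
        else:
            out.append(float(value))
    return out

rec = {"age": 42, "colour": "red"}
schema = {"colour": ["red", "green", "blue"]}
print(encode_mixed(rec, schema))  # -> [42.0, 1.0, 0.0, 0.0]
```

In the paper's method, the one-hot block of this vector is then replaced by the hidden-layer (DE) representation of the trained network before classification.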

Protein Complexes Detection Based on Deep Neural Network

Protein complexes play an important role in helping scientists explore the secrets of the cell and of life. Most existing protein complex detection methods apply traditional clustering algorithms to protein-protein interaction (PPI) networks. However, due to the complexity of the network structure, traditional clustering methods cannot capture the network information effectively, so extracting information from high-dimensional networks has become a challenge. In this paper, we propose a novel protein complex detection method called DANE, which uses a deep neural network to retain the primary information. We use a deep autoencoder framework to implement the embedding process, which preserves the network structure and additional biological information, and then apply a clustering method based on the core-attachment principle to obtain the prediction result. Experiments on six yeast datasets against five other detection methods show that our method achieves better performance.
Xianchao Zhang, Peixu Gao, Maohua Sun, Linlin Zong, Bo Xu

Predicting Auction Price of Vehicle License Plate with Deep Residual Learning

Due to superstition, license plates with desirable combinations of characters are highly sought after in China, fetching prices that can reach into the millions in government-held auctions. Despite the high stakes involved, there has been essentially no attempt to provide price estimates for license plates. We present an end-to-end neural network model that simultaneously predicts the auction price, gives the distribution of prices and produces latent feature vectors. While both types of neural network architecture we consider outperform simpler machine learning methods, convolutional networks outperform recurrent networks at comparable training time or model complexity. The resulting model powers our online price estimator and search engine.
Vinci Chow

Mining Multispectral Aerial Images for Automatic Detection of Strategic Bridge Locations for Disaster Relief Missions

We propose in this paper an image mining technique based on multispectral aerial images for automatic detection of strategic bridge locations for disaster relief missions. Bridges are key landmarks whose detection from aerial images is of vital importance in disaster management and relief missions. UAVs have been increasingly used in recent years for relief missions during natural disasters such as floods and earthquakes, generating huge amounts of multispectral aerial images. Our multi-stage method utilizes these multispectral aerial images to identify patterns for effective mining of bridge locations. Experiments on real-world and synthetic images demonstrate the effectiveness of the proposed method, showing that it is 40% faster than existing Automatic Target Recognition (ATR) systems and can achieve 95% accuracy. We believe our technique can help accelerate and enhance the effectiveness of relief missions carried out during disasters.
Hafiz Suliman Munawar, Ji Zhang, Hongzhou Li, Deqing Mo, Liang Chang

Spike Sorting with Locally Weighted Co-association Matrix-Based Spectral Clustering

Spike sorting of neuron recordings is one of the core tasks in brain function studies. Spike sorting typically consists of spike detection, feature extraction and clustering. Most clustering algorithms adopted in spike sorting schemes are sensitive to the shapes and structures of the signal, the spectral clustering algorithm being an exception. To improve the performance of spectral clustering for spike sorting, this paper employs a locally weighted co-association matrix as the similarity matrix and introduces the Shannon entropy to measure the dependability of the clustering. Experimental results show that spike sorting with the improved spectral clustering algorithm is superior to spike sorting with other classic clustering algorithms.
Wei Ji, Zhenbin Li, Yun Li
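A plain (unweighted) co-association matrix, the structure that the paper's locally weighted variant builds on, can be computed as below: entry (i, j) is the fraction of base clusterings in which spikes i and j land in the same cluster. This sketch omits the paper's local weighting and entropy-based dependability measure and is only meant to show how the ensemble similarity matrix arises.

```python
def co_association(labelings, n):
    """Build a co-association matrix from several clusterings.

    labelings: list of label lists, each assigning every one of the
    n samples to a cluster. The resulting n x n matrix can be used
    as the similarity matrix for spectral clustering.
    """
    m = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    m[i][j] += 1.0
    k = float(len(labelings))
    return [[v / k for v in row] for row in m]

runs = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
sim = co_association(runs, 4)
# samples 0 and 1 always co-cluster, so sim[0][1] == 1.0
```

Because the matrix only counts co-membership, it is independent of the arbitrary label values assigned by each base clustering run.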

Label Distribution Learning Based Age-Invariant Face Recognition

Face recognition is an important application of computer vision. Although the accuracy of face recognition is high, face recognition and retrieval across ages is still challenging: faces of the same person can differ greatly due to the aging process over time. The problem is that the images are not very similar, yet carry the same label. To reduce this intra-class discrepancy, in this paper we propose a new method called Label Distribution learning for an end-to-end neural network to learn more discriminative features. Extensive experiments conducted on three public face aging datasets (MORPH Album 2, CACD-VS and LFW) show the effectiveness of the proposed approach.
Hai Huang, Senlin Cheng, Zhong Hong, Liutong Xu

Overall Loss for Deep Neural Networks

Convolutional Neural Networks (CNNs) have been widely used for image classification and computer vision tasks such as face recognition and target detection. Softmax loss is one of the most commonly used components to train a CNN, but it only penalizes the classification error, so we consider how to better train for intra-class compactness and inter-class separability. In this paper, we propose an Overall Loss that gives classes better separability by penalizing the distances between class centers. With the Overall loss, we trained a robust CNN that achieves better performance. Extensive experiments on MNIST, CIFAR10 and LFW (a face dataset for face recognition) demonstrate the effectiveness of the Overall loss. We tried different models, visualized the experimental results and showed the effectiveness of our proposed Overall loss.
Hai Huang, Senlin Cheng, Liutong Xu

Sentiment Analysis Based on LSTM Architecture with Emoticon Attention

Sentiment analysis is one of the most important research directions in the natural language processing field. People increasingly use emoticons in text to express their sentiment. However, most existing algorithms for sentiment classification focus only on text and do not make full use of emoticon information. To address this issue, we propose a novel LSTM architecture with emoticon attention that incorporates emoticon information into sentiment analysis, using the emoticons to capture crucial semantic components. To evaluate the efficiency of our model, we build the first sentiment corpus with rich emoticons, crawled from a movie review website, and use it as our experimental dataset. Experimental results show that our approach makes better use of emoticon information and improves performance on sentiment analysis.
Changliang Li, Changsong Li, Pengyuan Liu

Aspect Level Sentiment Analysis with Aspect Attention

Aspect level sentiment classification is a fundamental task in the field of sentiment analysis, whose goal is to infer the sentiment toward entities mentioned within texts, or toward aspects of them. Since it performs finer-grained analysis, aspect level sentiment classification is more challenging. Recently, neural network approaches such as LSTMs have achieved much progress in sentiment analysis. However, most neural models capture little aspect information in sentences, even though the aspect level sentiment of a sentence is determined not only by the content but also by the concerned aspect. In this paper, we propose a novel LSTM with Aspect Attention model (LSTM_AA) for aspect level sentiment classification. Our model introduces aspect attention to relate the aspect level sentiment of a sentence closely to the concerned aspect, and to explore the connection between an aspect and the content of a sentence. We experiment on the SemEval 2014 datasets; the results show that our model performs comparably to the state-of-the-art deep memory network and substantially better than other neural network approaches. Moreover, our approach is more robust than the deep memory network, whose performance depends heavily on the number of hops.
Changliang Li, Hailiang Wang, Saike He

The 1st Pacific Asia Workshop on Deep Learning for Knowledge Transfer (DLKT 2019)


Transfer Channel Pruning for Compressing Deep Domain Adaptation Models

Deep unsupervised domain adaptation has recently received increasing attention from researchers. However, existing methods are computationally intensive due to the cost of the CNNs (Convolutional Neural Networks) adopted by most work, and there has been no effective network compression method for this problem. In this paper, we propose a unified Transfer Channel Pruning (TCP) approach for accelerating deep unsupervised domain adaptation (UDA) models. TCP compresses the deep UDA model by pruning less important channels while simultaneously learning transferable features by reducing the cross-domain distribution divergence. It therefore reduces the impact of negative transfer and maintains competitive performance on the target task. To the best of our knowledge, TCP is the first approach that aims at accelerating deep unsupervised domain adaptation models. TCP is validated on two benchmark datasets – Office-31 and ImageCLEF-DA – with two common backbone networks – VGG16 and ResNet50. Experimental results demonstrate that TCP achieves comparable or better classification accuracy than other methods while significantly reducing the computational cost. More specifically, with VGG16 we obtain even higher accuracy after pruning 26% of floating point operations (FLOPs); with ResNet50 we also obtain higher accuracy on half of the tasks after pruning 12% of FLOPs.
Chaohui Yu, Jindong Wang, Yiqiang Chen, Zijing Wu
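Channel pruning of the kind TCP builds on can be illustrated with a simple magnitude criterion: rank channels by the L1 norm of their filter weights and keep the top fraction. Note this sketch uses plain weight magnitude as the importance score; TCP's actual criterion additionally accounts for cross-domain transferability, which is not modeled here.

```python
def prune_channels(filters, keep_ratio=0.5):
    """Rank convolution channels by L1 norm and keep the top ones.

    filters: list of per-channel weight lists. Channels with small
    L1 norm are treated as less important and pruned. Returns the
    indices of the channels to keep, in their original order.
    """
    importance = [sum(abs(w) for w in f) for f in filters]
    n_keep = max(1, int(len(filters) * keep_ratio))
    ranked = sorted(range(len(filters)), key=lambda i: importance[i],
                    reverse=True)
    return sorted(ranked[:n_keep])

# Four channels; the 2nd and 4th have tiny weights and get pruned.
weights = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [0.05, -0.03]]
print(prune_channels(weights, keep_ratio=0.5))  # -> [0, 2]
```

Pruning whole channels (rather than individual weights) is what yields the FLOP reductions quoted in the abstract, since entire feature maps and their downstream computations disappear.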

A Heterogeneous Domain Adversarial Neural Network for Trans-Domain Behavioral Targeting

To realize trans-domain behavioral targeting, which targets interested potential users of a source domain (e.g. E-Commerce) based on their behaviors in a target domain (e.g. Ad-Network), heterogeneous transfer learning (HeTL) is a promising technique for modeling behavior linkage between domains. It is required for HeTL to learn three functionalities: representation alignment, distribution alignment, and classification. In our previous work, we prototyped and evaluated two typical transfer learning algorithms, but neither of them jointly learns the three desired functionalities. Recent advances in transfer learning include a domain-adversarial neural network (DANN), which jointly learns distribution alignment and classification. In this paper, we extended DANN to be able to learn representation alignment by simply replacing its shared encoder with domain-specific types, so that it jointly learns the three desired functionalities. We evaluated the effectiveness of the joint learning of the three functionalities using real-world data of two domains: E-Commerce, which is set as the source domain, and Ad Network, which is set as the target domain.
Kei Yonekawa, Hao Niu, Mori Kurokawa, Arei Kobayashi, Daichi Amagata, Takuya Maekawa, Takahiro Hara

Natural Language Business Intelligence Question Answering Through SeqtoSeq Transfer Learning

Enterprise data is usually stored in relational databases. Question answering systems provide an easier way for business analysts to get data insights without struggling with SQL syntax. However, building a supervised machine-learning-based question answering system is a challenging task involving large manual annotation effort for a specific domain. In this paper we explore transfer learning for neural sequence taggers, where a source task with plentiful annotations (e.g., training samples (NL questions) in the IT enterprise domain) is used to improve performance on a target task with fewer available annotations (e.g., training samples (NL questions) in the pharmaceutical domain). We examine the effects of transfer learning for deep recurrent networks across domains and show that significant improvement can often be obtained. Our question answering framework is based on a set of machine learning models that create an intermediate sketch from a natural language query. Using the intermediate sketch, we generate a final database query over a large knowledge graph. Our framework supports multiple query types, such as aggregation, self joins, factoid and transactional queries.
Amit Sangroya, Pratik Saini, Mrinal Rawat, Gautam Shroff, C. Anantaram

Robust Faster R-CNN: Increasing Robustness to Occlusions and Multi-scale Objects

Recognizing objects at vastly different scales and objects with occlusion is a fundamental challenge in computer vision. In this paper, we propose a novel method called Robust Faster R-CNN for detecting objects in multi-label images. The framework is based on the Faster R-CNN architecture. We improve Faster R-CNN by replacing RoIPooling with RoIAlign to remove the harsh quantization of RoIPool, and we design multi-RoIAlign by adding pooling (align) operations of different sizes in order to adapt to objects of different sizes. Furthermore, we adopt multi-feature fusion to enhance the ability to recognize small objects. During training, we train an adversarial network to generate examples with occlusions and combine it with our model to make it invariant to occlusions. Experimental results on the Pascal VOC 2012 and 2007 datasets demonstrate the superiority of the proposed approach over many state-of-the-art approaches.
Tao Zhou, Zhixin Li, Canlong Zhang

Effectively Representing Short Text via the Improved Semantic Feature Space Mapping

Short text representation (STR) has attracted increasing interest recently with the rapid growth of Web and social media data in short text form. In this paper, we present a new method that uses an improved semantic feature space mapping to effectively represent short texts. Firstly, semantic clustering of terms is performed based on statistical analysis and word2vec, and the semantic feature space is then represented via the cluster centers. Next, the context information of terms is integrated with the semantic feature space, on which basis three improved similarity calculation methods are established. Thereafter the text mapping matrix is constructed for short text representation learning. Experiments on both Chinese and English test collections show that the proposed method reflects the semantic information of short texts well and represents them reasonably and effectively.
Ting Tuo, Huifang Ma, Haijiao Liu, Jiahui Wei
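The core mapping step can be sketched as follows: each short text is represented by its similarity to a set of semantic cluster centers rather than by a sparse bag of words. The toy word vectors and the two hand-picked centers below are illustrative stand-ins for real word2vec embeddings and learned clusters; the paper's three improved similarity measures are replaced here by plain cosine similarity.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Toy 2-D "word2vec" embeddings and two cluster centers (animal / finance).
word_vecs = {
    "cat": np.array([1.0, 0.1]), "dog": np.array([0.9, 0.2]),
    "stock": np.array([0.1, 1.0]), "market": np.array([0.2, 0.9]),
}
centers = [np.array([0.95, 0.15]), np.array([0.15, 0.95])]

def map_short_text(tokens):
    """Represent a short text as its mean similarity to each cluster center."""
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    return np.array([np.mean([cosine(v, c) for v in vecs]) for c in centers])

rep = map_short_text(["cat", "dog"])
print(rep.round(2))  # high similarity to the animal cluster, low to finance
```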

Probabilistic Graphical Model Based Highly Scalable Directed Community Detection Algorithm

Community detection algorithms have essential applications for characterizing complex networks and contribute to the study of real networks such as online social networks and logistics distribution networks. However, traditional community detection algorithms concentrate only on undirected networks and cannot handle directionality, a significant characteristic of real networks. Based on the information transfer probability method from the classic Probabilistic Graphical Model (PGM) theory of Turing Award winner Judea Pearl, we propose an efficient local directed community detection method named Information Transfer Gain (ITG), built on the basic information transfer triangles that compose the core structure of a community. Then, to process large-scale directed social networks efficiently, we propose a scalable, distributed version of the algorithm, Distributed Information Transfer Gain (DITG), based on the GraphX model in Spark. Finally, through extensive experiments on directed artificial network datasets and real social network datasets, we show that our algorithm achieves good precision and efficiency in a distributed environment compared with classical directed detection algorithms such as FastGN, OSLOM, and Infomap.
XiaoLong Deng, ZiXiang Nie, JiaYu Zhai
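The notion of a directed "information transfer triangle" can be made concrete with a small sketch: for a node pair (u, v), count the intermediaries w that close a directed length-2 path u → w → v. This is a simplified stand-in for the paper's ITG score, not its actual formula, and the toy graph is invented for illustration.

```python
# Toy directed graph as adjacency sets: edge u -> v.
edges = {
    "a": {"b", "c"}, "b": {"c"}, "c": {"a"},
    "d": {"e"}, "e": {"d"},
}

def transfer_triangles(u, v):
    """Count intermediaries w with u -> w -> v: a crude proxy for how strongly
    information injected at u reaches v along directed length-2 paths."""
    return sum(1 for w in edges.get(u, ()) if v in edges.get(w, ()))

print(transfer_triangles("a", "c"))  # "b" closes the path a -> b -> c
```

Nodes whose pairwise transfer counts are high form tightly knit directed cores, which is the kind of local structure a community detector can grow from.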

Hilltop Based Recommendation in Co-author Networks

The scale of projects and literature has continuously expanded and become more complex with the development of scientific research, and scientific cooperation has become an important trend. Analysis of the co-author network is a big data problem. Without enough data mining, research cooperation remains limited to the same groups, called "small groups" in co-author networks. This situation leads to a lack of openness among researchers and limits scientific results, so it is important to recommend potential collaborators from the huge amount of literature. We propose a method based on the Hilltop algorithm, a link-analysis algorithm from search engines, to recommend co-authors. The candidate set is screened and scored for recommendation: by setting certain rules, the expert set formation of the Hilltop algorithm is added to the screening, and the score is calculated from the durations and times of past collaborations. Our experiments show that co-authors can be extracted and recommended from the big data of scientific research literature.
Qiong Wu, Xuan Ou, Jianjun Yu, Heliang Yuan
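The scoring step described above (combining how often and over how long a span two authors have collaborated) might look like the following sketch. The records, weights `w_times` and `w_span`, and the linear combination are all illustrative assumptions, not the paper's actual scoring function.

```python
from collections import defaultdict

# Toy collaboration records: (author, co-author, year).
records = [
    ("alice", "bob", 2015), ("alice", "bob", 2017),
    ("alice", "carol", 2018),
]

def collaboration_scores(author, w_times=1.0, w_span=0.5):
    """Score each past co-author by the number of joint papers and the span
    of years over which the collaboration lasted (weights are illustrative)."""
    years = defaultdict(list)
    for a, b, y in records:
        if a == author:
            years[b].append(y)
    return {co: w_times * len(ys) + w_span * (max(ys) - min(ys))
            for co, ys in years.items()}

scores = collaboration_scores("alice")
print(scores)  # bob: 2 papers over 2 years; carol: 1 paper in 1 year
```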

Neural Variational Collaborative Filtering for Top-K Recommendation

Collaborative Filtering (CF) is one of the most widely applied models for recommender systems. However, since CF-based methods suffer from data sparsity and cold-start, more attention has been drawn to hybrid methods that use both rating and content information. The Variational Autoencoder (VAE) has been confirmed to be highly effective in the CF task due to its Bayesian nature and non-linearity. Nevertheless, most VAE models suffer from data sparsity, which leads to poor latent representations of users and items. Besides, most existing VAE-based methods model either user latent factors or item latent factors, which makes them unable to recommend items to a new user or recommend a new item to existing users. To address these problems, we propose a novel deep hybrid framework for top-K recommendation, named Neural Variational Collaborative Filtering (NVCF), in which user and item side information is incorporated into the generative processes of users and items to alleviate data sparsity and learn better latent representations. For inference, we derive a Stochastic Gradient Variational Bayes (SGVB) algorithm to approximate the intractable distributions of the latent factors of users and items. Experiments on two public datasets show that our method significantly outperforms state-of-the-art CF-based and VAE-based methods.
Xiaoyi Deng, Fuzhen Zhuang
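The SGVB machinery that NVCF relies on rests on two standard pieces, sketched below in NumPy: the reparameterization trick (so samples of the latent factors stay differentiable in the variational parameters) and the closed-form KL term of a diagonal Gaussian against a standard-normal prior. This is the generic VAE recipe, not NVCF's specific generative model or network architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """SGVB reparameterization: z = mu + sigma * eps with eps ~ N(0, I),
    so gradients can flow through mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q(z|x) || N(0, I)) used in the VAE/SGVB objective."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

mu, log_var = np.zeros(4), np.zeros(4)  # q(z|x) for one toy user latent factor
z = reparameterize(mu, log_var)
kl = kl_to_standard_normal(mu, log_var)
print(z.shape, kl)  # KL is 0 when q already equals the prior
```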

