scroll identifier for mobile
main-content

## Über dieses Buch

This two-volume set LNCS 11446 and LNCS 11447 constitutes the refereed proceedings of the 24th International Conference on Database Systems for Advanced Applications, DASFAA 2019, held in Chiang Mai, Thailand, in April 2019.

The 92 full papers and 64 short papers were carefully selected from a total of 501 submissions. In addition, 13 demo papers and 6 tutorial papers are included. The full papers are organized in the following topics: big data; clustering and classification; crowdsourcing; data integration; embedding; graphs; knowledge graph; machine learning; privacy and graph; recommendation; social network; spatial; and spatio-temporal. The short papers, demo papers, and tutorial papers can be found in the volume LNCS 11448, which also includes the workshops of DASFAA 2019.

## Inhaltsverzeichnis

### Correction to: An Exploration of Cross-Modal Retrieval for Unseen Concepts

In the original version of the chapter titled “An Exploration of Cross-Modal Retrieval for Unseen Concepts”, the acknowledgement was missing. It has been added.

Fangming Zhong, Zhikui Chen, Geyong Min

### An Approach Based on Bayesian Networks for Query Selectivity Estimation

The efficiency of a query execution plan depends on the accuracy of the selectivity estimates given to the query optimiser by the cost model. The cost model makes simplifying assumptions in order to produce said estimates in a timely manner. These assumptions lead to selectivity estimation errors that have dramatic effects on the quality of the resulting query execution plans. A convenient assumption that is ubiquitous among current cost models is to assume that attributes are independent with each other. However, it ignores potential correlations which can have a huge negative impact on the accuracy of the cost model. In this paper we attempt to relax the attribute value independence assumption without unreasonably deteriorating the accuracy of the cost model. We propose a novel approach based on a particular type of Bayesian networks called Chow-Liu trees to approximate the distribution of attribute values inside each relation of a database. Our results on the TPC-DS benchmark show that our method is an order of magnitude more precise than other approaches whilst remaining reasonably efficient in terms of time and space.

Max Halford, Philippe Saint-Pierre, Franck Morvan

### An Exploration of Cross-Modal Retrieval for Unseen Concepts

Cross-modal hashing has drawn increasing research interests in cross-modal retrieval due to the explosive growth of multimedia big data. However, most of the existing models are trained and tested in a close-set circumstance, which may easily fail on the newly emerged concepts that are never present in the training stage. In this paper, we propose a novel cross-modal hashing model, named Cross-Modal Attribute Hashing (CMAH), which can handle cross-modal retrieval of unseen categories. Inspired by zero-shot learning, attribute space is employed to transfer knowledge from seen categories to unseen categories. Specifically, the cross-modal hashing functions learning and knowledge transfer are conducted by modeling the relationships among features, attributes, and classes as a dual multi-layer network. In addition, graph regularization and binary constraints are imposed to preserve the local structure information in each modality and to reduce quantization loss, respectively. Extensive experiments are carried out on three datasets, and the results demonstrate the effectiveness of CMAH in handling cross-modal retrieval for both seen and unseen concepts.

Fangming Zhong, Zhikui Chen, Geyong Min

### Continuous Patient-Centric Sequence Generation via Sequentially Coupled Adversarial Learning

Analyzing massive patient-centric Electronic Health Records (EHRs) becomes a key to success for improving health care and treatment. However, the amount of these data is limited and the access to EHRs is difficult due to the issue of patient privacy. Thus high quality synthetic EHRs data is necessary to alleviate these issues. In this paper, we propose a Sequentially Coupled Generative Adversarial Network (SC-GAN) to generate continuous patient-centric data, including patient state and medication dosage data. SC-GAN consists of two generators which coordinate the generation of patient state and medication dosage in a unified model, revealing the clinical fact that the generation of patient state and medication dosage data have noticeable mutual influence on each other. To verify the quality of the synthetic data, we conduct comprehensive experiments to employ these data on real medical tasks, showing that data generated from SC-GAN leads to better performance than the data from other generative models.

Lu Wang, Wei Zhang, Xiaofeng He

### DMMAM: Deep Multi-source Multi-task Attention Model for Intensive Care Unit Diagnosis

Disease diagnosis can provide crucial information for clinical decisions that influence the outcome in acute serious illness, and this is particularly in the intensive care unit (ICU). However, the central role of diagnosis in clinical practice is challenged by evidence that does not always benefit patients and that factors other than disease are important in determining patient outcome. To streamline the diagnostic process in daily routine and avoid misdiagnoses, in this paper, we proposed a deep multi-source multi-task attention model (DMMAM) for ICU disease diagnosis. DMMAM exploits multi-sources information from various types of complications, clinical measurements, and the medical treatments to support the diagnosis. We evaluate the proposed model with 50 diseases of 9 classifications on an extensive collection of real-world ICU Electronic Health Records (EHR) dataset with 151729 ICU admissions from 46520 patients. Experiments results demonstrate the effectiveness and the robustness of our model.

Zhenkun Shi, Wanli Zuo, Weitong Chen, Lin Yue, Yuwei Hao, Shining Liang

### Learning k-Occurrence Regular Expressions with Interleaving

Since lacking valid schemas is a critical problem for XML and present research on interleaving for XML is also quite insufficient, in this paper we focus on the inference of XML schemas with interleaving. Previous researches have shown that the essential task in schema learning is inferring regular expressions from a set of given samples. Presently, the most powerful model to learn XML schemas is the k-occurrence regular expressions (k-OREs for short). However, there have been no algorithms that can learn k-OREs with interleaving. Therefore, we propose an entire framework which can support both k-OREs and interleaving. To the best of our knowledge, our work is the first to address these two inference problems at the same time. We first defined a new subclass of regular expressions named k-OIREs, and developed an inference algorithm iKOIRE to learn k-OIRE based on genetic algorithm and maximum independent set (MIS). We further conducted a series of experiments on large-scale real datasets, and evaluated the effectiveness of our work compared with both ongoing learning algorithms in academia and industrial tools in real world. The results reveal the high practicability and outstanding performance of our work, and indicate its promising prospects in application.

Yeting Li, Xiaolan Zhang, Jialun Cao, Haiming Chen, Chong Gao

### Learning from User Social Relation for Document Sentiment Classification

Sentiment analysis is a fundamental problem in the field of natural language processing. Existing methods incorporate both semantics of texts and user-level information into deep neural networks to perform sentiment classification of social media documents. However, they ignored the relations between users which can serve as a crucial evidence for classification. In this paper, we propose SRPNN, a deep neural network based model to take user social relations into consideration for sentiment classification. Our model is based on the observation that social relations between users with similar sentiment trends provide important signals for deciding the polarity of words and sentences in a document. To make use of such information, we develop a user trust network based random walk algorithm to capture the sequence of users that have similar sentiment orientation. We then propose a deep neural network model to jointly learn the text representation and user social interaction. Experimental results on two popular real-world datasets show that our model significantly outperforms state-of-the-art methods.

Kangzhi Zhao, Yong Zhang, Yan Zhang, Chunxiao Xing, Chao Li

### Reinforcement Learning to Diversify Top-N Recommendation

In this paper, we study how to recommend both accurate and diverse top-N recommendation, which is a typical instance of the maximum coverage problem. Traditional approaches are to treat the process of constructing the recommendation list as a problem of greedy sequential items selection, which are inevitably sub-optimal. In this paper, we propose a reinforcement learning and neural networks based framework – Diversify top-N Recommendation with Fast Monte Carlo Tree Search (Div-FMCTS) – to optimize the diverse top-N recommendations in a global view. The learning of Div-FMCTS consists of two stages: (1) searching for better recommendation with MCTS; (2) generalizing those plans with the policy and value neural networks. Due to the difficulty of searching over extremely large item permutations, we propose two approaches to speeding up the training process. The first approach is pruning the branches of the search tree by the structure information of the optimal recommendations. The second approach is searching over a randomly chosen small subset of items to quickly harvest the fruits of searching in the generalization with neural networks. Its effectiveness has been proved both empirically and theoretically. Extensive experiments on four benchmark datasets have demonstrated the superiority of Div-FMCTS over state-of-the-art methods.

Lixin Zou, Long Xia, Zhuoye Ding, Dawei Yin, Jiaxing Song, Weidong Liu

### Retweeting Prediction Using Matrix Factorization with Binomial Distribution and Contextual Information

Retweeting provides an efficient way to expand information diffusion in social networks, and many methods have been proposed to model user’s retweeting behaviors. However, most of existing works focus on devising an effective prediction method based on social network data, and few research studies explore the data characteristic of retweeting behaviors which is typical binary discrete distribution and sparse data. To this end, we propose two novel retweeting prediction models, named Binomial Retweet Matrix Factorization (BRMF) and Context-aware Binomial Retweet Matrix Factorization (CBRMF). The two proposed models assume that retweetings are from binomial distributions instead of normal distributions given the factor vectors of users and messages, and then predicts the unobserved retweetings under matrix factorization. To alleviate data sparsity and reduce noisy information, CBRMF first learns user community by using community detection method and message clustering by using short texts clustering algorithm from social contextual information on the basis of homophily assumption, respectively. Then CBRMF incorporates the impacts of homophily characteristics on users and messages as two regularization terms into BRMF to improve the prediction performance. We evaluate the proposed methods on two real-world social network datasets. The experimental results show BRMF achieves better the prediction accuracy than normal distributions based matrix factorization model, and CBRMF outperforms existing state-of-the-art comparison methods.

Bo Jiang, Zhigang Lu, Ning Li, Jianjun Wu, Feng Yi, Dongxu Han

### Sparse Gradient Compression for Distributed SGD

Communication bandwidth is a bottleneck in distributed machine learning, and limits the system scalability. The transmission of gradients often dominates the communication in distributed SGD. One promising technique is using the gradient compression to reduce the communication cost. Recently, many approaches have been developed for the deep neural networks. However, they still suffer from the high memory cost, slow convergence and serious staleness problems over sparse high-dimensional models. In this work, we propose Sparse Gradient Compression (SGC) to efficiently train both the sparse models and the deep neural networks. SGC uses momentum approximation to reduce the memory cost with negligible accuracy degradation. Then it improves the accuracy with long-term gradient compensation, which maintains global momentum to make up for the information loss caused by the approximation. Finally, to alleviate the staleness problem, SGC updates model weight with the accumulation of delayed gradients at local, called local update technique. The experiments over the sparse high-dimensional models and deep neural networks indicate that SGC can compress 99.99% gradients for every iteration without performance degradation, and saves the communication cost up to 48 $$\times$$ .

Haobo Sun, Yingxia Shao, Jiawei Jiang, Bin Cui, Kai Lei, Yu Xu, Jiang Wang

### STDR: A Deep Learning Method for Travel Time Estimation

With the booming traffic developments, estimating the travel time for a trip on road network has become an important issue, which can be used for driving navigation, traffic monitoring, route planning, and ride sharing, etc. However, it is a challenging problem mainly due to the complicate spatial-temporal dependencies, external weather conditions, road types and so on. Most traditional approaches mainly fall into the sub-segments or sub-paths categories, in other words, divide a path into a sequence of segments or sub-paths and then sum up the sub-time, yet which don’t fit the real-world situations such as the continuously dynamical changing route or the waiting time at the intersections. To address these issues, in this paper, we propose an end to end Spatial Temporal Deep learning network with Road type named STDR to estimate the travel time based on historical trajectories and external factors. The model jointly leverages CNN and LSTM to capture the complex nonlinear spatial-temporal characteristics, more specifically, the convolutional layer extracts the spatial characteristics and the LSTM with attention mechanism extracts the time series characteristics. In addition, to better discover the influence of the road type, we introduce a road segmentation approach which is capable of dividing the trajectory based on the shape of trajectory. We conduct extensive verification experiments for different settings, and the results demonstrate the superiority of our method.

Jie Xu, Yong Zhang, Li Chao, Chunxiao Xing

### Using Fractional Latent Topic to Enhance Recurrent Neural Network in Text Similarity Modeling

Recurrent neural networks (RNNs) have been widely used in text similarity modeling for text semantic representation learning. However, referring to the classical topic models, a text contains many different latent topics, and the complete semantic information of the text is described by all the latent topics. Previous RNN based models usually learn the text representation with the separated words in the text instead of topics, which will bring noises and loss hierarchical structure information for text representation. In this paper, we proposed a novel fractional latent topic based RNN (FraLT-RNN) model, which focuses on the text representation in topic-level and largely preserve the whole semantic information of a text. To be specific, we first adopt the fractional calculus to generate latent topics for a text with the hidden states learned by a RNN model. Then, we propose a topic-wise attention gating mechanism and embed it into our model to generate the topic-level attentive vector for each topic. Finally, we reward the topic perspective with the topic-level attention for text representation. Experiments on four benchmark datasets, namely TREC-QA and WikiQA for answer selection, MSRP for paraphrase identification, and MultiNLI for textual entailment, show the great advantages of our proposed model.

Yang Song, Wenxin Hu, Liang He

### Efficiently Mining Maximal Diverse Frequent Itemsets

Given a database of transactions, where each transaction is a set of items, maximal frequent itemset mining aims to find all itemsets that are frequent, meaning that they consist of items that co-occur in transactions more often than a given threshold, and that are maximal, meaning that they are not contained in other frequent itemsets. Such itemsets are the most interesting ones in a meaningful sense. We study the problem of efficiently finding such itemsets with the added constraint that only the top-k most diverse ones should be returned. An itemset is diverse if its items belong to many different categories according to a given hierarchy of item categories. We propose a solution that relies on a purposefully designed index structure called the FP*-tree and an accompanying bound-based algorithm. An extensive experimental study offers insight into the performance of the solution, indicating that it is capable of outperforming an existing method by orders of magnitude and of scaling to large databases of transactions.

Dingming Wu, Dexin Luo, Christian S. Jensen, Joshua Zhexue Huang

### Efficient Local Search for Minimum Dominating Sets in Large Graphs

The Minimum Dominating Set (MinDS) problem is an NP-hard problem of great importance in both theories and applications. In this paper, we propose a new local search algorithm ScBppw (Score Checking and Best-picking with Probabilistic Walk) to solve the MinDS problem in large graphs. For diversifying the search, our algorithm exploits a tabu strategy, called Score Checking (SC), which forbids a vertex to be added into the current candidate solution if the vertex’s score has not been changed since the last time it was removed out of the candidate solution. Also, to keep a good balance between intensification and diversification during the search, we propose a strategy that combines, in a novel way, best-picking with probabilistic walk at removing stages. At this stage, the algorithm selects a vertex with the minimum loss, or other vertices in the candidate solution with a probability proportional to the their degrees, depending on how repeatedly the area has been visited. Experimental results show that our solver significantly outperforms state-of-the-art MinDS solvers. Also we conducted several experiments to show the individual impacts of our novelties.

Yi Fan, Yongxuan Lai, Chengqian Li, Nan Li, Zongjie Ma, Jun Zhou, Longin Jan Latecki, Kaile Su

### Multi-level Graph Compression for Fast Reachability Detection

Fast reachability detection is one of the key problems in graph applications. Most of the existing works focus on creating an index and answering reachability based on that index. For these approaches, the index construction time and index size can become a concern for large graphs. More recently query-preserving graph compression has been proposed and searching reachability over the compressed graph has been shown to be able to significantly improve query performance as well as reducing the index size. In this paper, we introduce a multilevel compression scheme for DAGs, which builds on existing compression schemes, but can further reduce the graph size for many real-world graphs. We propose an algorithm to answer reachability queries using the compressed graph. Extensive experiments with two existing state-of-the-art reachability algorithms and 10 real-world datasets demonstrate that our approach outperforms the existing methods.

Shikha Anirban, Junhu Wang, Md. Saiful Islam

### Multiple Privacy Regimes Mechanism for Local Differential Privacy

Local differential privacy (LDP), as a state-of-the-art privacy notion, enables users to share protected data safely while the private real data never leaves user’s device. The privacy regime is one of the critical parameters balancing between the correctness of the statistical result and the level of user’s privacy. In the majority of current work, authors assume that the privacy regime is totally determined by the service provider and dispatched to all users. However, it is inelegant and unpromising for all users to accept the same privacy level in real world. In this paper, we propose a new LDP estimation method MLE which is applicable for the scenario of multiple privacy regimes. MLE uses the idea of parameter estimation to merge the results generated by users of different privacy levels. We also propose an extension of MLE to handle the situation when all users’ regimes are in a continuous distribution. We also provide an Adapt estimator which assigns users to use different LDP schemes based on their regimes, and it performs better than the estimator with only one fixed LDP scheme. Experiments show that our methods provide a higher level of accuracy than previous proposals in this multiple regimes scenario.

Yutong Ye, Min Zhang, Dengguo Feng, Hao Li, Jialin Chi

### Privacy Preserving Elastic Stream Processing with Clouds Using Homomorphic Encryption

Prevalence of the Infrastructure as a Service (IaaS) clouds has enabled organizations to elastically scale their stream processing applications to public clouds. However, current approaches for elastic stream processing do not consider the potential security vulnerabilities in cloud environments. In this paper we describe the design and implementation of an Elastic Switching Mechanism for data stream processing which is based on Homomorphic Encryption (HomoESM). The HomoESM not only does elastically scale data stream processing applications into public clouds but also preserves the privacy of such applications. Using a real world test setup, which includes an email filter benchmark and a web server access log processor benchmark (EDGAR) we demonstrate the effectiveness of our approach. Multiple experiments on Amazon EC2 indicate that the proposed approach for Homomorphic encryption provides significant results which is 10% to 17% improvement of average latency in the case of email filter benchmark and EDGAR benchmarks respectively. Furthermore, EDGAR add/subtract operations and comparison operations showed 6.13% and 26.17% average latency improvements respectively. These promising results pave the way for real world deployments of privacy preserving elastic stream processing in the cloud.

Arosha Rodrigo, Miyuru Dayarathna, Sanath Jayasena

### Select the Best for Me: Privacy-Preserving Polynomial Evaluation Algorithm over Road Network

Recent years have witnessed the rapid advances in location based services (LBSs) with the fast development of mobile devices and communication technologies. Outsourcing spacial databases to Cloud provides an economical and flexible way for the LBS provider to deploy services. For an LBS enterprise, the collected data might be its most valuable strategic asserts. However, in the outsourced database paradigm, the third-party database server at cloud is not completely trustworthy, therefore, protecting data privacy is critical. A polynomial evaluation algorithm over road network returns top-k results (restaurant) to a user (tourist). It is a fundamental and important query mode has been widely used in LBS applications. In this paper, we extend the Order-Revealing Encryption (ORE) to design a privacy-preserving polynomial evaluation algorithm over the spacial data with security guarantee. To reduce the computation overhead, we introduce an influence model to store the encrypted point data (restaurant). Besides the stronger security, this work is a practical polynomial evaluation algorithm over the road network. The practicality is mainly manifested in two aspects: evaluation over multiple attributes, and the dynamic evaluation function rather than fixed function in advance. We formally prove the security of our scheme in the random oracle model. Finally, we implement a prototype to evaluate the performance of our scheme. The experimental results over the real road network dataset demonstrate that the proposed scheme is an efficient and practical polynomial evaluation algorithm for the LBS applications.

Wei Song, Chengliang Shi, Yuan Shen, Zhiyong Peng

### Recommendation

#### Frontmatter

User preferences are dynamic and diverse in real world, while historical preference of a user may not be equally important as current preference when predicting future interests. As a result, learning the evolving user representation effectively becomes a critical problem in personalized recommendation. However, existing recommendation solutions often use a fixed user representation, which is not capable of modeling the complex interests of users. To this end, we propose a novel metric learning approach named Adaptive Collaborative Metric Learning (AdaCML) for recommendation. AdaCML employs a memory component and an attention mechanism to learn an adaptive user representation, which dynamically adapts to locally activated items. In this way, implicit relationships of user-item pairs can be better determined in the metric space and users’ interests can be modeled more accurately. Comprehensive experimental results demonstrate the effectiveness of AdaCML on two datasets, and show that AdaCML outperforms competitive baselines in terms of Precision, Recall, and Normalized Discounted Cumulative Gain (NDCG).

Tingting Zhang, Pengpeng Zhao, Yanchi Liu, Jiajie Xu, Junhua Fang, Lei Zhao, Victor S. Sheng, Zhiming Cui

### Adaptive Attention-Aware Gated Recurrent Unit for Sequential Recommendation

Due to the dynamic and evolutionary characteristics of user interests, sequential recommendation plays a significant role in recommender systems. A fundamental problem in the sequential recommendation is modeling dynamic user preference. Recurrent Neural Networks (RNNs) are widely adopted in the sequential recommendation, especially attention-based RNN becomes the state-of-the-art solution. However the existing fixed attention mechanism is insufficient to model the dynamic and evolutionary characteristics of user sequential preferences. In this work, we propose a novel solution, Adaptive Attention-Aware Gated Recurrent Unit (3AGRU), to learn adaptive user sequential representations for sequential recommendation. Specifically, we adopt an attention mechanism to adapt the representation of user sequential preference, and learn the interaction between steps and items from data. Moreover, in the first level of 3AGRU, we construct adaptive attention network to describe the relevance between input and the candidate item. In this way, a new input based on adaptive attention can reflect users’ diverse interests. Then, the second level of 3AGRU applies adaptive attention network to hidden state level to learn a deep user representation which is able to express diverse interests of the user. Finally, we evaluate the proposed model using three real-world datasets from various application scenarios. Our experimental results show that our model significantly outperforms the state-of-the-art approaches on sequential recommendation.

Anjing Luo, Pengpeng Zhao, Yanchi Liu, Jiajie Xu, Zhixu Li, Lei Zhao, Victor S. Sheng, Zhiming Cui

### Attention and Convolution Enhanced Memory Network for Sequential Recommendation

The sequential recommendation, which models sequential behavioral patterns among users for the recommendation, plays a critical role in recommender systems. Conventionally, user general taste and recent demand are combined to promote recommendation performance. However, existing methods usually neglect that user long-term preference keeps evolving over time and only use a static user embedding to model the general taste. Moreover, they often ignore the feature interactions when modeling short-term sequential patterns and integrate user-item or item-item interactions through a linear way, which limits the capability of model. To this end, we propose an $$\mathbf {A}$$ ttention and $$\mathbf {C}$$ onvolution enhanced memory network for $$\mathbf {S}$$ equential $$\mathbf {R}$$ ecommendation (ACSR) in this paper. Specifically, an attention layer learns user’s general preference, while the convolutional layer searches for feature interactions and sequential patterns to capture user’s sequential preference. Moreover, the outputs of the attention layer and the convolutional layer are concatenated and fed into a fully-connected layer to generate the recommendation. This approach provides a unified and flexible network structure for capturing both general taste and sequential preference. Finally, we evaluate our model on two real-world datasets. Extensive experimental results show that our model ACSR outperforms the state-of-the-art approaches.

Jian Liu, Pengpeng Zhao, Yanchi Liu, Jiajie Xu, Junhua Fang, Lei Zhao, Victor S. Sheng, Zhiming Cui

### Attention-Based Neural Tag Recommendation

Personalized tag recommender systems suggest tags to users when annotating specific items. Usually, recommender systems need to take both users’ preference and items’ features into account. Existing methods like latent factor models based on tensor factorization use low-dimensional dense vectors to represent latent features of users, items and tags. The problem with these models is using the static representation for the user, which neglects that users’ preference keeps evolving over time. Other methods based on base-level learning (BLL) only use a simple time-decay function to weight users’ preference. In this paper, we propose a personalized tag recommender system based on neural networks and attention mechanism. This approach utilizes the multi-layer perceptron to model the non-linearities of interactions among users, items and tags. Also, an attention network is introduced to capture the complex pattern of the user’s tagging sequence. Extensive experiments on two real-world datasets show that the proposed model outperforms the state-of-the-art tag recommendation method.

Jiahao Yuan, Yuanyuan Jin, Wenyan Liu, Xiaoling Wang

### Density Matrix Based Preference Evolution Networks for E-Commerce Recommendation

In e-commerce platforms, mining temporal characteristics in user behavior is conducive to recommend the right product for the user at the right time. Recently, recurrent neural networks (RNNs) based methods have achieved profitable performance in exploring temporal features, however, in complex e-commerce scenarios, user preferences changing over time have not been fully exploited. In order to fill the gap, we propose a novel representation for user preferences with the inspiration of a quantum concept, density matrix. It encodes a mixture of item subspaces and represents distribution of user preferences at one time stamp. Further, such a representation and RNNs are combined to form our proposed Density Matrix based Preference Evolution Networks (DMPENs). Experiments on Amazon datasets as well as real-world e-commerce datasets demonstrate the effectiveness of the proposed methods, which achieve rapid convergence and superior performance compared with the state-of-the-art methods in terms of AUC and accuracy.

Panpan Wang, Zhao Li, Xuming Pan, Donghui Ding, Xia Chen, Yuexian Hou

### Multi-source Multi-net Micro-video Recommendation with Hidden Item Category Discovery

As the sheer volume of available micro-videos often undermines the users’ capability to choose the micro-videos, in this paper, we propose a multi-source multi-net micro-video recommendation model that recommends micro-videos fitting users’ best interests. Different from existing works, as micro-video inherits the characteristics of social platforms, we simultaneously incorporate multi-source content data of items and multi-networks of users to learn user and item representations for recommendation. This information can be complementary to each other in a way that multi-modality data can bridge the semantic gap among items, while multi-type user networks, such as following and reposting, are able to propagate the preferences among users. Furthermore, to discover the hidden categories of micro-videos that properly match users’ interests, we interactively learn the user-item representations. The resulted categorical representations are interacted with user representations to model user preferences at different level of hierarchies. Finally, multi-source content item data, multi-type user networks and hidden item categories are jointly modelled in a unified recommender, and the parameters of the model are collaboratively learned to boost the recommendation performance. Experiments on a real dataset demonstrate the effectiveness of the proposed model and its advantage over the state-of-the-art baselines.

Jingwei Ma, Jiahui Wen, Mingyang Zhong, Weitong Chen, Xiaofang Zhou, Jadwiga Indulska

### Incorporating Task-Oriented Representation in Text Classification

Text classification (TC) is an important task in natural language processing. Recently neural network has been applied to text classification and achieves significant improvement in performance. Since some documents are short and ambiguous, recent research enriches document representation with concepts of words extracted from an external knowledge base. However, this approach might incorporate task-irrelevant concepts or coarse granularity concepts that could not discriminate classes in a TC task. This might add noise to document representation and degrade TC performance. To tackle this problem, we propose a task-oriented representation that captures word-class relevance as task-relevant information. We integrate task-oriented representation in a CNN classification model to perform TC. Experimental results on widely used datasets show our approach outperforms comparison models.

Xue Lei, Yi Cai, Jingyun Xu, Da Ren, Qing Li, Ho-fung Leung

### Music Playlist Recommendation with Long Short-Term Memory

Music playlist recommendation is an important component in modern music streaming services, which is used for improving user experience by regularly pushing personalized music playlists based on users’ preferences. In this paper, we propose a novel music playlist recommendation problem, namely Personalized Music Playlist Recommendation (PMPR), which aims to provide a suitable playlist for a user by taking into account her long/short-term preferences and music contextual data. We propose a data-driven framework, which is comprised of two phases: user/music feature extraction and music playlist recommendation. In the first phase, we adopt a matrix factorization technique to obtain long-term features of users and songs, and utilize the Paragraph Vector (PV) approach, an advanced natural language processing technique, to capture music context features, which are the basis of the subsequent music playlist recommendation. In the second phase, we design two Attention-based Long Short-Term Memory (AB-LSTM) models, i.e., typical AB-LSTM model and Improved AB-LSTM (IAB-LSTM) model, to achieve the suitable personalized playlist recommendation. Finally, we conduct extensive experiments using a real-world dataset, verifying the practicability of our proposed methods.

Huiping Yang, Yan Zhao, Jinfu Xia, Bin Yao, Min Zhang, Kai Zheng

### Online Collaborative Filtering with Implicit Feedback

Studying recommender systems with implicit feedback has become increasingly important. However, most existing works are designed in an offline setting while online recommendation is quite challenging due to the one-class nature of implicit feedback. In this paper, we propose an online collaborative filtering method for implicit feedback. We highlight three critical issues of existing works. First, when positive feedback arrives sequentially, if we treat all the other missing items for this given user as the negative samples, the mis-classified items will incur a large deviation since some items might appear as the positive feedback in the subsequent rounds. Second, the cost of missing a positive feedback should be much higher than that of having a false-positive. Third, the existing works usually assume that a fixed model is given prior to the learning task, which could result in poor performance if the chosen model is inappropriate. To address these issues, we propose a unified framework for Online Collaborative Filtering with Implicit Feedback (OCFIF). Motivated by the regret aversion, we propose a divestiture loss to heal the bias derived from the past mis-classified negative samples. Furthermore, we adopt cost-sensitive learning method to efficiently optimize the implicit MF model without imposing a heuristic weight restriction on missing data. By leveraging meta-learning, we dynamically explore a pool of multiple models to avoid the limitations of a single fixed model so as to remedy the drawback of manual/heuristic model selection. We also analyze the theoretical bounds of the proposed OCFIF method and conduct extensive experiments to evaluate its empirical performance on real-world datasets.

Jianwen Yin, Chenghao Liu, Jundong Li, BingTian Dai, Yun-chen Chen, Min Wu, Jianling Sun

### Subspace Ensemble-Based Neighbor User Searching for Neighborhood-Based Collaborative Filtering

Neighborhood-based collaborative filtering (NCF) typically uses a similarity measure for finding similar users to a target user or similar products on which the target user rated. To find neighbor users, traditional similarity measures rely only on the ratings of co-rated items when calculating similarity of pairwise users. Some hybrid similarity measures can avoid this situation but they suffer from the time-consuming issue. To solve the mentioned issues, the current paper presents an effective method of subspace ensemble-based neighbor user searching (SENUS) for NCF. First, three item subspaces are constructed, or interested, neither interested nor uninterested, and uninterested subspaces. In each subspace, we calculate the co-rating support values for pairwise users. Then, SENUS combines three co-rating support values to get the total co-rating support values for pairwise users, which are utilized to generate direct neighbor users for a target user. For the target user, its neighbor users include direct and indirect ones in SENUS, where its indirect neighbors are the direct neighbors of its direct neighbors. Experimental results on public datasets indicate that the proposed method is promising in recommender systems.

Zepeng Li, Li Zhang

### Towards both Local and Global Query Result Diversification

Query result diversification is critical for improving users’ query satisfaction by making the top ranked results cover more different query semantics. The state-of-the-art works address the problem via bi-criteria (namely, relevance and dissimilarity) optimization. However, such works only consider how dissimilar the returned results are to each other, which is referred to “local diversity”. In contrast, some works consider how similar the not returned results are to the returned results, which is referred to “global diversity”, and however need a user defined threshold to predicate whether a result set is diverse. In this paper, we extend the traditional bi-criteria optimization problem to a tri-criteria problem that considers both local diversity and global diversity. For that, we formally define the metrics of global diversity and global-and-local diversity. Then, we prove the NP-hardness of the proposed problems, and propose two heuristic algorithms, greedy search and vertex substitution, and sophisticated optimization techniques to solve the problems efficiently. To evaluate our approach, we perform comprehensive experiments on three real datasets. The results demonstrate that our approach can indeed find more reasonably diversified results. Moreover, our greedy search algorithm can significantly reduce the time cost by leveraging the critical object, and then our vertex substitution algorithm can incrementally improve the objective value of results returned by greedy search with extra time cost.

Ming Zhong, Huanyu Cheng, Ying Wang, Yuanyuan Zhu, Tieyun Qian, Jianxin Li

### Structured Spectral Clustering of PurTree Data

Recently, a “Purchase Tree” data structure is proposed to compress the customer transaction data and a local PurTree Spectral clustering method is proposed to recover the cluster structure from the purchase trees. However, in the PurTree distance, the node weights for the children nodes of a parent node are set as equal and the difference between different nodes are not distinguished. In this paper, we propose a Structured PurTree Subspace Spectral (SPSS) clustering algorithm for PurTree Data. In the new method, we propose a PurTree subspace similarity to compute the similarity between two trees, in which a set of sparse and structured node weights are introduced to distinguish the importance of different nodes in a purchase tree. A new clustering model is proposed to learn a structured graph with explicit cluster structure. An iterative optimization algorithm is proposed to simultaneously learn the structured graph and node weights. We propose a balanced cover tree for fast k-NN searching during building affinity matrices. SPSS was compared with six clustering algorithms on 10 benchmark data sets and the experimental results show the superiority of the new method.

Xiaojun Chen, Chao Guo, Yixiang Fang, Rui Mao

### Dynamic Stochastic Block Model with Scale-Free Characteristic for Temporal Complex Networks

Complex network analysis has been widely applied in various fields such as social system, information system, and biological system. As the most popular model for analyzing complex network, Stochastic Block Model can perform network reconstruction, community detection, link prediction, anomaly detection, and other tasks. However, for the dynamic complex networks which are always modeling as a series of snapshot networks, the existing works for dynamic networks analysis which are based on the stochastic block model always analyze the evolution of dynamic networks by introducing probability transition matrix, then, the scale-free characteristic (power law of the degree distribution) of the network, is ignoring. So in order to overcome this limitation, we propose a fully Bayesian generation model, which incorporates the heterogeneity of the degree of nodes to model dynamic complex networks. Then we present a new dynamic stochastic block model for community detection and evolution tracking under a unified framework. We also propose an effective variational inference algorithm to solve the proposed model. The model is tested on the simulated datasets and the real-world datasets, and the experimental results show that the performance of it is superior to the baselines of community detection in dynamic networks.

Xunxun Wu, Pengfei Jiao, Yaping Wang, Tianpeng Li, Wenjun Wang, Bo Wang

### In Good Company: Efficient Retrieval of the Top-k Most Relevant Event-Partner Pairs

The proliferation of event-based social networking (ESBN) motivates a range of studies on topics such as event, venue, and friend recommendation and event creation and organization. In this setting, the notion of event-partner recommendation has recently attracted attention. When recommending an event to a user, this functionality allows recommendation of partner with whom to attend the event. However, existing proposals are push-based: recommendations are pushed to users at the system’s initiative. In contrast, EBSNs provide users with keyword-based search functionality. This way, users may retrieve information in pull mode. We propose a new way of accessing information in EBSNs that combines push and pull, thus allowing users to not only conduct ad-hoc searches for events, but also to receive partner recommendations for retrieved events. Specifically, we define and study the top-k event-partner (kEP) pair retrieval query that integrates event-partner recommendation and keyword-based search for events. The query retrieves event-partner pairs, taking into account the relevance of events to user-supplied keywords and so-called together preferences that indicate the extent of a user’s preference to attend an event with a given partner. In order to compute kEP queries efficiently, we propose a rank-join based framework with three optimizations. Results of empirical studies with implementations of the proposed techniques demonstrate that the proposed techniques are capable of excellent performance.

Dingming Wu, Yi Zhu, Christian S. Jensen

### Local Experts Finding Across Multiple Social Networks

The local experts finding, which aims to identify a set of k people with specialized knowledge around a particular location, has become a hot topic along with the popularity of social networks, such as Twitter, Facebook. Local experts are important for many applications, such as answering local information queries, personalized recommendation. In many real-world applications, complete social information should be collected from multiple social networks, in which people usually participate in and active. However, previous approaches of local experts finding mostly focus on a single social network. In this paper, as far as we know, we are the first to study the local experts finding problem across multiple large social networks. Specifically, we want to identify a set of k people with the highest score, where the score of a person is a combination of local authority and topic knowledge of the person. To efficiently tackle this problem, we propose a novel framework, KTMSNs (knowledge transfer across multiple social networks). KTMSNs consists of two steps. Firstly, given a person over multiple social networks, we calculate the local authority and the topic knowledge, respectively. We propose a social topology-aware inverted index to speed up the calculation of the two values. Secondly, we propose a skyline-based strategy to combine the two values for obtaining the score of a person. Experimental studies on real social network datasets demonstrate the efficiency and effectiveness of our proposed approach.

Yuliang Ma, Ye Yuan, Guoren Wang, Yishu Wang, Delong Ma, Pengjie Cui

### SBRNE: An Improved Unified Framework for Social and Behavior Recommendations with Network Embedding

Recent years have witnessed the fast growing and ubiquity of social media which has significantly changed the social manner and information sharing in our daily life. Given a user, social (i.e. friend) recommendation and behavior (i.e. item) recommendation are two types of popular services in social media applications. Despite the extensive studies, few existing work has addressed both tasks elegantly and effectively. In this paper, we propose an improved unified framework for Social and Behavior Recommendations with Network Embedding (SBRNE for short). With modeling social and behavior information simultaneously, SBRNE integrates social recommendation and behavior recommendation into a unified framework. By employing users’ latent interests as a bridge, social and behavior information is modeled effectively to improve performance of social and behavior recommendations all together. In addition, an efficient network embedding procedure is introduced as a pre-training step for users’ latent representations to improve effectiveness and efficiency of recommendation tasks. Extensive experiments on real-world datasets demonstrate the effectiveness of SBRNE.

Weizhong Zhao, Huifang Ma, Zhixin Li, Xiang Ao, Ning Li

### User Intention-Based Document Summarization on Heterogeneous Sentence Networks

Automatic extraction-based document summarization is a difficult Natural Language Processing task. Previous approaches have usually generated the summary by extracting the top K salient sentences on graph-based ranking algorithms, but sentence feature representation only captures the surface relationship between the objects, hence the results may not accurately reflect the user’s intentions. Therefore, we propose a method to address this challenge, and: (1) obtain deeper semantic concepts among candidate sentences using meaningful sentence vectors combining word vectors and TF-IDF; (2) rank the sentences considering both relationships between sentences and the user’s intention for each sentence to identify significant sentences, and apply these to a heterogeneous graph; (3) generate the result sentence by sentence to ensure summary semantics are properly related to the original document. We verified the proposed approach experimentally using English summarization benchmark datasets DUC2001 and DUC2002; the large Chinese summarization data set, LCSTS. We also collected news data and produced a reference summary using a group of bank auditor experts that we compared to the proposed approach using ROUGE evaluation.

Hsiu-Yi Wang, Jia-Wei Chang, Jen-Wei Huang

### A Hierarchical Index Structure for Region-Aware Spatial Keyword Search with Edit Distance Constraint

Location-based services have become widely available on a variety of devices. Due to the errors in user input as well as geo-textual databases, supporting error-tolerant spatio-textual search becomes an important problem in the field of spatial keyword search. Edit distance is the most widely used metrics to capture typographical errors. However, existing techniques for spatio-textual similarity query mainly focused on the set based textual relevance, but they cannot work well for edit distance due to the lack of filter power, which would involve larger overhead of computing edit distance. In this paper, we propose a novel framework to solve the region aware top- $$k$$ similarity search problem with edit distance constraint. We first propose a hierarchical index structure to capture signatures of both spatial and textual relevance. We then utilize the prefix filter techniques to support top- $$k$$ similarity search on the index. We further propose an estimation based method and a greedy search algorithm to make full use of the filter power of the hierarchical index. Experimental results on real world POI datasets show that our method outperforms state-of-the-art methods by up to two orders of magnitude.

Junye Yang, Yong Zhang, Huiqi Hu, Chunxiao Xing

### Collective POI Querying Based on Multiple Keywords and User Preference

Point-of-interest (POI) querying, which searches and recommends visiting places for individuals with context constraints, has recently become a very popular location-based service. The existing approaches, however, mainly focus on finding a single POI instead of a group of POIs that are neibouring with each other. Some few approaches do handle the querying of collective POIs, but fail to consider users’ preference. In this paper, we devise a novel approach which aims to retrieve collective POIs based on multiple keywords given by a user as well as user preference, POI popularity and congestion. In addition, we design a cost function to calculate the visit cost of the candidate POIs. We also propose an efficient algorithm based on IR-tree which finds the optimal solution to achieve the balance between multiple optimization targets. The extensive experiments based on the real data from Toronto Canada demonstrate the effectiveness and efficiency of our approach.

Dongjin Yu, Yiyu Wu, Chengfei Liu, Xiaoxiao Sun

### DPSCAN: Structural Graph Clustering Based on Density Peaks

Structural graph clustering is one of the fundamental problems in managing and analyzing graph data. The structural clustering algorithm SCAN is successfully used in many applications because it obtains not only clusters but also hubs and outliers. However, the results of SCAN heavily depend on two sensitive parameters, $$\epsilon$$ and $$\mu$$ , which may result in loss of accuracy and efficiency. In this paper, we propose a novel Density Peak-based Structural Clustering Algorithm for Networks (DPSCAN). Specifically, DPSCAN clusters vertices based on the structural similarity and the structural dependency between vertices and their neighbors, without tuning parameters. Through theoretical analysis, we prove that DPSCAN can detect meaningful clusters, hubs and outliers. In addition, to improve the efficiency of DPSCAN, we propose a new index structure named DP-Index, so that each vertex needs to be visited only once. Finally, we conduct comprehensive experimental studies on real and synthetic graphs, which demonstrate that our new approach outperforms the state-of-the-art approaches.

Changfa Wu, Yu Gu, Ge Yu

### Efficient Processing of Spatial Group Preference Queries

POIs (points of interest) as well as users’ check-in information and their ratings on POIs have been widely studies in systems providing location-based services. We note that users usually have their own preferences for POI categories and their own network of friends. Therefore, we aim to provide for a group of users a new kind of POI-finding query that considers not only POI preferences of each user but also other aspects of location-based social networks such as users’ locations and POI ratings. We name such a new query as Spatial Group Preference (SGP) query. For a group of users, an SGP query returns top-k POIs that are much likely to satisfy the needs of users. Specially, we propose a new evaluation model that considers user preferences for user preferences for POI categories. Based on this model, we develop basic algorithms based on R-tree to evaluate SGP queries. Further, we design a new index structure called CR-tree to accelerate the query performance. We prove that CR-tree has better pruning efficiency than the traditional R-tree. We conduct experiments on a simulation dataset as well as two real datasets with respect to various configurations. The results suggest the efficiency of our proposal.

Zhou Zhang, Peiquan Jin, Yuan Tian, Shouhong Wan, Lihua Yue

### Reverse-Auction-Based Competitive Order Assignment for Mobile Taxi-Hailing Systems

Mobile Taxi-Hailing (MTH) is one of the most attractive smartphone applications, through which passengers can reserve taxis ahead for their travels so that the taxi service’s efficiency can improve significantly. The taxi-hailing order assignment is an important component of MTH systems. Current MTH order assignment mechanisms fall short in flexibility and personalized pricing, resulting in an unsatisfactory service experience. To address this problem, we introduce a Competitive Order Assignment (COA) framework for the MTH systems. The COA framework mainly consists of the Multi-armed-bandit Automatic Valuation (MAV) mechanism and the Reverse-auction-based Order Assignment (ROA) mechanism. The taxis apply the MAV mechanism to automatically generate the transport service valuations for orders. The platform applies the ROA mechanism to complete each round of order assignment. Then, we analyze the online performance of MAV, and prove that ROA satisfies the properties of truthfulness and individual rationality. Finally, we also demonstrate the significant performances of MAV and ROA through extensive simulations on a real trace.

Hui Zhao, Mingjun Xiao, Jie Wu, An Liu, Baoyi An

### Top-K Spatio-Topic Query on Social Media Data

With the development of social media and GPS-enabled devices, people can search for what they are interested in more easily. There are many methods, such as spatial keyword query, proposed to help people get useful information. However, most existing methods are based on location and keywords query which neglect the semantic information. In this paper, we propose a new approach named Top-K Spatio-Topic Query (TKSTQ), which takes semantic information into consideration. We use a topic model to obtain topics of texts and organize index based on topic and location. In this way, the query results can satisfy people’s requirements better. The experimental results on a real dataset validate that our methods can significantly improve the relevance between result and query.

Lianming Zhou, Xuanhao Chen, Yan Zhao, Kai Zheng

### A Frequency-Aware Spatio-Temporal Network for Traffic Flow Prediction

Predicting traffic flow is crucial for transportation management and resource allocation, which has attracted more and more attention from researchers. The traffic flow in a city generally changes over time periods but always exhibits certain periodicity. Previous works focused on modeling spatial and temporal correlations using convolutional and recurrent neural networks respectively. Typically, a method that can effectively absorb more time-interval inputs and integrate more periodic information will achieve better performance. In this paper, we propose a Frequency-aware Spatio-temporal Network (FASTNet) for traffic flow prediction. In addition to modeling the spatio-temporal correlations, we dynamically filter the inputs to explicitly incorporate frequency information for traffic prediction. By applying Discrete Fourier Transform (DFT) on traffic flow, we obtain the spectrum of traffic flow sequence which reflects certain travel patterns of passengers. We then adopt a frequency-based filtering mechanism to filter the traffic flow series based on the explored spectrum information. To utilize the filtered tensor, a 3D convolutional network is designed to extract the spatio-temporal features automatically. Inspired by the frequency spectrum of traffic flows, this spatio-temporal convolutional network has various kernels with different sizes on temporal dimension, which models the temporal correlations with multi-scale frequencies. The final prediction layer summarizes the spatio-temporal features extracted by the spatio-temporal convolutional network. Our model outperforms the state-of-the-art methods through extensive experiments on three real datasets for citywide traffic flow prediction.

Shunfeng Peng, Yanyan Shen, Yanmin Zhu, Yuting Chen

### Efficient Algorithms for Solving Aggregate Keyword Routing Problems

With the emergence of smart phones and the popularity of GPS, the number of point of interest (POIs) is growing rapidly and spatial keyword search based on POIs has attracted significant attention. In this paper, we study a more sophistic type of spatial keyword searches that considers multiple query points and multiple query keywords, namely Aggregate Keyword Routing (AKR). AKR looks for an aggregate point m together with routes from each query point to m. The aggregate point has to satisfy the aggregate keywords, the routes from query points to the aggregate point have to pass POIs in order to complete the tasks specified by the task keywords, and the result route is expected to be the optimal one among all the potential results. In order to process AKR queries efficiently, we propose effective search algorithms, which support different aggregate functions. A comprehensive evaluation has been conducted to evaluate the performance of these algorithms with real datasets.

Qize Jiang, Weiwei Sun, Baihua Zheng, Kunjie Chen

### Perceiving Topic Bubbles: Local Topic Detection in Spatio-Temporal Tweet Stream

Local topic detection is an important task for many applications such as local event discovery, activity recommendation and emergency warning. Recent years have witnessed growing interest in leveraging spatio-temporal social media (eg. Twitter) for local topic detection. However, existing methods overlook the continuity of time and location, which is quite important and useful for local topic detection. For example, tweets posted at adjacent time and location should be considered correlated instead of isolated. To address this challenge, we propose a multi-layer heterogeneous network based embedding learner to preserve vicinity correlation as well as co-occurrence correlation, and map all the location, time, and keywords into a same latent space. Based on the heterogeneous network embedding, we develop a Bayesian mixture model to find local topics without specifying the number of topics in advance. Moreover, tweets are frequently updated, thus, we adopt an incremental update strategy to process continuous tweet stream in real time. The extensive experiments on real-world data sets demonstrate that our method outperforms the state-of-the-art existing methods.

Junsha Chen, Neng Gao, Cong Xue, Chenyang Tu, Daren Zha

### Real-Time Route Planning and Online Order Dispatch for Bus-Booking Platforms

To cater to the high travel demands in Beijing Capital International Airport during 23:00–2:00, Beijing Traffic Management Bureau (BTMB) intends to develop a new service named bus-booking platforms. Compared to traditional airport shuttle buses, bus-booking platforms can conduct flexible route planning and online order dispatch while provide much lower price than the common car-hailing platform, e.g., Didi Chuxing. We conduct the real-time route planning by solving the standard Capacitated Vehicles Routing Problem based on the order prediction. In addition, we focus on the design of the online order dispatch algorithm for bus-booking platforms, which is novel and extremely different from the traditional taxi order dispatch in car-hailing platforms. When an order is dispatched, multiple influence factors will be considered simultaneously, such as the bus capacity, the balanced distribution of the accepted orders and the travel time of passengers, all of which aim to provide a better user experience. Moreover, we prove that our online algorithms can achieve the tight competitive ratio in this paper, where the competitive ratio is the ratio between the solution of an online algorithm and the offline optimal solution.

Hao Zhou, Yucen Gao, Xiaofeng Gao, Guihai Chen

### STL: Online Detection of Taxi Trajectory Anomaly Based on Spatial-Temporal Laws

Aiming to promote the standardization of taxi services and protect the interests of passengers, many methods have been proposed to detect taxi trajectory anomaly based on collected large-scale taxi traces. However, most existing methods usually employ a counting-based policy to differentiate normal trajectories from anomalous ones, which may give rise to high false positives. In this paper, we propose an online detection method, named Spatial-Temporal Laws (STL). The basic idea of STL is that, given the displacement from the source point to the current point of a testing trajectory, if the current point is normal, either its driving distance or driving time should lie in a normal range. STL learns the two ranges from historical trajectories by defining two spatial-temporal models: one characterizing the relationship between displacement and driving distance, and another depicting the relationship between displacement and driving time. Consequently, STL is more precise compared with the counting-based methods, greatly reducing the number of false positives. Based on large-scale real-world taxi traces, STL is evaluated through a series of experiments which demonstrate its effectiveness and performance.

Bin Cheng, Shiyou Qian, Jian Cao, Guangtao Xue, Jiadi Yu, Yanmin Zhu, Minglu Li, Tao Zhang

### Backmatter

Weitere Informationen