About this book

This book constitutes the workshop proceedings of the 24th International Conference on Database Systems for Advanced Applications, DASFAA 2019, held in Chiang Mai, Thailand, in April 2019.

The 14 full papers presented were carefully reviewed and selected from 26 submissions to the following three workshops: the 6th International Workshop on Big Data Management and Service, BDMS 2019; the 4th International Workshop on Big Data Quality Management, BDQM 2019; and the Third International Workshop on Graph Data Management and Analysis, GDMA 2019. This volume also includes the short papers, demo papers, and tutorial papers of the main conference DASFAA 2019.

Table of Contents

Frontmatter

The 6th International Workshop on Big Data Management and Service (BDMS 2019)

Frontmatter

A Probabilistic Approach for Inferring Latent Entity Associations in Textual Web Contents

A latent entity association (EA) means that two entities are associated with each other indirectly through multiple intermediate entities in different textual Web contents (TWCs), including e-mails, Web news, social network pages, etc. In this paper, we adopt the Bayesian network as the framework to represent and infer latent EAs as well as the probabilities of the associations, and we propose the concept of the entity association Bayesian network (EABN). To construct the EABN efficiently, we employ a self-organizing map to divide the TWC dataset, so that the co-occurrence-based dependence of each pair of entities concerns just a small set of documents. Using probabilistic inference over the EABN, we evaluate and rank the EAs of all possible entity pairs, by which novel latent EAs can be found. Experimental results show the effectiveness and efficiency of our approach.

Lei Li, Kun Yue, Binbin Zhang, Zhengbao Sun

UHRP: Uncertainty-Based Pruning Method for Anonymized Data Linear Regression

Anonymization, as a privacy protection technology for data publishing, has been heavily researched during the past twenty years. However, less research has been conducted on making better use of anonymized data for data mining. In this paper, we focus on training a regression model on anonymized data and predicting on original samples with the trained model. Anonymized training instances are generally treated as hyper-rectangles, which differs from most machine learning tasks. We propose several hyper-rectangle vectorization methods that are compatible with both anonymized and original data for model training. Anonymization also brings additional uncertainty. To address this issue, we propose an Uncertainty-based Hyper-Rectangle Pruning method (UHRP) to reduce the disturbance introduced by anonymized data. In this method, we prune a hyper-rectangle by its global uncertainty, which is calculated from all uncertain attributes. Experiments show that a linear regressor trained on anonymized data can be expected to do as well as a model trained on the original data under specific conditions. Experimental results also show that our pruning method can further improve the model’s performance.

Kun Liu, Wenyan Liu, Junhong Cheng, Xingjian Lu

Meta-path Based MiRNA-Disease Association Prediction

Predicting the association of miRNAs with diseases is an important research topic in bioinformatics. In this paper, a novel meta-path based approach, MPSMDA, is proposed to predict miRNA-disease associations. MPSMDA uses experimentally validated data to build a miRNA-disease heterogeneous information network (MDHIN). Thus, miRNA-disease association prediction is transformed into a link prediction problem on an MDHIN. Meta-path based similarity is used to measure the miRNA-disease associations. Since different meta-paths between a miRNA and a disease express different latent semantic associations, MPSMDA makes full use of all possible meta-paths to predict the associations of miRNAs with diseases. Extensive experiments are conducted on real datasets for performance comparison with existing approaches. Two case studies on lung neoplasms and breast neoplasms are also provided to demonstrate the effectiveness of MPSMDA.

Hao Lv, Jin Li, Sai Zhang, Kun Yue, Shaoyu Wei

Medical Question Retrieval Based on Siamese Neural Network and Transfer Learning Method

Online medical community websites have attracted an increasing number of users in China. Patients post their questions on these sites and wait for professional answers from registered doctors. Most of these websites use a retrieval system to provide medical QA information related to the newly posted question. Previous research regards this problem as a question matching task: given a pair of questions, a supervised model learns question representations and predicts whether the questions are similar or not. However, no finely annotated question pair dataset exists in the Chinese medical domain. In this paper, we present two generation approaches to build large similar-question datasets in the Chinese health care domain. We propose a novel deep learning based architecture, the Siamese Text Matching Transformer model (STMT), to predict the similarity of two medical questions. It uses a modified Transformer as the encoder to learn question representation and interaction without extra manual lexical and syntactic resources. We design a data-driven transfer strategy to pre-train encoders and fine-tune models on different datasets. The experimental results show that the proposed model is capable of the question matching task on both classification and ranking metrics.

Kun Wang, Bite Yang, Guohai Xu, Xiaofeng He

An Adaptive Kalman Filter Based Ocean Wave Prediction Model Using Motion Reference Unit Data

Vessels such as ocean drilling platforms need to remain stationary relative to the bottom of the ocean; the ship or platform therefore needs to pay close attention to the fluctuation of ocean currents. The analysis of ocean waves is of great significance to the stability of ocean operation platforms and the safety of the staff on board. An effective ocean current prediction model is helpful both economically and ecologically. The fluctuations in ocean waves can be seen as sinusoidal time series data with different frequencies and are usually captured by sensors known as MRUs (Motion Reference Units). The study aims to analyze and accurately predict the future movement of the ocean based on the historical movement of the ocean currents collected by the MRU. All of these data have a fixed, high-resolution acquisition frequency. This research focuses on how to effectively fill in the missing values in the time series of MRU data. We also aim to accurately predict future ocean waves. Therefore, a novel ARIMA (Autoregressive Integrated Moving Average) model based missing data completion method is proposed to fill in the data by artificial approximation of the missing values. More importantly, a novel adaptive Kalman filter based Ocean Wave Prediction model is proposed to predict the ocean current in the near future by leveraging dynamic wave length. Experimental results validate the correctness of the ARIMA model based missing data completion method. The adaptive Kalman filter based Ocean Wave Prediction model is also shown to be effective by outperforming three baseline prediction models.

Yan Tang, Zequan Guo, Yin Wu
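
As a rough illustration of the Kalman filtering idea mentioned in the abstract above, the following is a minimal sketch of a plain scalar Kalman filter with a constant-state model. It is not the adaptive variant the authors propose, which the abstract does not specify; the function name `kalman_1d` and the parameters `process_var` and `measurement_var` are hypothetical and chosen only for this example.

```python
def kalman_1d(measurements, process_var=1e-3, measurement_var=0.1):
    """Scalar Kalman filter with a constant-state (random walk) model.

    Assumes a non-empty list of noisy scalar measurements. The noise
    variances are illustrative defaults, not values from the paper.
    """
    estimate = measurements[0]  # initialise with the first measurement
    error_var = 1.0             # initial uncertainty of the estimate (assumed)
    estimates = [estimate]
    for z in measurements[1:]:
        # Predict: the state is assumed unchanged, uncertainty grows.
        error_var += process_var
        # Update: blend prediction and measurement via the Kalman gain.
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (z - estimate)
        error_var *= (1.0 - gain)
        estimates.append(estimate)
    return estimates
```

A larger `measurement_var` makes the filter trust new readings less and smooth more; an adaptive filter would adjust such quantities at run time instead of fixing them.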

ASLM: Adaptive Single Layer Model for Learned Index

Index structures such as B-trees are important tools that DBAs use to enhance the performance of data access. However, with the approach of the big data era, the amount of data generated in different domains has exploded. A recent study has shown that indexes consume about 55% of total memory in a state-of-the-art in-memory DBMS. Building indexes in traditional ways has encountered a bottleneck. Recent work proposes to use neural network models to replace B-trees and many other indexes. However, the proposed model is heavy and inaccurate, and it fails to consider model updating. In this paper, a novel, simple learned index called the adaptive single layer model is proposed to replace the B-tree index. The proposed model, using two data partition methods, is well-organized and can be applied to different workloads. Updating is also taken into consideration. The proposed model, incorporating the two data partition methods, is evaluated on two datasets. The results show that the prediction error is reduced by around 50% and demonstrate that the proposed model is more accurate, stable and effective than the existing model.

Xin Li, Jingdong Li, Xiaoling Wang

SparseMAAC: Sparse Attention for Multi-agent Reinforcement Learning

In a multi-agent scenario, each agent needs to be aware of other agents’ information as well as the environment to improve the performance of reinforcement learning methods. However, as the number of agents increases, this procedure becomes significantly more complicated, and substantially improving efficiency becomes an ambitious goal. We introduce the sparse attention mechanism into the multi-agent reinforcement learning framework and propose a novel Multi-Agent Sparse Attention Actor Critic (SparseMAAC) algorithm. Our framework can efficiently select and focus on the agents with critical impact in the early training stages, while simultaneously eliminating data noise. The experimental results show that the proposed SparseMAAC algorithm not only exceeds the baseline algorithms in reward performance but is also significantly superior to them in convergence speed.

Wenhao Li, Bo Jin, Xiangfeng Wang

The 4th International Workshop on Big Data Quality Management (BDQM 2019)

Frontmatter

Identifying Reference Relationship of Desktop Files Based on Access Logs

When writing a document, people sometimes refer to information in other files, such as a picture, a phone number, an email address, a table, a document and so on. Reference therefore becomes a natural relationship among desktop files, which can be used to help people re-find personal information or identify information lineage. How to identify reference relationships is thus an interesting and valuable topic. In this paper, we propose an access log-based method to identify the relationship. First, we propose a method to generate access logs by monitoring user desktop operations, implement a prototype based on the method, and collect several persons’ access logs by running it on personal computers. Then we propose an access log-based method to identify reference relationships among desktop files. The experimental results verify the effectiveness and efficiency of our methods.

Yukun Li, Xun Zhang, Jie Li, Yuan Wang, Degan Zhang

Visualization of Photo Album: Selecting a Representative Photo of a Specific Event

In order to effectively manage photos in personal photo albums and improve the efficiency of re-finding photos, the visualization of photo albums has received attention. The most popular and reasonable visualization method is to display a representative photo of each photo cluster. We studied the characteristics of representative photos and then proposed a method of selecting the representative photo from a set of photos related to a specific event. The method mainly considers two aspects of photos: aesthetic quality and memorable factors. Aesthetic quality covers the area and location of the salient region and the sharpness of the photo; memorable factors cover the salient people and text information. The experimental data sets are real-world personal photo collections, including more than 7,000 photos and more than 2,000 specific events. The experimental results show the efficiency and reliability of selecting representative photos for the visualization of photo albums.

Yukun Li, Ming Geng, Fenglian Liu, Degan Zhang

Data Quality Management in Institutional Research Output Data Center

An institutional research output data center stores standardized and credible research output data of scholars. It effectively supports the dynamic presentation of research output, reveals institutional academic publications in multiple dimensions, advances open access, and provides data support for subject evaluation and discipline development. In this paper, we propose a data quality management framework for building an institutional research output data center, and put forward technical solutions for different data governance problems, such as department name similarity estimation in data matching, author name disambiguation in data merging, and security issues in data exchange. We also introduce some learning algorithms, such as text distance and community detection with matrix factorization. Compared with alternative approaches, our methods achieve good performance in quality management processing.

Xiaohua Shi, Zhuoyuan Xing, Hongtao Lu

Generalized Bayesian Structure Learning from Noisy Datasets

In recent years, with the open data movement around the world, more and more open data sets have become available. However, the quality of the datasets poses issues for learning models. This study focuses on learning the Bayesian network structure from data sets containing noise. A novel approach called GBNL (Generalized Bayesian Structure Learning) is proposed. GBNL first uses a greedy algorithm to obtain an appropriate sliding window size for any dataset; it then leverages a difference array-based method to quickly improve the data quality by locating the noisy data sections and removing them. GBNL can not only evaluate the quality of the data set but also effectively reduce the noise in the data. We conduct experiments to evaluate GBNL on five large datasets; the experimental results validate the accuracy and the generalizability of this novel approach.

Yan Tang, Yu Chen, Gaolong Ge

The Third International Workshop on Graph Data Management and Analysis (GDMA 2019)

Frontmatter

ANDMC: An Algorithm for Author Name Disambiguation Based on Molecular Cross Clustering

With the rapid development of information technology, name ambiguity has become one of the main problems in the fields of information retrieval, data mining and scientific measurement. It inevitably affects the accuracy of information calculations, reduces the credibility of literature retrieval systems, and affects the quality of information. To deal with this, name disambiguation technology has been proposed, which maps virtual relational networks to real social networks. However, most existing related work considers neither the problem of name coreference nor the failure to match two identical strings correctly because of their different writing formats. This paper proposes an algorithm for Author Name Disambiguation based on Molecular Cross Clustering (ANDMC) that takes name coreference into account. In addition, we explore a string matching algorithm called Improved Levenshtein Distance (ILD), which solves the problem of matching two identical strings with different writing formats. The experimental results show that our algorithm outperforms the baseline methods (F1-score 9.48% and 21.45% higher than SC and HAC, respectively).

Siyang Zhang, Xinhua E, Tao Huang, Fan Yang
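
For readers unfamiliar with the string metric that the abstract above builds on, here is a minimal sketch of the classic Levenshtein (edit) distance. It is the textbook dynamic-programming algorithm, not the authors' Improved Levenshtein Distance, whose details the abstract does not give. The helper `normalize_name` is a hypothetical preprocessing step, added only to illustrate how differences in writing format might be reduced before matching.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance using two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalize_name(name: str) -> str:
    """Hypothetical normalisation: lowercase, drop punctuation, sort tokens."""
    return " ".join(sorted(re.findall(r"[a-z]+", name.lower())))
```

With this sketch, `levenshtein(normalize_name("Wang, Wei"), normalize_name("Wei Wang"))` is 0, whereas the raw strings differ by several edits.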

Graph Based Aspect Extraction and Rating Classification of Customer Review Data

This paper introduces graph-based aspect and rating classification, which uses a multi-modal word co-occurrence network to solve aspect and sentiment classification tasks. Our model consists of three components: (1) word co-occurrence network construction, with aspect and sentiment labels as different modes; (2) dispersion computation for aspects and sentiments; and (3) a feedforward network for classification. Our experiments show that the proposed model outperforms the baseline models, Word2Vec and LDA, in both aspect and sentiment classification tasks. Our classification model uses a comparatively smaller vector size for representing words and sentences. The proposed model also performs better in classifying out-of-vocabulary contexts.

Sung Whan Jeon, Hye Jin Lee, Hyeonguk Lee, Sungzoon Cho

Streaming Massive Electric Power Data Analysis Based on Spark Streaming

Electric power user classification is one of the most important methods for realizing the optimal allocation of power resources. Through the analysis of users’ needs, behavior and habits, countries and enterprises can offer different incentives to different users. In this way, people are more willing to use green and clean electric power resources. User clustering analysis requires real-time processing of massive, high-speed data. In this paper we propose a novel distributed user data stream clustering method named “DStreamEPK”, based on Spark Streaming, an improved CluStream algorithm and an improved K-means algorithm. In the final experimental evaluation, we first test the clustering effectiveness of DStreamEPK on UCI datasets; the results show that the proposed DStreamEPK is better than the traditional K-means clustering algorithm. Tests on users’ real data sets further show that DStreamEPK can cluster users’ electricity data quickly and efficiently.

Xudong Zhang, Zhongwen Qian, Siqi Shen, Jia Shi, Shujun Wang

Posters

Frontmatter

Deletion-Robust k-Coverage Queries

The k-coverage query is an ideal solution for representative queries, with most of the known desirable characteristics, such as stability, scale-invariance, traversal efficiency and so on. In this paper, we propose deletion-robust k-coverage queries. First, we calculate a coreset from the whole dataset with a sieving procedure using various thresholds, which makes k-coverage queries robust under the deletion of an arbitrary number of data points. Then our k-coverage queries can be carried out efficiently on the small coreset instead of the whole skyline set. Experiments on both synthetic and real datasets verify the effectiveness and efficiency of our proposed method.

Xingnan Huang, Jiping Zheng

Episodic Memory Network with Self-attention for Emotion Detection

Accurate perception of emotion from natural language text is a key factor in successfully understanding what a person is expressing. In this paper, we propose an episodic memory network model with a self-attention mechanism, which is expected to reflect an aspect, or component, of the emotion semantics of a given sentence. The self-attention allows different aspects of the input text to be extracted into multiple vector representations, and the episodic memory aims to retrieve the information needed to answer the emotion category. We evaluate our approach on emotion detection and obtain state-of-the-art results in comparison with baselines built on pre-trained word embeddings without external knowledge.

Jiangping Huang, Zhong Lin, Xin Liu

Detecting Suicidal Ideation with Data Protection in Online Communities

Recent advances in Artificial Intelligence empower proactive social services that use virtual intelligent agents to automatically detect people’s suicidal ideation. Conventional machine learning methods require a large amount of individual data to be collected from users’ Internet activities, smart phones and wearable healthcare devices and amassed in a central location. The centralized setting raises significant privacy and data misuse concerns, especially where vulnerable people are concerned. To address this problem, we propose a novel data-protecting solution to learn a model. Instead of asking users to share all their personal data, our solution trains a local data-preserving model for each user, which shares only its own parameters with the server rather than the user’s personal information. To optimize the model’s learning capability, we have developed a novel updating algorithm, called average difference descent, to aggregate parameters from different client models. An experimental study using real-world online social community datasets mimics the scenario of private communities for suicide discussion. The experimental results demonstrate the effectiveness of our technology solution and pave the way for mental health service providers to apply this technology to real applications.

Shaoxiong Ji, Guodong Long, Shirui Pan, Tianqing Zhu, Jing Jiang, Sen Wang
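
The abstract above describes a server that aggregates model parameters from clients instead of collecting their raw data. The following is a minimal sketch of that general idea using plain weighted parameter averaging (often called federated averaging). It is a stand-in for illustration only: the authors' average difference descent algorithm is not described in the abstract and is not reproduced here. The function name and the flat-list parameter representation are assumptions.

```python
def federated_average(client_params, client_sizes):
    """Weighted average of client parameter vectors.

    client_params: list of equal-length lists of floats, one per client.
    client_sizes:  number of local training examples per client, used as
                   weights (an assumption of this sketch).
    Only parameters reach the server; raw user data never leaves the client.
    """
    total = sum(client_sizes)
    dim = len(client_params[0])
    averaged = [0.0] * dim
    for params, size in zip(client_params, client_sizes):
        weight = size / total
        for k in range(dim):
            averaged[k] += weight * params[k]
    return averaged
```

In a full training loop, the server would send the averaged parameters back to the clients, which would continue local training and report new parameters in the next round.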

Hierarchical Conceptual Labeling

The bag-of-words model is widely used in many AI applications. In this paper, we propose the task of hierarchical conceptual labeling (HCL), which aims to generate a set of conceptual labels with a hierarchy to represent the semantics of a bag of words. To achieve it, we first propose a denoising algorithm to filter out the noise in a bag of words in advance. Then the hierarchical conceptual labels are generated for a clean word bag based on the clustering algorithm of Bayesian rose tree. The experiments demonstrate the high performance of our proposed framework.

Haiyun Jiang, Cengguang Zhang, Deqing Yang, Yanghua Xiao, Jingping Liu, Jindong Chen, Chao Wang, Chenguang Li, Jiaqing Liang, Bin Liang, Wei Wang

Anomaly Detection in Time-Evolving Attributed Networks

Recently, there has been a surge of research interest in finding anomalous nodes in attributed networks. However, a vast majority of existing methods fail to capture the evolution of the networks properly, as they regard them as static. Meanwhile, they treat all the attributes and the instances equally, ignoring the existence of noise. To tackle these problems, we propose a novel dynamic anomaly detection framework based on residual analysis, namely AMAD. It leverages the small, smooth disturbance between time stamps to characterize the evolution of networks for incremental updates. Experiments conducted on several datasets show the superiority of AMAD in detecting anomalies.

Luguo Xue, Minnan Luo, Zhen Peng, Jundong Li, Yan Chen, Jun Liu

A Multi-task Learning Framework for Automatic Early Detection of Alzheimer’s

Alzheimer’s disease is a degenerative brain disease that threatens individuals’ quality of life and even their lives. In this paper, we develop a simple and inexpensive solution for early detection of Alzheimer’s, based on the individual’s background and behavioral data. To alleviate the data sparsity and feature misguidance problems, we propose a novel multi-task learning framework and a pairwise analysis strategy. Extensive experiments show that the proposed framework outperforms the state-of-the-art methods with higher prediction accuracy.

Nan Xu, Yanyan Shen, Yanmin Zhu

Top-k Spatial Keyword Query with Typicality and Semantics

This paper proposes a top-k spatial keyword querying approach which can expeditiously provide top-k typical and semantically related spatial objects to the given query. The location-semantic relationships between spatial objects are first measured and then the Gaussian probabilistic density-based estimation method is leveraged to find a few representative objects from the dataset. Next, the order of remaining objects in the dataset can be generated corresponding to each representative object according to the location-semantic relationships. The online processing step computes the spatial proximity and semantic relevancy between query and each representative object, and then the orders can be used to facilitate top-k selection by using the threshold algorithm. Results of preliminary experiments showed the effectiveness of our method.

Xiangfu Meng, Xiaoyan Zhang, Lin Li, Quangui Zhang, Pan Li

Align Reviews with Topics in Attention Network for Rating Prediction

Rating prediction has long been a hot research topic in recommendation systems. Latent factor models, in particular, matrix factorization (MF), are the most prevalent techniques for rating prediction. However, MF based methods suffer from the problem of data sparsity and lack of explanation. In this paper, we present a novel model to address these problems by integrating ratings and topic-level review information into a deep neural framework. Our model can capture the varying attentions that a review contributes to a user/item at the topic level. We conduct extensive experiments on three datasets from Amazon. Results demonstrate our proposed method consistently outperforms the state-of-the-art recommendation approaches.

Yile Liang, Tieyun Qian, Huilin Yu

PSMSP: A Parallelized Sampling-Based Approach for Mining Top-k Sequential Patterns in Database Graphs

We study how to improve the efficiency of finding top-k sequential patterns in database graphs, where each edge (or vertex) is associated with multiple transactions and a transaction consists of a set of items. The task is to discover the subsequences of transaction sequences that frequently appear in many paths. We propose PSMSP, a Parallelized Sampling-based Approach For Mining Top-k Sequential Patterns, which involves: (a) a parallelized unbiased sequence sampling approach, and (b) a novel PSP-Tree structure to efficiently mine the patterns based on the anti-monotonicity properties. We validate our approach via extensive experiments with real-world datasets.

Mingtao Lei, Xi Zhang, Jincui Yang, Binxing Fang

Value-Oriented Ranking of Online Reviews Based on Reviewer-Influenced Graph

To mitigate the uncertainty of online purchases, people rely on reviews written by customers who have already bought the product to make their decisions. The key challenge in this situation is how to identify the most helpful reviews among a large number of candidate reviews of varying quality. Existing work normally employs diversified text and sentiment analysis algorithms to analyze the helpfulness of reviews. Voting on reviews is another popular evaluation method adopted by many websites, but it also has difficulty reflecting the real helpfulness of the reviews because of data sparseness. In this paper, a reviewer-influenced graph model is constructed based on the reviewers’ historical reviews and voting information to measure the influence of reviewers’ quality on the helpfulness of reviews. Experimental results with actual review data from Amazon.com demonstrate the effectiveness of our approach.

Yiming Cao, Lizhen Cui, Wei He

Ancient Chinese Landscape Painting Composition Classification by Using Semantic Variational Autoencoder

In the theory of art, composition is based on the placement or arrangement of visual elements or ingredients in a painting to express the thoughts of the artist. Inspired by that, we propose a novel approach called Semantic Variational Autoencoder (SemanticVAE) to deal with the problem of ancient Chinese landscape painting composition classification. Extensive experiments are conducted on a real ancient Chinese landscape painting image dataset collected from museums. The experimental results show that, in contrast to the state-of-the-art deep CNNs, our method significantly improves the performance of ancient Chinese landscape painting composition classification.

Bo Yao, Qianzheng Ji, Xiangdong Zhou, Yue Pang, Manliang Cao, Yixuan Wu, Zijing Tan

Learning Time-Aware Distributed Representations of Locations from Spatio-Temporal Trajectories

The goal of location representation learning is to learn an embedded feature vector for each location. We propose a Time-Aware Location Embedding (TALE) method to learn distributed representations of locations from users’ spatio-temporal trajectories, in which a novel tree structure is designed to incorporate the temporal information in the hierarchical softmax model. We utilize TALE to improve two location-based prediction tasks to verify its effectiveness.

Huaiyu Wan, Fuchen Li, Shengnan Guo, Zhong Cao, Youfang Lin

Hyper2vec: Biased Random Walk for Hyper-network Embedding

Network embedding aims to obtain a low-dimensional representation of vertices in a network while preserving the structural and inherent properties of the network. Recently, there has been growing interest in this topic, but most existing network embedding models focus on normal networks in which there are only pairwise relationships between the vertices. However, in many realistic situations, the relationships between the objects are not pairwise and can be better modeled by a hyper-network in which each edge can join an uncertain number of vertices. In this paper, we propose a deep model called Hyper2vec to learn the embeddings of hyper-networks. Our model applies a biased second-order random walk strategy to hyper-networks in the framework of Skip-gram, which can be flexibly applied to various types of hyper-networks.

Jie Huang, Chuan Chen, Fanghua Ye, Jiajing Wu, Zibin Zheng, Guohui Ling
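
To make the notion of a biased second-order random walk in the abstract above more concrete, here is a minimal sketch on an ordinary graph with pairwise edges, in the style popularised by node2vec. It does not implement Hyper2vec's walk on hyper-networks, which the abstract does not detail. The parameter names `p` (return parameter) and `q` (in-out parameter) follow the node2vec convention and are assumptions of this example.

```python
import random


def biased_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Second-order biased random walk on a graph given as {node: set(nodes)}.

    The next step depends on both the current and the previous vertex:
    weight 1/p to return, 1 to stay near the previous vertex, 1/q to move away.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbours = sorted(adj[cur])
        if not neighbours:
            break  # dead end: stop early
        if len(walk) == 1:
            walk.append(rng.choice(neighbours))  # first step is unbiased
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbours:
            if nxt == prev:
                weights.append(1.0 / p)  # go back to the previous vertex
            elif nxt in adj[prev]:
                weights.append(1.0)      # neighbour shared with prev
            else:
                weights.append(1.0 / q)  # move farther from prev
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk
```

The resulting walks would typically be fed to a Skip-gram model as "sentences" of vertex identifiers; that step is omitted here.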

Privacy-Preserving and Dynamic Spatial Range Aggregation Query Processing in Wireless Sensor Networks

The existing privacy-preserving aggregation query processing methods in sensor networks rely on a pre-established network topology and require all nodes in the network to participate in query processing. Maintaining the topology results in a large amount of energy overhead, and in many cases the user is interested only in the aggregated query results of some areas in the network, so the participation of every node in the network is not necessary. To solve this problem, this paper proposes a spatial range aggregation query algorithm for a dynamic sensor network with privacy protection (E2PDA – Energy-efficient Privacy-preserving Data Aggregation). The algorithm does not rely on a pre-established topology but considers only the query area that the user is interested in, dropping the requirement that all nodes participate in distributing the query messages, while gathering the sensory data in the query range. To protect node data privacy, Shamir’s secret sharing technology is used to prevent internal attackers from stealing the sensitive data of the surrounding nodes. The analysis and experimental results show that the proposed algorithm outperforms the existing algorithms in terms of energy and privacy protection.

Lisong Wang, Zhenhai Hu, Liang Liu
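
The abstract above relies on Shamir's secret sharing to protect sensor readings. The following is a minimal sketch of the standard scheme over a prime field, together with its additive property, which is what makes it useful for aggregation: adding shares point-wise yields shares of the sum. It does not reproduce the E2PDA protocol itself. The field size `PRIME` and the function names are assumptions made for this example.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1, a Mersenne prime; field size is an assumption


def make_shares(secret, threshold, num_shares, rng=None):
    """Split `secret` into shares; any `threshold` of them reconstruct it."""
    rng = rng or random.SystemRandom()
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret % PRIME] + [rng.randrange(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, num_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * (-xj)) % PRIME
                den = (den * (xi - xj)) % PRIME
        inv_den = pow(den, PRIME - 2, PRIME)  # modular inverse via Fermat
        secret = (secret + yi * num * inv_den) % PRIME
    return secret
```

Because the shares of two readings can be added share by share, an aggregator holding only summed shares can recover the sum of the readings without seeing any individual value.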

Adversarial Discriminative Denoising for Distant Supervision Relation Extraction

Distant supervision has been widely used to generate labeled data automatically for relation extraction by aligning knowledge base with text. However, it introduces much noise, which can severely impact the performance of relation extraction. Recent studies have attempted to remove the noise explicitly from the generated data but they suffer from (1) the lack of an effective way of introducing explicit supervision to the denoising process and (2) the difficulty of optimization caused by the sampling action in denoising result evaluation. To solve these issues, we propose an adversarial discriminative denoising framework, which provides an effective way of introducing human supervision and exploiting it along with the potentially useful information underlying the noisy data in a unified framework. Besides, we employ a continuous approximation of sampling action to guarantee the holistic denoising framework to be differentiable. Experimental results show that very little human supervision is sufficient for our approach to outperform the state-of-the-art methods significantly.

Bing Liu, Huan Gao, Guilin Qi, Shangfu Duan, Tianxing Wu, Meng Wang

Nonnegative Spectral Clustering for Large-Scale Semi-supervised Learning

This paper proposes a novel clustering approach called Scalable Nonnegative Spectral Clustering (SNSC). Specifically, SNSC preserves the original nonnegative characteristic of the indicator matrix, which leads to a more tractable optimization problem with an accurate solution. Due to the nonnegativity, SNSC offers high interpretability to the indicator matrix, that is, the final cluster labels can be directly obtained without post-processing. SNSC also scales linearly with the data size, thus it can be easily applied to large-scale problems. In addition, limited label information can be naturally incorporated into SNSC for improving clustering performance. Extensive experiments demonstrate the superiority of SNSC as compared to the state-of-the-art methods.

Weibo Hu, Chuan Chen, Fanghua Ye, Zibin Zheng, Guohui Ling

Distributed PARAFAC Decomposition Method Based on In-memory Big Data System

We propose IM-PARAFAC, a PARAFAC tensor decomposition method that enables rapid processing of large-scale tensors in Apache Spark for distributed in-memory big data management systems. We consider the memory overflow that can occur when large amounts of data are processed in memory. The proposed method, IM-PARAFAC, is therefore capable of dividing and decomposing large input tensors. It can handle large tensors even in small distributed environments. The experimental results indicate that the proposed IM-PARAFAC enables handling of large tensors and reduces the execution time.

Hye-Kyung Yang, Hwan-Seung Yong

GPU-Accelerated Dynamic Graph Coloring

Graph coloring is a classic problem in graph theory, which can be used to mark two objects that share a certain relationship with different colors. Existing graph coloring solutions mainly focus on efficiently calculating high-quality colorings of static graphs. However, many graphs in the real world are highly dynamic, and the coloring result changes when the graph is updated. Repeatedly applying static graph coloring schemes incurs prohibitive costs. Although some CPU-based incremental graph coloring methods have been proposed recently, they become inefficient when facing dense graphs and large batch updates. In this paper, we explore dynamic graph coloring by utilizing the powerful parallel processing capabilities of the GPU and propose a CPU-GPU heterogeneous method. We conduct extensive experiments comparing our algorithm with the existing methods. The results confirm that our algorithm is superior to others in many aspects such as coloring efficiency.

Ying Yang, Yu Gu, Chuanwen Li, Changyi Wan, Ge Yu

Relevance-Based Entity Embedding

Entity embedding plays an indispensable role in many entity-related problems. Currently, mainstream entity embedding methods build on the notion that entities with similar contexts or close proximity should be placed adjacently in the embedding space. Nonetheless, this goal fails to meet the objectives of many downstream tasks, where the relevance among entities is more significant. To fill this gap, in this paper, a novel relevance-based entity embedding approach, Lead, is proposed, where the relevance is captured via query-document information. The experimental results verify the superiority of our proposal.

Weixin Zeng, Xiang Zhao, Jiuyang Tang, Jinzhi Liao, Chang-Dong Wang

An Iterative Map-Trajectory Co-optimisation Framework Based on Map-Matching and Map Update

Digital maps have long suffered from low data quality caused by lengthy update periods. Recent research on map inference/update shows the possibility of updating the map using vehicle trajectories. However, since trajectories are intrinsically inaccurate and sparse, the existing map correction methods are still inaccurate and incomplete. In this work, we propose an iterative map-trajectory co-optimisation framework that takes raw trajectories and the map as input and improves the quality of both datasets simultaneously. The map and map-matching qualities are quantified by our proposed measures. We also propose two scores to measure the credibility and influence of new road updates. Overall, our framework supports most of the existing map inference/update methods and can directly improve the quality of their updated map. We conduct extensive experiments on real-world datasets to demonstrate the effectiveness of our solution over other candidates.

Pingfu Chao, Wen Hua, Xiaofang Zhou

Exploring Regularity in Traditional Chinese Medicine Clinical Data Using Heterogeneous Weighted Networks Embedding

Regularities of prescriptions are important for both clinical practice and novel healthcare development in clinical traditional Chinese medicine (TCM). To address this issue and meet clinical demand for determining treatments, we propose an unsupervised analysis model termed AMNE to determine effective herbs for diverse symptoms in prescriptions. Results confirmed by human physicians demonstrate AMNE can outperform several previous TCM regularity discovery models in prescriptions.

Chunyang Ruan, Ye Wang, Yanchun Zhang, Yun Yang

AGREE: Attention-Based Tour Group Recommendation with Multi-modal Data

Tour recommendation aims to design a sequence of Points of Interest (POIs) for a tourist that suits his/her preference. Most existing tour recommenders mainly focus on recommending a POI sequence to a single tourist but cannot be applied to the tour group, which is a common way to travel. Designing a tour group recommender is more challenging in aggregating group preference and tracking influence changes during a tour. Hence we propose a novel approach named AGREE (Attention-based Tour Group Recommendation), which leverages the attention mechanism, to adjust members’ influence dynamically. Specifically, our model aggregates group’s preference based on members’ history data in different modalities, utilizing attention sub-networks to focus on influential ones in each modality across a POI sequence. Then we adopt a bi-directional recurrent unit (Bi-GRU) to generate the POI sequence. Experimental results show that the proposed scheme outperforms benchmark methods on a real-world dataset.

Fang Hu, Xiuqi Huang, Xiaofeng Gao, Guihai Chen

Random Decision DAG: An Entropy Based Compression Approach for Random Forest

Tree ensembles, such as Random Forest (RF), are popular methods in machine learning because of their efficiency and superior performance. However, they always grow big trees and large forests, which limits their use in many memory constrained applications. In this paper, we propose Random decision Directed Acyclic Graph (RDAG), which employs an entropy-based pre-pruning and node merging strategy to reduce the number of nodes in random forest. Empirical results show that the resulting model, which is a DAG, dramatically reduces the model size while achieving competitive classification performance when compared to RF.

Xin Liu, Xiao Liu, Yongxuan Lai, Fan Yang, Yifeng Zeng

Generating Behavior Features for Cold-Start Spam Review Detection

Existing studies on spam detection show that behavior features are effective in distinguishing spam and legitimate reviews. However, it usually takes a long time to collect such features and is hard to apply them to cold-start spam review detection tasks. In this paper, we exploit the generative adversarial network for addressing this problem. The key idea is to generate synthetic behavior features (SBFs) for new users from their easily accessible features (EAFs). We conduct extensive experiments on two Yelp datasets. Experimental results demonstrate that our proposed framework significantly outperforms the state-of-the-art methods.

Xiaoya Tang, Tieyun Qian, Zhenni You

TCL: Tensor-CNN-LSTM for Travel Time Prediction with Sparse Trajectory Data

Predicting the travel time of a given path plays an indispensable role in intelligent transportation systems. Although many prior studies have pursued accurate prediction results, most of them achieve inferior performance due to insufficient extraction of travel speed features from the sparse trajectory data, which confirms the challenges involved in this topic. To overcome those issues, we propose a deep learning framework named Tensor-CNN-LSTM (TCL) in this paper, which can extract travel speed effectively from historical sparse trajectory data and predict travel time with better accuracy. Empirical results over two real-world large-scale datasets show that our proposed TCL can achieve significantly better performance and remarkable robustness.

Yibin Shen, Jiaxun Hua, Cheqing Jin, Dingjiang Huang

A Semi-supervised Classification Approach for Multiple Time-Varying Networks with Total Variation

In recent years, we have seen a surge of research on semi-supervised learning for improving classification performance due to the extreme imbalance between labeled and unlabeled data. In this paper, we innovatively propose a semi-supervised classification model for multiple time-varying networks, i.e., Multiple time-varying Networks Classification with Total variation (MNCT), which can integrate the multiple time-varying networks and select relevant ones. From a numerical point of view, the optimization is decomposed into two sub-problems, which can be solved efficiently under the alternating direction method of multipliers (ADMM) framework. Experimental results on both synthetic and real-world datasets empirically demonstrate the advantages of MNCT over state-of-the-art methods.

Yuzheng Li, Chuan Chen, Fanghua Ye, Zibin Zheng, Guohui Ling

Multidimensional Skylines over Streaming Data

We consider a stream where each record is described by a set of dimensions D. The records have a validity time interval of size $$\omega$$. The queries we consider consist of retrieving the valid skyline records with respect to subsets $$D'$$ (subspaces) of D. Answering multidimensional skyline queries over streaming data is a hard task because of the data velocity: even index structures that optimize these queries need to be continuously updated. To overcome this difficulty, we propose a framework that handles the streaming data in a micro-batch mode together with an incrementally maintainable index structure.

Karim Alami, Sofian Maabout
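
As an illustration of the subspace skyline queries named in the abstract above, the following minimal sketch computes a skyline by brute force. It is not the authors' framework: the micro-batching, the validity window and the index structure are omitted, and the function names, the sample records and the "smaller is better" convention are assumptions made only for this example.

```python
from typing import List, Sequence, Tuple

Record = Tuple[float, ...]


def dominates(a: Record, b: Record, dims: Sequence[int]) -> bool:
    """Return True if a dominates b on the chosen dimensions.

    Assumption: smaller values are better on every dimension.
    a dominates b when it is no worse everywhere and strictly better somewhere.
    """
    no_worse = all(a[d] <= b[d] for d in dims)
    strictly_better = any(a[d] < b[d] for d in dims)
    return no_worse and strictly_better


def subspace_skyline(records: List[Record], dims: Sequence[int]) -> List[Record]:
    """Brute-force skyline: keep every record that no other record dominates."""
    return [r for r in records
            if not any(dominates(o, r, dims) for o in records)]


# Hypothetical records with three dimensions (indices 0, 1, 2).
records = [(1, 9, 5), (3, 3, 4), (4, 2, 6), (5, 5, 1), (6, 6, 6)]

full = subspace_skyline(records, dims=(0, 1, 2))  # skyline over D
sub = subspace_skyline(records, dims=(0, 1))      # skyline over a subspace D'
```

In this example the record (5, 5, 1) belongs to the skyline over all three dimensions but drops out of the skyline over the subspace (0, 1), which shows why each subspace has to be answered separately.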

A Domain Adaptation Approach for Multistream Classification

In this paper, we formulate cross-domain multistream classification as a domain adaptation problem. Then we propose a novel algorithm that utilizes low-rank representation and graph embedding to preserve data structures, which helps in dealing with concept drifts and concept revolution. In addition, we deploy the MMD metric to minimize the distribution discrepancy between the source data stream and the target data stream. Experimental results on the Office+Caltech dataset with DeCAF$$_6$$ features verify the effectiveness of our algorithm.

Yue Xie, Jingjing Li, Mengmeng Jing, Ke Lu, Zi Huang
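
The abstract above uses MMD (maximum mean discrepancy) to measure the gap between the source and target streams. As a rough sketch only, the snippet below computes the squared MMD under a linear kernel, where it reduces to the squared distance between the two sample means. The choice of a linear kernel, the function names and the toy data are assumptions; the paper may use a different kernel or estimator.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def linear_mmd2(source: List[Vector], target: List[Vector]) -> float:
    """Squared MMD with a linear kernel k(x, y) = x . y.

    With this kernel the (biased) empirical estimate equals the squared
    Euclidean distance between the source mean and the target mean.
    """
    ms, mt = mean_vector(source), mean_vector(target)
    return sum((a - b) ** 2 for a, b in zip(ms, mt))


# Toy feature vectors standing in for source and target stream samples.
source = [[0.0, 0.0], [2.0, 2.0]]
target = [[1.0, 1.0], [3.0, 3.0]]

gap = linear_mmd2(source, target)        # means are [1, 1] and [2, 2]
same = linear_mmd2(source, source)       # identical samples give zero
```

A domain adaptation method would typically minimize a quantity like `gap` over a learned projection of the features rather than over the raw features used here.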

Gradient Boosting Censored Regression for Winning Price Prediction in Real-Time Bidding

The demand-side platform (DSP) is a technological ingredient that fits into the larger real-time-bidding (RTB) ecosystem. DSPs enable advertisers to purchase ad impressions from a wide range of ad slots, generally via a second-price auction mechanism. In this respect, predicting the auction winning price notably improves the decision on the right bid value to win the auction and helps with the advertiser’s campaign planning and traffic reallocation between campaigns. This is a difficult task because the observed winning price distribution is biased due to censorship; the DSP only observes the win price when it wins the auction. For losing bids, the win price remains censored. To date, there has been little work that utilizes censored information in the learning process. In this article, we generalize the winning price model to incorporate a gradient boosting framework adapted to learn from both observed and censored data. Experiments show that our approach yields the hypothesized boost in predictive performance in comparison to classic linear censored regression.

Piyush Paliwal, Oleksii Renov
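
The censoring described in the abstract above can be made concrete with a small sketch. It only models what a DSP gets to observe after each auction and does not implement the gradient boosting model. The `Observation` type, the `observe` helper, the tie rule (a tie counts as a loss) and the sample numbers are all hypothetical.

```python
from typing import List, NamedTuple, Optional


class Observation(NamedTuple):
    bid: float
    won: bool
    win_price: Optional[float]    # known only when the auction is won
    lower_bound: Optional[float]  # for lost auctions: the price is at least the bid


def observe(bid: float, market_price: float) -> Observation:
    """What the DSP sees, given its bid and the highest competing bid.

    Assumption: the DSP wins only if its bid is strictly higher, and then
    pays (and therefore observes) the competing price.
    """
    if bid > market_price:
        return Observation(bid, True, market_price, None)
    # Lost auction: the winning price is censored; only a lower bound is known.
    return Observation(bid, False, None, bid)


# Hypothetical (bid, market_price) pairs.
pairs = [(2.0, 1.5), (1.0, 1.5), (3.0, 2.5), (1.0, 4.0)]
logs: List[Observation] = [observe(b, p) for b, p in pairs]

observed = [o.win_price for o in logs if o.won]
observed_mean = sum(observed) / len(observed)        # uses won auctions only
true_mean = sum(p for _, p in pairs) / len(pairs)    # unknown to the DSP
```

Here the mean over won auctions is lower than the true mean because the expensive auctions are exactly the ones that were lost, which is the bias that a censored regression model tries to correct.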

Deep Sequential Multi-task Modeling for Next Check-in Time and Location Prediction

In this paper, we address the problem of next check-in time and location prediction, and propose a deep sequential multi-task model, named Personalized Recurrent Point Process with Attention (PRPPA), which seamlessly integrates user static representation learning, dynamic recent check-in behavior modeling, and temporal point process into a unified architecture. An attention mechanism is further included in the intensity function of point process to enhance the capability of explicitly capturing the effect of past check-in events. Through the experiments, we verify the proposed model is effective in location and time prediction.

Wenwei Liang, Wei Zhang, Xiaoling Wang

SemiSync: Semi-supervised Clustering by Synchronization

In this paper, we consider the semi-supervised clustering problem, where the prior knowledge is formalized as the Cannot-Link (CL) and Must-Link (ML) pairwise constraints. We propose an algorithm called SemiSync that tackles this problem from a novel perspective: synchronization. The basic idea is to regard the data points as a set of (constrained) phase oscillators, and simulate their dynamics to form clusters automatically. SemiSync allows dynamically propagating the constraints to unlabelled data points driven by their local data distributions, which effectively boosts the clustering performance even if little prior knowledge is available. We experimentally demonstrate the effectiveness of the proposed method.

Zhong Zhang, Didi Kang, Chongming Gao, Junming Shao

Neural Review Rating Prediction with Hierarchical Attentions and Latent Factors

Text reviews can provide rich and useful semantic information for modeling users and items, which can benefit rating prediction in recommendation. Different words and reviews may have different informativeness for users or items. Besides, different users and items should be personalized. Most existing works regard all reviews equally or utilize a general attention mechanism. In this paper, we propose a hierarchical attention model fused with a latent factor model for rating prediction with reviews, which can focus on important words and informative reviews. Specifically, we use the factor vectors of the Latent Factor Model to guide the attention network and combine the factor vectors with the feature representation learned from reviews to predict the final ratings. Experiments on real-world datasets validate the effectiveness of our approach.

Xianchen Wang, Hongtao Liu, Peiyi Wang, Fangzhao Wu, Hongyan Xu, Wenjun Wang, Xing Xie

MVS-match: An Efficient Subsequence Matching Approach Based on the Series Synopsis

Subsequence matching is a fundamental task in mining time series data. The UCR Suite approach can deal with normalized subsequence matching problem (NSM), but it needs to scan full time series. In this paper, we propose to deal with the subsequence matching problem based on a simple series synopsis, the mean values of the disjoint windows. We propose a novel problem, named constrained normalized subsequence matching problem (cNSM), which adds some constraints to NSM problem. We propose a query processing approach, named MVS-match, to process the cNSM query efficiently. The experimental results verify the effectiveness and efficiency of our approach.

Kefeng Feng, Jiaye Wu, Peng Wang, Ningting Pan, Wei Wang
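
The series synopsis mentioned in the abstract above, the mean values of disjoint windows, can be sketched in a few lines. This shows only the synopsis and not the MVS-match query processing; the function name, the window length and the decision to drop a trailing partial window are assumptions made for illustration.

```python
from typing import List


def window_means(series: List[float], w: int) -> List[float]:
    """Mean of each disjoint window of length w.

    Assumption: a trailing window shorter than w is dropped.
    """
    if w <= 0:
        raise ValueError("window length must be positive")
    n_windows = len(series) // w
    return [sum(series[i * w:(i + 1) * w]) / w for i in range(n_windows)]


# Hypothetical time series with 7 points and window length 2.
series = [1.0, 3.0, 2.0, 4.0, 10.0, 12.0, 5.0]
synopsis = window_means(series, 2)
```

The synopsis is roughly `w` times smaller than the series, which is what makes it attractive as a filter before any full-resolution comparison.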

Spatial-Temporal Recommendation for On-demand Cinemas

The on-demand cinema is an emerging offline entertainment venue, guiding a new mode of watching movies in recent years. As a breakthrough in the development of the post-movie industry, on-demand cinemas rely on private booths, high-quality hardware and rich movie resources to provide audiences with new and fresh watching experiences. The recommendation system for on-demand cinemas recommends to cinemas the movies that may be of interest to their potential audiences, and provides an individualized recommendation service for preparing the movie storage of on-demand cinemas to meet the audiences’ preferences and instant watching needs. The characteristics implied in the audience behaviors of on-demand cinemas make the recommendation method for them different from those for online videos, items in offline stores or a group of users. In this paper, we describe the challenges and build a system for this application scenario, which fuses the historical on-demand records of cinemas, the POI (Point of Interest) information around cinemas and the content descriptions of movies, and explores the temporal dynamics and spatial influences rooted in audience behaviors. A WeChat applet customized for on-demand cinema staff/hosts, as the client of our system, has been put into practice.

Taofeng Xue, Beihong Jin, Beibei Li, Kunchi Liu, Qi Zhang, Sihua Tian

Finding the Key Influences on the House Price by Finite Mixture Model Based on the Real Estate Data in Changchun

Nowadays it is difficult to analyze the development law of the real estate market. Moreover, predictable house prices and understandable key influences can help build a healthier real estate market. Therefore, we propose a model that predicts the house price and identifies the key influences on it. Our method is inspired by the finite mixture model (FMM) and information gain ratio (IGO). Specifically, we collect data that includes detailed information about houses and communities from Anjuke Inc., an online platform for house sales. Then, based on the data, we use clustering methods to find the range of the number of latent groups, which avoids a blind search for that number. Next, we use IGO to rank and weight the features, and we build a regression model based on the finite mixture model. Finally, the experimental results demonstrate the performance of our method in predicting house prices, and we identify key influences on house prices.

Xin Xu, Zeyu Huang, Jingyi Wu, Yanjie Fu, Na Luo, Weitong Chen, Jianan Wang, Minghao Yin

Semi-supervised Clustering with Deep Metric Learning

Semi-supervised clustering has attracted a lot of research interest due to its broad applications, and many methods have been presented. However, there is still much room for improvement: (1) how to learn more discriminative feature representations to assist the traditional clustering methods; (2) how to make use of both the labeled and unlabelled data simultaneously and effectively during the process of clustering. To address these issues, we propose a novel semi-supervised clustering method based on deep metric learning, namely SSCDML. By leveraging deep metric learning and semi-supervised learning effectively in a novel way, SSCDML dynamically updates unlabelled data to labeled data through the limited labeled samples and obtains more meaningful data features, which makes the classifier model more robust and the clustering results more accurate. Experimental results on the Mnist, YaleB, and 20 Newsgroups databases demonstrate the high effectiveness of our proposed approach.

Xiaocui Li, Hongzhi Yin, Ke Zhou, Hongxu Chen, Shazia Sadiq, Xiaofang Zhou

Spatial Bottleneck Minimum Task Assignment with Time-Delay

Spatial crowdsourcing (SC) services have become very popular, and one basic problem in SC is how to appropriately assign tasks to workers for a better user experience. Most existing research focuses on utilitarian optimization objectives for the benefit of the platform, such as maximizing the number of performed tasks, maximizing the total utility of the assignment, and minimizing the total cost to perform all tasks. However, users (i.e., task requesters and workers) usually care only about their own cost (i.e., each user hopes his/her cost in the assignment to be small) rather than those utilitarian optimization objectives. From the perspective of users, we propose an egalitarian version of the online task assignment problem in SC, namely Minimizing Bottleneck with Time-Delay in Spatial Crowdsourcing (MBTD-SC). We further devise a heuristic algorithm to solve it. Finally, we validate the effectiveness of the proposed algorithm on both synthetic and real datasets.

Long Li, Jingzhi Fang, Bowen Du, Weifeng Lv

A Mimic Learning Method for Disease Risk Prediction with Incomplete Initial Data

Huge amounts of electronic health records (EHRs) accumulated in recent years have provided a rich foundation for disease risk prediction. However, the challenging problems of incompleteness in raw data and interpretability of the prediction model have not been solved well so far. In this study, we present a mimic learning approach for disease risk prediction with a large ratio of missing values, called SR-DF, as one of the early attempts. Specifically, we adopt spectral regularization for incomplete medical data learning, with which the missingness among raw data can be more accurately measured and imputed. Moreover, by utilizing deep forest, we obtain an effective method that takes advantage of an interpretable and reliable model for disease risk prediction, which requires far fewer parameters and is less sensitive to parameter settings. As we report in the experiments, the proposed method outperforms the baselines and achieves relatively consistent and stable results.

Lin Yue, Haonan Zhao, Yiqin Yang, Dongyuan Tian, Xiaowei Zhao, Minghao Yin

Hospitalization Behavior Prediction Based on Attention and Time Adjustment Factors in Bidirectional LSTM

Predicting the future medical treatment behaviors of patients from historical health insurance data is an important research hotspot. The most important challenge of this issue is how to correctly model such temporal and high dimensional data to significantly improve the prediction performance. In this paper, we propose an Attention and Time adjustment factors based Bidirectional LSTM hospitalization behavior prediction model (ATB-LSTM). The model uses a hidden layer to preserve the impact state of medical visit sequences at different time on future prediction, and introduces the attention mechanism and the time adjustment factor to jointly determine the strength of the hidden state at different moments, which significantly improves the predictive performance of the model.

Lin Cheng, Yongjian Ren, Kun Zhang, Li Pan, Yuliang Shi

Modeling Item Categories for Effective Recommendation

One-class collaborative filtering and item cold-start are two of the most important and challenging problems in recommender systems. In this paper, we focus on addressing these two issues by taking item category information into account. Item categories embody rich information about product attributes, which are available in most E-commerce websites. However, existing methods usually ignore such information or utilize them at a shallow level. For example, the category information is used to regularize model parameters or to extract hand-crafted features. As a response, we propose to model users’ different preference spaces over different item domains. Specifically, we design a unified method called CatRec in order to model the complex interactions among a user, an item and the item’s category information. Empirically, our method consistently outperforms the state-of-the-art methods on two real-world datasets.

Bo Song, Yi Cao, Weike Pan, Congfu Xu

Distributed Reachability Queries on Massive Graphs

Reachability querying is one of the most fundamental graph problems, and it has many applications in graph analytics and processing. Although a substantial number of algorithms are proposed, most of them are centralized and cannot scale to massive graphs that are distributed across multiple data centers. In this paper, we study the problem of distributed reachability queries, and present an efficient distributed reachability index, called Parallel Vertex Labeling (PVL), together with two lemmas to accelerate the query. Using real datasets, we demonstrate that our solutions achieve better indexing performance and significantly outperform the state-of-the-art distributed techniques.

Tianming Zhang, Yunjun Gao, Congzheng Li, Congcong Ge, Wei Guo, Qiang Zhou

Edge-Based Shortest Path Caching in Road Networks

In this paper, we propose an edge-based shortest path cache that can efficiently handle large-scale path queries without needing any road information. We achieve this by designing a totally new edge-based path cache structure, an efficient R-tree-based cache lookup algorithm, and a greedy-based cache construction algorithm. Experiments on a real road network and real POI datasets are conducted, and the results show the efficiency of our proposed caching techniques.

Detian Zhang, An Liu, Gaoming Jin, Qing Li

Extracting Definitions and Hypernyms with a Two-Phase Framework

Extracting definition sentences and hypernyms is a key step in knowledge graph construction as well as many other NLP applications. In this paper, we propose a novel supervised two-phase machine learning framework to solve both tasks simultaneously. First, a joint neural network is trained to predict both definition sentences and hypernyms. Then a refinement model is utilized to further improve the performance of hypernym extraction. Experimental results show the effectiveness of our proposed framework on a well-known benchmark.

Yifang Sun, Shifeng Liu, Yufei Wang, Wei Wang

Tag Recommendation by Word-Level Tag Sequence Modeling

In this paper, we transform tag recommendation into a word-based text generation problem and introduce a sequence-to-sequence model. The model inherits the advantages of LSTM-based encoder for sequential modeling and attention-based decoder with local positional encodings for learning relations globally. Experimental results on Zhihu datasets illustrate the proposed model outperforms other state-of-the-art text classification based methods.

Xuewen Shi, Heyan Huang, Shuyang Zhao, Ping Jian, Yi-Kun Tang

A New Statistics Collecting Method with Adaptive Strategy

Collecting statistics is a time- and resource-consuming operation in distributed database systems. It is even more challenging to collect statistics efficiently without affecting system performance while maintaining correctness in a distributed environment. Traditional strategies usually consider only one dimension when collecting statistics, which lacks generality. In this paper, we propose a new statistics collecting method with adaptive strategy (APCS), which balances collecting efficiency, correctness of statistics and the effect on system performance. APCS picks an appropriate time to trigger the collecting action and filters unnecessary tasks, while reasonably allocating collecting tasks to appropriate executing locations with the right executing model.

Jin-Tao Gao, Wen-Jie Liu, Zhan-Huai Li, Hong-Tao Du, Ou-Ya Pei

Word Sense Disambiguation with Massive Contextual Texts

Word sense disambiguation is crucial in natural language processing. Both unsupervised knowledge-based and supervised methodologies try to disambiguate ambiguous words through context. However, they both suffer from data sparsity, a common problem in natural language. Furthermore, supervised methods have previously been limited in all-word WSD tasks. This paper attempts to collect all publicly available contexts to enrich the ambiguous word’s sense representation and applies these contexts to the simplified Lesk and our M-IMS systems. Evaluations performed on the concatenation of several benchmark fine-grained all-word WSD datasets show that the simplified Lesk improves significantly, by 9.4%, and that our M-IMS shows some improvement as well.

Ya-fei Liu, Jinmao Wei

Learning DMEs from Positive and Negative Examples

The presence of a schema for XML documents has numerous advantages. Unfortunately, many XML documents in practice are not accompanied by a (valid) schema. Therefore, it is essential to devise algorithms to infer schemas from XML documents. The fundamental task in XML schema inference is learning regular expressions. In this paper we consider unordered XML, where the relative order among siblings is ignored, and focus on the subclass called disjunctive multiplicity expressions (DMEs), which were proposed for unordered XML. Previous work in this direction lacks inference algorithms that support learning DMEs from both positive and negative examples. In this paper, we provide an algorithm based on genetic algorithms to learn DMEs from both positive and negative examples.

Yeting Li, Chunmei Dong, Xinyu Chu, Haiming Chen

Serial and Parallel Recurrent Convolutional Neural Networks for Biomedical Named Entity Recognition

Identifying named entities from unstructured biomedical text is an important part of information extraction. The irrelevant words in long biomedical sentences and the complex composition of the entity make LSTM used in the general domain less effective. We find that emphasizing the local connection between words in a biomedical entity can improve performance. Based on the above observation, this paper proposes two novel neural network architectures combining bidirectional LSTM and CNN. In the first architecture S-CLSTM, a CNN structure is built on the top of bidirectional LSTM to keep both long dependencies in a sentence and local connection between words. The second architecture P-CLSTM combines bidirectional LSTM and CNN in parallel with the weighted loss to take advantage of the complementary features of two networks. Experimental results indicate that our architectures achieve significant improvements compared with baselines and other state-of-the-art approaches.

Qianhui Lu, Yunlai Xu, Runqi Yang, Ning Li, Chongjun Wang

DRGAN: A GAN-Based Framework for Doctor Recommendation in Chinese On-Line QA Communities

Recently, more and more people choose to seek health-related information in on-line health QA communities. Doctor recommendation is essential for users in these communities since it is difficult for them to find a proper doctor without assistance from medical staff. In this paper, we develop a Generative Adversarial Nets (GANs)-based doctor recommendation framework utilizing data in Chinese on-line QA communities. We conduct extensive sets of experiments on a real-world dataset. The experimental results show that our framework significantly outperforms the state-of-the-art baselines.

Bing Tian, Yong Zhang, Xinhuan Chen, Chunxiao Xing, Chao Li

Attention-Based Abnormal-Aware Fusion Network for Radiology Report Generation

Radiology report writing is error-prone, time-consuming and tedious for radiologists. Medical reports are usually dominated by a large number of normal findings, and the abnormal findings are few but more important. Current report generation methods often fail to depict these prominent abnormal findings. In this paper, we propose a model named Attention-based Abnormal-Aware Fusion Network (A3FN). We break down sentence generation into abnormal and normal sentence generation through a high-level gate module. We also adopt a topic guide attention mechanism for better capturing visual details and develop a context-aware topic vector to model cross-sentence topic coherence. Experiments on real radiology image datasets demonstrate the effectiveness of our proposed method.

Xiancheng Xie, Yun Xiong, Philip S. Yu, Kangan Li, Suhua Zhang, Yangyong Zhu

LearningTour: A Machine Learning Approach for Tour Recommendation Based on Users’ Historical Travel Experience

Designing tour routes is a non-trivial step for tourists who want to take an excursion in a place they are not familiar with. For most tourists this problem represents an excruciating challenge due to such unfamiliarity. Few existing works focus on using other tourists’ experiences in the city to recommend a personalized route for newcomers. To take full advantage of tourists’ historical routes in route recommendation, we propose LearningTour, a model that recommends routes by learning how other tourists have traveled in the city before. Given that a tourist’s route is actually a special variant of a time sequence, we treat such a route as a special language and thus treat the recommendation process as a unique translation process. Therefore, we use a sequence-to-sequence (seq2seq) model to carry out such learning and make the recommendation. This model comprises an encoder and a decoder. The encoder encodes users’ interest into the context vector and the decoder decodes the vector into the generated route. Finally, we implement our model on several real datasets and demonstrate its efficiency.

Zhaorui Li, Yuanning Gao, Xiaofeng Gao, Guihai Chen

TF-Miner: Topic-Specific Facet Mining by Label Propagation

Mining facets of topics is an essential task nowadays. Facet heterogeneity and the long-tail characteristic of information make facet mining tasks difficult. In this paper, we propose a weakly supervised approach, called Topic-specific Facet (TF)-Miner, to mine TFs automatically by a Label Propagation algorithm (LPA). The process of propagation helps us mine complete facet sets. Experiments on several real-world datasets show that TF-Miner achieves better performance than facet mining approaches that rely on the texts only.

Zhaotong Guo, Bifan Wei, Jun Liu, Bei Wu

Fast Raft Replication for Transactional Database Systems over Unreliable Networks

Raft, a consensus algorithm, has been widely used in many open source database systems to enhance availability and guarantee consistency. However, due to the constraint of coherent log entries, transactional database systems adopting Raft replication do not perform well in unreliable network environments. This is because, with the relatively frequent occurrence of network failures, the serial log replication (which is guaranteed by the log coherency) can block the commit of transactions. In this paper, we propose the fast Raft replication (FRaft) protocol. FRaft adopts the term coherency property, which has a good tolerance for unstable networks. Meanwhile, FRaft can be implemented easily by extending the basic Raft. Our experimental results show that our replication scheme has better throughput.

Peng Cai, Jinwei Guo, Huan Zhou, Weining Qian, Aoying Zhou

Parallelizing Big De Bruijn Graph Traversal for Genome Assembly on GPU Clusters

De Bruijn graph traversal is a critical step in de novo assemblers. It uses the graph structure to analyze genome sequences and is both memory space intensive and time consuming. To improve the efficiency, we develop ParaGraph, which parallelizes De Bruijn graph traversal on a cluster of GPU-equipped computer nodes. With effective vertex partitioning and fine-grained parallel algorithms, ParaGraph utilizes all cores of each CPU and GPU, all CPUs and GPUs in a computer node, and all computer nodes of a cluster. Our results show that ParaGraph is able to traverse billion-node graphs within three minutes on a cluster of six GPU-equipped computer nodes. It is an order of magnitude faster than the state-of-the-art shared memory based assemblers, and more than five times faster than the current distributed assemblers.

Shuang Qiu, Zonghao Feng, Qiong Luo

GScan: Exploiting Sequential Scans for Subgraph Matching

Subgraph matching enumerates all the subgraphs of a graph that are isomorphic to the query graph. It is a critical component of many applications such as clustering coefficient computation and trend evolution. As real-world graphs grow explosively, we have massive graphs that are much larger than the memory size of modern machines. Therefore, in this paper, we study the subgraph matching problem where the graph is stored on disk. Different from the existing approaches, we design a block-based approach, $$\mathsf{GScan}$$, which investigates the scheduling of the blocks transferred between the memory and the disk. To achieve high I/O efficiency, $$\mathsf{GScan}$$ uses only sequential I/O read operations. We conduct experimental studies to demonstrate the efficiency of our block-based approach.

Zhiwei Zhang, Hao Wei, Jianliang Xu, Byron Choi

SIMD Accelerates the Probe Phase of Star Joins in Main Memory Databases

In main memory databases, joins on star schema tables account for the majority of execution time, which is dominated by the expensive probe phase. In this paper, we vertically or horizontally vectorize the probe phase using SIMD. In addition, we speed up the vectorized probe by prefetching. As our results show, the vertically vectorized integrated probe is up to 2.19X (2.63X) faster than its scalar version, as well as 3.24X (2.74X) faster than the traditional execution based on right-deep-tree plans on CPUs (co-processors).

Zhuhe Fang, Zeyu He, Jiajia Chu, Chuliang Weng

A Deep Recommendation Model Incorporating Adaptive Knowledge-Based Representations

Deep neural networks (DNNs) have been widely introduced into collaborative filtering (CF) based recommender systems and have yielded remarkable results, but most models perform weakly in the scenario of sparse user-item interactions. To address this problem, we propose a deep knowledge-based recommendation model in which item knowledge distilled from open knowledge graphs and user information are both incorporated to extract sufficient features. Moreover, our model compresses features by a convolutional neural network and adopts a memory-enhanced attention mechanism to generate adaptive user representations based on the most recently interacted items rather than all historical records. Our extensive experiments conducted on a real-world dataset demonstrate our model’s remarkable superiority over some state-of-the-art deep models.

Chenlu Shen, Deqing Yang, Yanghua Xiao

BLOMA: Explain Collaborative Filtering via Boosted Local Rank-One Matrix Approximation

Matrix Approximation (MA) is a powerful technique in recommendation systems. There are two main problems in the prevalent MA framework. First, the latent factors are not explainable, which hampers the understanding of the reasons behind recommendations. Second, traditional MA methods produce user/item factors globally, which fails to capture the idiosyncrasies of users/items. In this paper, we propose a model called Boosted Local rank-One Matrix Approximation (BLOMA). The core idea is to locally and sequentially approximate the residual matrix (which represents the unexplained part obtained from the previous stage) by rank-one sub-matrix factorization. The resulting factors are distinct and explainable by leveraging social networks and item attributes.

Chongming Gao, Shuai Yuan, Zhong Zhang, Hongzhi Yin, Junming Shao
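
The idea of sequentially approximating a residual matrix by rank-one factors, as described in the BLOMA abstract above, can be illustrated with a minimal sketch. This is not the authors' method: it fits each rank-one factor globally by alternating least squares, whereas BLOMA works on local sub-matrices and uses social networks and item attributes. The function names `rank_one_fit` and `boosted_rank_one` are hypothetical.

```python
def rank_one_fit(matrix, iterations=50):
    """Approximate matrix by the outer product of u and v (alternating least squares)."""
    rows, cols = len(matrix), len(matrix[0])
    # Start from the row with the largest norm (an illustrative choice).
    v = list(max(matrix, key=lambda row: sum(x * x for x in row)))
    u = [0.0] * rows
    for _ in range(iterations):
        vv = sum(x * x for x in v)
        if vv == 0.0:
            break  # nothing left to fit
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) / vv for i in range(rows)]
        uu = sum(x * x for x in u)
        if uu == 0.0:
            break
        v = [sum(matrix[i][j] * u[i] for i in range(rows)) / uu for j in range(cols)]
    return u, v


def boosted_rank_one(matrix, stages):
    """Fit one rank-one factor per stage to the residual left by earlier stages."""
    residual = [list(row) for row in matrix]
    factors = []
    for _ in range(stages):
        u, v = rank_one_fit(residual)
        factors.append((u, v))
        # Subtract the explained part; the next stage sees only what is unexplained.
        residual = [
            [residual[i][j] - u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))
        ]
    return factors, residual
```

For a matrix of rank two, such as `[[3.0, 1.0], [1.0, 3.0]]`, two stages should leave a residual close to zero, since each stage removes one rank-one component.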

Spatiotemporal-Aware Region Recommendation with Deep Metric Learning

Personalized point-of-interest (POI) recommendation is an important basis for location-based services. A typical application scenario is to recommend a region with reliable POIs to a user when he/she travels to an unfamiliar area without any background knowledge. In this study, we explore spatiotemporal-aware region recommendation to manage this learning task. We propose a unified deep learning model that comprehensively incorporates dynamic personal and global user preferences across regions, along with spatiotemporal dependencies, into check-in region history. We model and fuse user preferences through a pyramidal ConvLSTM component, and capture the dynamic region attributes through a recurrent component. The two components are seamlessly assembled in a unified framework to yield next-time region recommendation. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed model.

Hengpeng Xu, Yao Zhang, Jinmao Wei, Zhenglu Yang, Jun Wang

On the Impact of the Length of Subword Vectors on Word Embeddings

This paper hypothesizes that better word embeddings can be learned by representing words and subwords by vectors of different lengths. To investigate the impact of the length of subword vectors on word embeddings, this paper proposes a model based on the Subword Information Skip-gram model. The experiments on two datasets with respect to two tasks show that the proposed model outperforms 6 baselines, which confirms the aforementioned hypothesis. In addition, we also observe that, within a specific range, a higher dimensionality of subword vectors always improves the quality of word embeddings.

Xiangrui Cai, Yonghong Luo, Ying Zhang, Xiaojie Yuan

Using Dilated Residual Network to Model Distantly Supervised Relation Extraction

Distantly supervised relation extraction has been widely used to find relational facts in text. However, distant supervision inevitably brings in noise that can lead to a bad relation contextual representation. In this paper, we propose a deep dilated residual network (DRN) model to address the noise in distantly supervised relation extraction. Specifically, we design a module which employs dilated convolution in cascade to capture multi-scale context features by adopting multiple dilation rates. By combining them with residual learning, the model is more powerful than traditional CNN models. Our model significantly improves the performance for distantly supervised relation extraction on the large NYT-Freebase dataset compared to various baselines.

Lei Zhan, Yan Yang, Pinpin Zhu, Liang He, Zhou Yu
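
To make the terms in the DRN abstract above concrete, here is a minimal sketch of a 1-D dilated convolution and of a cascade with residual (skip) connections. It is only an illustration of the general technique, not the authors' architecture: the function names are hypothetical, the kernel is fixed rather than learned, and an odd kernel length is assumed so that zero padding preserves the sequence length.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Valid (unpadded) 1-D dilated convolution in cross-correlation form."""
    span = (len(kernel) - 1) * dilation  # distance covered by the kernel taps
    return [
        sum(kernel[k] * signal[i + k * dilation] for k in range(len(kernel)))
        for i in range(len(signal) - span)
    ]


def cascade_with_residual(signal, kernel, dilations):
    """Apply dilated convolutions in cascade, adding the input back after each layer.

    Assumes an odd kernel length, so symmetric zero padding keeps the length fixed.
    """
    x = list(signal)
    for d in dilations:
        pad = (len(kernel) - 1) * d // 2
        padded = [0.0] * pad + x + [0.0] * pad
        y = dilated_conv1d(padded, kernel, d)
        x = [a + b for a, b in zip(x, y)]  # residual connection
    return x
```

With kernel `[1, 0, -1]`, a dilation of 1 compares each element with the one two positions ahead, while a dilation of 2 compares elements four positions apart. Larger dilation rates widen the context without adding kernel weights.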

Modeling More Globally: A Hierarchical Attention Network via Multi-Task Learning for Aspect-Based Sentiment Analysis

Aspect-based sentiment analysis (ABSA) is a fine-grained sentiment analysis problem, which has attracted much attention in recent years. Previous methods mainly focus on employing attention mechanisms to model the relationship between aspects and context words. However, these methods tend to ignore the overall semantics of the sentence and the dependency among the aspect terms. In this paper, we propose a Hierarchical Attention Network (HAN) to solve the aforementioned issues simultaneously. Experimental results on standard SemEval 2014 datasets demonstrate the effectiveness of the proposed model.

Xiangying Ran, Yuanyuan Pan, Wei Sun, Chongjun Wang

A Scalable Sparse Matrix-Based Join for SPARQL Query Processing

In this paper, we present gSMat, a SPARQL query engine for RDF datasets. It exploits join optimization and data sparsity. gSMat consists of three submodules. First, SM Storage (Sparse Matrix-based Storage) improves storage efficiency by storing only valid edges, introduces a predicate-based hash index on the storage, and generates a statistics file for optimization. Second, the Query Planner comprises a Query Parser and a Query Optimizer: the Query Parser parses a SPARQL query and transforms it into a query graph, and the Query Optimizer generates the optimal query plan based on statistical input from SM Storage. Third, the Query Executor executes the query in an efficient manner. Finally, gSMat is evaluated against well-known approaches such as gStore and RDF3X on very large datasets (over 500 million triples), and is shown to be significantly more efficient and scalable.

Xiaowang Zhang, Mingyue Zhang, Peng Peng, Jiaming Song, Zhiyong Feng, Lei Zou

Change Point Detection for Streaming High-Dimensional Time Series

An important task in analysing high-dimensional time series data generated from sensors in the Internet of Things (IoT) platform is to detect changes in the statistical properties of the time series. Accurate, efficient and near real-time detection of change points in such data is challenging due to its streaming nature and the presence of irrelevant time series dimensions. In this paper, we propose an unsupervised change point detection method, based on Information Gain and a permutation test, that does not require a user-defined threshold on change point scores and can accurately identify changes in a sequential setting using only a fixed short memory. Experimental results show that our efficient method improves the accuracy of change point detection compared to two benchmark methods.

Masoomeh Zameni, Zahra Ghafoori, Amin Sadri, Christopher Leckie, Kotagiri Ramamohanarao
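
The role a permutation test can play in change point detection, as mentioned in the abstract above, can be sketched for a single dimension. This is a simplified illustration, not the authors' method: it uses the gap between window means as the test statistic instead of Information Gain, and the function names and parameters are hypothetical. The point is that significance is judged against random relabelings of the data rather than against a hand-tuned score threshold.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def permutation_p_value(before, after, n_permutations=1000, seed=0):
    """Estimate how often a random split yields a mean gap at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(mean(after) - mean(before))
    pooled = list(before) + list(after)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        left, right = pooled[:len(before)], pooled[len(before):]
        if abs(mean(right) - mean(left)) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)
```

A small p-value for two adjacent windows suggests that the boundary between them is a change point. Two identical windows give a p-value of 1.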

Demos

Frontmatter

Distributed Query Engine for Multiple-Query Optimization over Data Stream

Query processing over data streams has attracted much attention in real-time applications. While much effort has been devoted to query processing of data streams in distributed environments, no previous study has focused on multiple-query optimization. To address this problem, we propose EsperDist, a distributed query engine for multiple-query optimization over data streams. EsperDist can significantly reduce the overhead of network transmission and memory usage by reusing operators in the query plan. Moreover, EsperDist also makes a best effort to minimize the query cost so as to avoid a resource bottleneck in a single machine. In this demo, we will present the architecture and workflow of EsperDist using datasets collected from real-world applications. We also propose a user-friendly interface to monitor query results and interact with the system in real time.

Junye Yang, Yong Zhang, Jin Wang, Chunxiao Xing

Adding Value by Combining Business and Sensor Data: An Industry 4.0 Use Case

Industry 4.0 and the Internet of Things are recent developments that have led to the creation of new kinds of manufacturing data. Linking this new kind of sensor data to traditional business information is crucial for enterprises to take advantage of the data’s full potential. In this paper, we present a demo which allows experiencing this data integration, both vertically between technical and business contexts and horizontally along the value chain. The tool simulates a manufacturing company, continuously producing both business and sensor data, and supports issuing ad-hoc queries that answer specific questions related to the business. In order to adapt to different environments, users can configure sensor characteristics to their needs.

Guenter Hesse, Christoph Matthies, Werner Sinzig, Matthias Uflacker

AgriKG: An Agricultural Knowledge Graph and Its Applications

Recently, with the development of information and intelligent technology, agricultural production and management have been significantly boosted. However, they still face considerable challenges in effectively integrating large amounts of fragmented information for downstream applications. To this end, in this paper, we propose an agricultural knowledge graph, namely AgriKG, to automatically integrate the massive agricultural data from the Internet. By applying NLP and deep learning techniques, AgriKG can automatically recognize agricultural entities from unstructured text and link them to form a knowledge graph. Moreover, we illustrate typical scenarios of AgriKG and validate it by real-world applications, such as agricultural entity retrieval and agricultural question answering.

Yuanzhe Chen, Jun Kuang, Dawei Cheng, Jianbin Zheng, Ming Gao, Aoying Zhou

KGVis: An Interactive Visual Query Language for Knowledge Graphs

With the rise of artificial intelligence, knowledge graphs have been widely recognized as a cornerstone of AI. In recent years, more and more domains have been publishing knowledge graphs at different scales. However, it is difficult for end-users to query and understand those knowledge graphs consisting of hundreds of millions of nodes and edges. To improve the availability, accessibility, and usability of knowledge graphs, we have developed an interactive visual query language, called KGVis, which can guide end-users to gradually transform query patterns into query results. Furthermore, KGVis has realized the novel capability of flexible bidirectional transformations between query patterns and query results, which can significantly assist end-users in querying large-scale knowledge graphs that they are not familiar with. In this paper, we present the syntax and semantics of KGVis, discuss our design rationale behind this interactive visual query language, and demonstrate various use cases of KGVis.

Xin Wang, Qiang Fu, Jianqiang Mei, Jianxin Li, Yajun Yang

OperaMiner: Extracting Character Relations from Opera Scripts Using Deep Neural Networks

Retrieving character relations from opera scripts helps performers and audiences accurately understand the features and behavior of roles. Meanwhile, discovering the evolution of character relations in an opera benefits many opera-oriented story exploration tasks. Aiming to automatically extract relations among opera characters, we demonstrate a prototype system named OperaMiner, which extracts relations for opera characters based on a hybrid deep neural network. The major features of OperaMiner are: (1) It provides a uniform reasoning framework for character relations considering language structure information as well as explicit and implicit expressions in opera scripts. (2) It explores the deep features in opera scripts, including character embedding features, word embedding features, and the linguistic features in artistic texts. (3) It presents a hybrid learning architecture enhancing CNN and Bi-LSTM with a CRF layer for character relation extraction. After a brief introduction to the architecture and key technologies of OperaMiner, we present a case study to demonstrate the main features of OperaMiner, including the generation of the character relation graph, the demonstration of major roles, and the evolution sequence of character relations.

Xujian Zhao, Xinnan Dai, Peiquan Jin, Hui Zhang, Chunming Yang, Bo Li

GparMiner: A System to Mine Graph Pattern Association Rules

With its rapid development, social network analysis has been receiving significant attention. One popular direction in the field is to mine graph-pattern association rules ($$\mathsf{GPARs}$$). In this demo, we present $$\mathsf{GparMiner}$$, a system for mining $$\mathsf{GPARs}$$ on big and distributed social networks. The system has the following characteristics: (1) it supports parallel mining computation to handle the sheer size of real-life social networks; (2) it provides a graphical interface to help users monitor the mining progress and better understand $$\mathsf{GPARs}$$.

Xin Wang, Yang Xu, Ruocheng Zhao, Junjie Lin, Huayi Zhan

A Data Publishing System Based on Privacy Preservation

For data openness and sharing, we need to publish data and protect sensitive data at the same time. This paper provides users with a system for privacy-preserving data publishing, which is implemented based on differential privacy. It has the following characteristics: (1) the raw data are first imported into a database and then used to generate synthetic data for publishing; (2) a user can choose different privacy preservation levels for the synthetic data; (3) a subset of the attributes can be chosen to be synthesized while keeping the others untouched.

Zhihui Wang, Yun Zhu, Xuchen Zhou

Privacy as a Service: Publishing Data and Models

The main obstacle to the development of sustainable and productive ecosystems leveraging data is the unavailability of robust, reliable and convenient privacy management tools and services. We propose to demonstrate our Privacy-as-a-Service system and Liánchéng, the Cloud system that hosts it. We consider not only the publication of data but also that of models created by parametric and non-parametric statistical machine learning algorithms. We illustrate the construction and execution of privacy preserving workflows using real-world datasets.

Ashish Dandekar, Debabrota Basu, Thomas Kister, Geong Sen Poh, Jia Xu, Stéphane Bressan

Dynamic Bus Route Adjustment Based on Hot Bus Stop Pair Extraction

The crowdedness of buses caused by limited public transportation capacity has already severely affected the convenience and comfort of residents’ trips. Existing measures, such as reducing dispatching intervals and adding more buses, can ease this situation but aggravate traffic jams. To address the issue of inconvenient public transit characterized by packed buses, we propose a data-driven route adjustment framework, called Dynamic Bus Line Adjustment System (DBLAS), to recommend a new operating route for an existing bus line by building a direct route between extracted hot bus stop pairs. DBLAS mainly involves extracting hot bus stop pairs based on passenger volume estimation, and planning the optimal route between a hot bus stop pair using taxi traces. Finally, we develop a demo system to demonstrate the effectiveness of DBLAS.

Jiaye Liu, Jiali Mao, YunTao Du, Lishen Zhao, Zhao Zhang

DHDSearch: A Framework for Batch Time Series Searching on MapReduce

We present DHDSearch, a framework for distributed batch time series searching on MapReduce. DHDSearch is based on a two-layer DHDTree. The upper DHDTree serves as a route tree to distribute the time series, while the lower DHDTrees serve the batch searching in parallel. Compared with traditional time series searching methods, DHDSearch has better scalability and efficiency.

Zhongsheng Li, Qiuhong Li, Wei Wang, Yang Wang, Yimin Liu

Bus Stop Refinement Based on Hot Spot Extraction

During rush hour, numerous residents travel to their destinations by multi-modal transfers (e.g., bus and taxi) due to a lack of direct buses, which sharply increases trip expenses and even causes heavy traffic. The root of such inconvenience in bus service is that obsolete and incorrect bus stop information cannot satisfy residents’ time-dependent travel demand. In this work, we put forward a framework, called BSRF, to optimize existing bus routes using bus stops mined from the trajectory data of short-haul taxi orders, including identifying candidate bus stops based on hot drop-off points and matching new bus stops with existing bus lines. We build a demo system to showcase the effectiveness of BSRF, which can offer reliable suggestions on bus stop setting for public transport companies.

Yilian Xin, Jiali Mao, Simin Yu, Minxi Li, Cheqing Jin

Adaptive Transaction Scheduling for Highly Contended Workloads

Traditional transaction scheduling mechanisms, which are a key component of database systems, greatly degrade the performance of concurrency control under highly contended workloads. There are two effective methods to address this issue: (1) avoiding concurrent transactions that access the same high-contention tuple at the same time; (2) accelerating the execution of these high-contention transactions. In this demonstration, we present a new transaction scheduling mechanism, which aims to achieve the above goals. An adaptive group of first-class queues is introduced, where each queue is allocated to a specified worker thread and takes charge of transactions accessing specified high-contention tuples. We implement a system prototype and demonstrate that our transaction scheduling mechanism can effectively reduce the abort ratio of high-contention transactions and improve the system throughput dramatically.

Jixin Wang, Jinwei Guo, Huan Zhou, Peng Cai, Weining Qian
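
The queue-per-worker idea in the abstract above can be sketched as a simple router that pins each high-contention key to one queue, so that transactions touching the same hot tuple are serialized on one worker instead of aborting each other. This is a hypothetical illustration, not the authors' prototype: the class name `HotKeyRouter` is invented, the set of hot keys is fixed rather than adaptive, and a transaction touching several hot keys simply follows the first one.

```python
from collections import deque


class HotKeyRouter:
    """Route transactions to per-worker queues based on the hot keys they access."""

    def __init__(self, n_workers, hot_keys):
        self.queues = [deque() for _ in range(n_workers)]
        # Pin each hot key to exactly one queue (and hence one worker).
        self.owner = {key: i % n_workers for i, key in enumerate(sorted(hot_keys))}
        self._next = 0  # round-robin cursor for transactions without hot keys

    def submit(self, txn_id, keys):
        hot = [k for k in keys if k in self.owner]
        if hot:
            target = self.owner[hot[0]]  # simplification: follow the first hot key
        else:
            target = self._next
            self._next = (self._next + 1) % len(self.queues)
        self.queues[target].append(txn_id)
        return target
```

For example, with two workers and hot keys `"a"` and `"b"`, every transaction that accesses `"a"` lands in queue 0, while transactions without hot keys are spread round-robin.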

IMOptimizer: An Online Interactive Parameter Optimization System Based on Big Data

Intelligent manufacturing is a typical application of big data analysis. A flexible production line is an essential foundation of intelligent manufacturing. Producing different types of similar products alternately in one line with fixed stations but varying parameters is a typical kind of flexibility. In this case, the quality of products is directly determined by the parameter setting. However, the relation between parameters and product quality is too complicated to model. Consequently, the current solution is to tune the parameters manually, which relies heavily on expertise and is very costly. Inspired by recommender systems, we develop IMOptimizer, a novel online interactive processing parameter setting system. IMOptimizer is configurable, interactive and highly efficient, and has a friendly UI. To the best of our knowledge, our system is the first big-data-driven generic platform focusing on online process optimization. In this demonstration, we will present our prototype.

Zhiyu Liang, Hongzhi Wang, Jianzhong Li, Hong Gao

Backmatter
