
Database Systems for Advanced Applications

25th International Conference, DASFAA 2020, Jeju, South Korea, September 24–27, 2020, Proceedings, Part III


About this book

The three-volume set LNCS 12112-12114 constitutes the papers of the 25th International Conference on Database Systems for Advanced Applications, DASFAA 2020, held online in September 2020.

The 119 full papers, together with 19 short papers, 15 demo papers, and 4 industrial papers presented in this set, were carefully reviewed and selected from a total of 487 submissions.

The conference program presents state-of-the-art R&D activities in database systems and their applications, and provides a forum for technical presentations and discussions among database researchers, developers, and users from academia, business, and industry.

Table of Contents

Frontmatter

Social Network

Frontmatter
Sequential Multi-fusion Network for Multi-channel Video CTR Prediction

In this work, we study video click-through rate (CTR) prediction, which is crucial for the refinement of video recommendation and the revenue of video advertising. Existing studies have verified the importance of modeling users’ clicked items as their latent preference for general click-through rate prediction. However, all of the clicked items are treated equally in the input stage, which is not the case in online video platforms. This is because each video belongs to one of multiple channels (e.g., TV and MOVIES), and thus has a different impact on the prediction for candidate videos from a given channel. To this end, we propose a novel Sequential Multi-Fusion Network (SMFN) that classifies all the channels into two categories: (1) the target channel, to which the current candidate videos belong, and (2) the context channel, which comprises all the remaining channels. For each category, SMFN leverages a recurrent neural network to model the corresponding clicked video sequence. The hidden interactions between the two categories are characterized by correlating each video of a sequence with the overall representation of the other sequence through a simple but effective fusion unit. Experimental results on real datasets collected from a commercial online video platform demonstrate that the proposed model outperforms several strong alternative methods.

Wen Wang, Wei Zhang, Wei Feng, Hongyuan Zha
Finding Attribute Diversified Communities in Complex Networks

Recently, finding communities by considering both structure cohesiveness and attribute cohesiveness has begun to generate considerable interest. However, existing works only consider attribute cohesiveness from the perspective of attribute similarity. No work has considered finding communities with attribute diversity, which is useful in many applications. In this paper, we study the problem of searching for attribute diversified communities in complex networks. We propose a model for attribute diversified communities and investigate the problem of attribute diversified community search based on k-core. We first prove the NP-hardness of the problem, and then propose efficient branch and bound algorithms with novel, effective bounds. Experiments performed on various complex network datasets demonstrate the efficiency and effectiveness of our algorithms for finding attribute diversified communities and confirm the significance of our study.

Afzal Azeem Chowdhary, Chengfei Liu, Lu Chen, Rui Zhou, Yun Yang
Business Location Selection Based on Geo-Social Networks

Location has a great impact on the success of many businesses. Existing works typically assess the goodness of a business location by the number of customers who are its Reverse Nearest Neighbors (RNNs). However, with the prevalence of word-of-mouth marketing in social networks, a business can now exploit social influence to attract a large number of customers, even if it is not located in the popular but unaffordable business districts with the most RNNs. In this paper, we propose a novel Business Location Selection (BLS) approach that integrates the factors of both social influence and geographical distance. Firstly, we formally define a BLS model based on relative distance aware influence maximization in geo-social networks, where the goodness of a location is assessed by the maximum number of social network users it can influence via online propagation. To the best of our knowledge, it is the first BLS model that adopts influence maximization techniques. Then, to speed up the selection, we present two sophisticated candidate location pruning strategies, and extend the Reverse Influence Sampling (RIS) algorithm to select seeds for multiple locations, thereby avoiding redundant computation. Lastly, we demonstrate the effectiveness and efficiency of our approach through experiments on three real geo-social networks.

Qian Zeng, Ming Zhong, Yuanyuan Zhu, Jianxin Li
SpEC: Sparse Embedding-Based Community Detection in Attributed Graphs

Community detection, also known as graph clustering, is a widely studied task of finding the subgraphs (communities) of related nodes in a graph. Existing methods based on non-negative matrix factorization can solve both non-overlapping and overlapping community detection, but the probability vector obtained by factorization is too dense and ambiguous, and the differences between these probabilities are too small to judge which community the corresponding node belongs to. This leads to a lack of interpretability and poor performance in community detection. Besides, there are often many sparse subgraphs in a graph, which cause unstable iterations. Accordingly, we propose SpEC (Sparse Embedding-based Community detection) to solve the above problems. First, sparse embeddings have stronger interpretability than dense ones. Second, sparse embeddings consume less space. Third, sparse embeddings can be computed more efficiently. For traditional matrix factorization-based models, the iterative update rules do not guarantee convergence for sparse embeddings. SpEC carefully designs its update rules to ensure convergence and efficiency for sparse embeddings. Crucially, SpEC takes full advantage of attributed graphs and learns neighborhood patterns, which capture the inherent relationships between node attributes and topological structure. Using coupled recurrent neural networks, SpEC recovers missing edges and predicts the relationship between pairs of nodes. In addition, SpEC ensures stable convergence and improved performance. The experimental results show that our model outperforms other state-of-the-art community detection methods.

Huidi Chen, Yun Xiong, Changdong Wang, Yangyong Zhu, Wei Wang
MemTimes: Temporal Scoping of Facts with Memory Network

This paper addresses temporal scoping, i.e., adding time intervals to facts in Knowledge Bases (KBs). Existing methods for temporal scope inference and extraction still suffer from low accuracy. In this paper, we propose a novel neural model based on Memory Network that performs temporal reasoning among sentences for the purpose of temporal scoping. We design proper ways to encode both the semantic and temporal information contained in the mention set of each fact, which enables temporal reasoning with Memory Network. We also find ways to remove the effect of noisy sentences, which further improves the robustness of our approach. The experiments show that this solution is highly effective for detecting the temporal scope of facts.

Siyuan Cao, Qiang Yang, Zhixu Li, Guanfeng Liu, Detian Zhang, Jiajie Xu
Code2Text: Dual Attention Syntax Annotation Networks for Structure-Aware Code Translation

Translating source code into natural language text helps people understand computer programs better and faster. Previous code translation methods mainly exploit human-specified syntax rules. Since handcrafted syntax rules are expensive to obtain and not always available, a PL-independent automatic code translation method is much more desirable. However, existing sequence translation methods generally regard source text as a plain sequence, which is not competent to capture the rich hierarchical characteristics that inherently reside in code. In this work, we exploit the abstract syntax tree (AST), which summarizes the hierarchical information of a code snippet, to build a structure-aware code translation method. We propose a syntax annotation network called Code2Text to incorporate both source code and its AST into the translation. Our Code2Text features dual encoders for the sequential input (code) and the structural input (AST), respectively. We also propose a novel dual-attention mechanism to guide the decoding process by accurately aligning the output words with both the tokens in the source code and the nodes in the AST. Experiments on a public collection of Python code demonstrate that Code2Text achieves better performance than several state-of-the-art methods, and that its generated text is accurate and human-readable.

Yun Xiong, Shaofeng Xu, Keyao Rong, Xinyue Liu, Xiangnan Kong, Shanshan Li, Philip Yu, Yangyong Zhu
Semantic Enhanced Top-k Similarity Search on Heterogeneous Information Networks

Similarity search on heterogeneous information networks has attracted wide attention from both industry and academia in recent years, for example, for friend detection in social networks and collaborator recommendation in coauthor networks. The structure information of a heterogeneous information network can be captured by multiple meta paths, and meta paths are usually utilized to design methods for similarity search. However, the rich semantics of a heterogeneous information network lies not only in its structure but also in the content stored in its nodes, and the content similarity of nodes is usually not valued by existing methods. Although some recent machine learning-based methods for similarity search consider both kinds of information, they use structure and content information separately. To address this issue by flexibly balancing the influence of structure and content information during the search, we propose a double-channel convolutional neural network model for top-k similarity search, which uses path instances as model inputs and generates structure and content embeddings for nodes based on different meta paths. Moreover, we utilize two attention mechanisms to enhance the differences between meta paths for each node and to combine the content and structure information of nodes into a comprehensive representation. The experimental results show that our search algorithm can effectively support top-k similarity search in heterogeneous information networks and achieves higher performance than existing approaches.

Minghe Yu, Yun Zhang, Tiancheng Zhang, Ge Yu
STIM: Scalable Time-Sensitive Influence Maximization in Large Social Networks

Influence maximization, which aims to select k seed users to influence the rest of the users maximally, is a fundamental problem in social networks. Due to its well-known NP-hardness, great efforts have been devoted to developing scalable algorithms in the literature. However, the scalability issue is still not well solved in the time-sensitive influence maximization problem, where propagation incurs a certain amount of time delay and is only valid before a deadline constraint, because all possible time delays need to be enumerated along each edge in a path to calculate the influence probability. Existing approaches usually adopt a path-based search strategy to enumerate all possible influence spreading paths, which is computationally expensive for large social networks. In this paper, we propose STIM, a novel scalable time-sensitive influence maximization method based on time-based search, which avoids a large number of repeated visits to the same subpaths and computes the influence probability more efficiently. Furthermore, based on time-based search, we also derive a new upper bound to estimate the marginal influence spread efficiently. Extensive experiments on real-world networks show that STIM is more space- and time-efficient than existing state-of-the-art methods while still preserving the influence spread quality in real-world large social networks.

Yuanyuan Zhu, Kailin Ding, Ming Zhong, Lijia Wei
Unsupervised Hierarchical Feature Selection on Networked Data

Networked data is commonly observed in many high-impact domains, ranging from social networks and collaboration platforms to biological systems. In such systems, the nodes are often associated with high-dimensional features while remaining connected to each other through pairwise interactions. Recently, various unsupervised feature selection methods have been developed to distill actionable insights from such data by finding a subset of relevant features that are highly correlated with the observed node connections. Although practically useful, those methods predominantly assume that the nodes on the network are organized in a flat structure, which is rarely the case in reality. In fact, the nodes in most, if not all, networks can be organized into a hierarchical structure. For example, in a collaboration network, researchers can be clustered into different research areas at the coarsest level and further divided into sub-areas at a finer level. Recent studies have shown that such hierarchical structure can help advance various learning problems, including clustering and matrix completion. Motivated by this success, in this paper we propose HNFS, a novel unsupervised feature selection framework for networked data. HNFS can simultaneously learn the implicit hierarchical structure among the nodes and embed the hierarchical structure into the feature selection process. Empirical evaluations on various real-world datasets validate the superiority of our proposed framework.

Yuzhe Zhang, Chen Chen, Minnan Luo, Jundong Li, Caixia Yan, Qinghua Zheng
Aspect Category Sentiment Analysis with Self-Attention Fusion Networks

Aspect category sentiment analysis (ACSA) is a subtask of aspect based sentiment analysis (ABSA). It aims to identify the sentiment polarities of predefined aspect categories in a sentence. ACSA has received significant attention in recent years owing to the vast amount of online reviews toward target products. Existing methods mainly make use of emerging architectures such as LSTM, CNN, and the attention mechanism to focus on the sentence spans that are informative for the aspect category. However, they do not pay much attention to the fusion of the aspect category and the corresponding sentence, which is important for the ACSA task. In this paper, we focus on the deep fusion of the aspect category and the corresponding sentence to improve the performance of sentiment classification. A novel model, named Self-Attention Fusion Networks (SAFN), is proposed. First, the multi-head self-attention mechanism is utilized to obtain attention feature representations of the sentence and the aspect category separately. Then, the multi-head attention mechanism is used again to deeply fuse these two attention feature representations. Finally, a convolutional layer is applied to extract informative features. We conduct experiments on a Chinese dataset collected from an online automotive product forum, and on a public English dataset, Laptop-2015 from SemEval 2015 Task 12. The experimental results demonstrate that our model achieves substantial improvements in both effectiveness and efficiency.

Zelin Huang, Hui Zhao, Feng Peng, Qinhui Chen, Gang Zhao

Query Processing

Frontmatter
A Partial Materialization-Based Approach to Scalable Query Answering in OWL 2 DL

This paper focuses on the efficient ontology-mediated querying (OMQ) problem. Compared with query answering in plain databases, which deals with fixed finite database instances, a key challenge in OMQ is dealing with the possibly infinite set of consequences entailed by the ontology, i.e., the so-called chase. To address this issue, existing techniques mostly avoid materializing the chase by query rewriting, which, however, comes at the cost of query rewriting and query evaluation at runtime, and the possibility of missing optimization opportunities at the data level. Instead, a pure materialization technology is adopted in this paper, under which query rewriting is unnecessary. A query analysis algorithm (QAA) is proposed to ensure the completeness and soundness of OMQ over partial materialization for rooted queries in $$\textit{DL-Lite}^{\mathcal {N}}_{\mathit{horn}}$$ . We also extend our method, soundly but incompletely, to OWL 2 DL. Finally, we implement our approach as a prototype system, SUMA, by integrating off-the-shelf efficient SPARQL query engines. The experiments show that SUMA is complete on each test ontology and each test query, the same as Pellet, and outperforms PAGOdA. In addition, SUMA is highly scalable on large datasets.

Xiaoyu Qin, Xiaowang Zhang, Muhammad Qasim Yasin, Shujun Wang, Zhiyong Feng, Guohui Xiao
DeepQT: Learning Sequential Context for Query Execution Time Prediction

Query execution time prediction is an important and challenging problem in database management systems. It is even more critical for a distributed database system, which must effectively schedule query jobs based on predicted execution times in order to maximize resource utilization and minimize users' waiting time. While a number of works have explored this problem, they mostly ignore the sequential context of query jobs, which may affect prediction performance significantly. In this work, we propose DeepQT, a novel deep learning framework for query execution time prediction, in which the sequential context of a query job and other features are learned simultaneously by jointly training a recurrent neural network and a deep feed-forward network. The results of experiments conducted on two datasets from a commercial distributed computing platform demonstrate the superiority of our proposed approach.

Jingxiong Ni, Yan Zhao, Kai Zeng, Han Su, Kai Zheng
DARS: Diversity and Distribution-Aware Region Search

Recent years have seen the rapid development of Location-Based Services (LBSs). Many users rely on these services to, for example, plan trips, find houses, or explore their surroundings. In this paper we introduce a novel problem called diversity and distribution-aware region search (DARS). In particular, DARS aims to find regions of size $$a \times b$$ in which the number of different categories is maximized, such that objects of different categories are not too scattered and objects of the same category are within a reasonable distance (a tunable parameter to cater for different users’ needs). We propose several methods to tackle the problem. We first design a sweepline-based method, and then design various techniques to further improve its efficiency. We have conducted extensive experiments over real datasets and demonstrate both the usefulness and the efficiency of our methods.

Siyu Liu, Qizhi Liu, Zhifeng Bao
I/O Efficient Algorithm for c-Approximate Furthest Neighbor Search in High-Dimensional Space

Furthest neighbor search in high-dimensional space has been widely used in many applications such as recommendation systems. Because of the “curse of dimensionality”, c-approximate furthest neighbor (c-AFN) search is a substitute that trades result accuracy for efficiency. However, most current techniques for external memory are only suitable for low-dimensional space. In this paper, we propose a novel algorithm called reverse incremental LSH (RI-LSH), based on Indyk’s LSH scheme, to solve the problem with a theoretical guarantee. Unlike previous methods using hashing schemes, RI-LSH is designed for external memory and achieves good I/O cost. We provide a rigorous theoretical analysis to prove that RI-LSH returns a $$c$$-AFN result with constant probability. Our comprehensive experimental results show that, compared with other $$c$$-AFN methods with theoretical guarantees, our algorithm achieves better I/O efficiency.

Wanqi Liu, Hanchen Wang, Ying Zhang, Lu Qin, Wenjie Zhang
An Efficient Approximate Algorithm for Single-Source Discounted Hitting Time Query

Given a graph G, a source node s and a target node t, the discounted hitting time (DHT) of t with respect to s is the expected number of steps that a random walk starting from s takes to visit t for the first time. For a query node s, the single-source DHT (SSDHT) query returns the top-k nodes with the highest DHT values among all nodes in G. SSDHT is widely adopted in many applications such as query suggestion, link prediction, local community detection, and graph clustering. However, existing methods for SSDHT suffer from high computational costs or provide no guarantee on the results. In this paper, we propose FBRW, an effective SSDHT algorithm that computes DHT values with guaranteed results. We convert DHT into a ratio of personalized PageRank values. By combining Forward Push, Backward propagation and Random Walk, FBRW first evaluates personalized PageRank values and then returns DHT values with low time complexity. To our knowledge, this is the first work to compute SSDHT via personalized PageRank. Extensive experiments demonstrate that FBRW significantly outperforms existing methods while achieving promising effectiveness.

Kaixin Liu, Yong Zhang, Chunxiao Xing
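
To make the quantity behind this abstract concrete: hitting time can be estimated by a naive Monte Carlo baseline that simply simulates random walks (this is not the paper's FBRW algorithm, which avoids such simulation costs; the toy graph and parameters below are made up for illustration, and the discount factor is omitted):

```python
import random

def estimate_hitting_time(graph, s, t, walks=10000, max_steps=1000):
    """Estimate the expected number of steps for a uniform random walk
    starting at s to reach t for the first time (the hitting time)."""
    total, reached = 0, 0
    for _ in range(walks):
        node, steps = s, 0
        while node != t and steps < max_steps:
            node = random.choice(graph[node])  # move to a uniform random neighbor
            steps += 1
        if node == t:
            total += steps
            reached += 1
    return total / reached if reached else float("inf")

# Toy undirected graph as adjacency lists; exact hitting time from 0 to 3 is 9.
g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(estimate_hitting_time(g, 0, 3))  # close to 9 with high probability
```

A single-source query would repeat this estimate for every target node and rank them, which is exactly the cost that sampling-free methods like the one above abstractly described aim to avoid.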
Path Query Processing Using Typical Snapshots in Dynamic Road Networks

The shortest path query in road networks is a fundamental operation in navigation and location-based services. Existing shortest path algorithms aim at improving efficiency in static or time-dependent environments. However, real-life road networks are dynamic, so these algorithms can hardly meet the requirements in practice. In this paper, we aim to support path queries in dynamic road networks by identifying typical snapshots from the snapshot sequence, building path indexes on them, and finally processing each query with the most suitable typical snapshot. Specifically, we first use typical OD (origin-destination) pairs to capture the dynamic information and represent the snapshots. Then the snapshot similarity is measured by considering the shortest path error and the shortest path similarity of these OD pairs. Because the number of OD pairs is huge and they differ in their power to capture the traffic condition, we further propose a hot region-based OD selection method that selects a small but powerful OD set. Lastly, we use the distance-based $$\chi $$-quantile error for query accuracy evaluation and conduct experiments on a large real-world dynamic road network to verify the effectiveness of our method compared with the state-of-the-art.

Mengxuan Zhang, Lei Li, Pingfu Chao, Wen Hua, Xiaofang Zhou
Dynamic Dimension Indexing for Efficient Skyline Maintenance on Data Streams

Skyline computation has received much attention in research and application domains, and many algorithms have been developed for it over the decades. However, maintaining the skyline over data streams is challenging because the skyline must be continuously updated as incoming tuples keep arriving and expired tuples are removed. In this paper, we present RSS, a dynamic dimension indexing based approach to skyline computation on high-dimensional data streams, which is efficient for both count-based and time-based sliding windows regardless of the dimensionality of the data. Our analysis shows that the time complexity of RSS is bounded by a subset of the instant skyline, and our evaluation shows the efficiency of RSS on both low- and high-dimensional data streams.

Rui Liu, Dominique Li
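
For readers unfamiliar with the operator being maintained here, a minimal non-streaming sketch of skyline computation (under minimize-all semantics, with a made-up point set) is:

```python
def dominates(p, q):
    """p dominates q if p is no worse in every dimension and strictly
    better in at least one (assuming smaller values are better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Naive O(n^2) skyline: keep every point not dominated by any other."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

pts = [(1, 9), (3, 3), (2, 8), (5, 1), (4, 4)]
print(skyline(pts))  # → [(1, 9), (3, 3), (2, 8), (5, 1)]; (4, 4) is dominated by (3, 3)
```

The streaming setting addressed by RSS must additionally undo this computation as tuples expire from the sliding window, which is why incremental maintenance rather than recomputation is the focus.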
SCALE: An Efficient Framework for Secure Dynamic Skyline Query Processing in the Cloud

It is now cost-effective to outsource large datasets and perform queries over the cloud. However, in this scenario there are serious security and privacy issues, as sensitive information contained in the dataset can be leaked. The most effective way to address this is to encrypt the data before outsourcing. Nevertheless, it remains a grand challenge to process queries over ciphertext efficiently. In this work, we focus on solving one representative query task, the dynamic skyline query, in a secure manner over the cloud. This query is difficult to perform on encrypted data, as its dynamic domination criteria require both subtraction and comparison, which cannot be directly supported by a single encryption scheme efficiently. To this end, we present a novel framework called SCALE. It works by transforming traditional dynamic skyline domination into pure comparisons. The whole process can be completed in a single round of interaction between the user and the cloud. We theoretically prove that the outsourced database, query requests, and returned results are all kept secret under our model. An empirical study over a series of datasets demonstrates that our framework improves the efficiency of query processing by nearly three orders of magnitude compared to the state-of-the-art.

Weiguo Wang, Hui Li, Yanguo Peng, Sourav S. Bhowmick, Peng Chen, Xiaofeng Chen, Jiangtao Cui
Authenticated Range Query Using SGX for Blockchain Light Clients

Due to limited computing and storage resources, light clients and full nodes coexist in a typical blockchain system. Any query from a light client must be forwarded to full nodes for execution, and the light client verifies the integrity of the returned query results. Since existing authenticated queries based on an Authenticated Data Structure (ADS) suffer from significant network, storage, and computing overheads by virtue of Verification Objects (VOs), an alternative is to turn to a Trusted Execution Environment (TEE), with which light clients need not receive or verify any VO. However, state-of-the-art TEEs cannot handle large-scale applications conveniently due to limited secure memory space (e.g., the size of the enclave in Intel SGX is only 128 MB). Hence, we organize data hierarchically in both trusted (enclave) and untrusted memory, and buffer only hot data in the enclave to reduce the page-swapping overhead between the two kinds of memory. Security analysis and an empirical study validate the effectiveness of our proposed solutions.

Qifeng Shao, Shuaifeng Pang, Zhao Zhang, Cheqing Jing
Stargazing in the Dark: Secure Skyline Queries with SGX

Skylining for multi-criteria decision making is widely applicable and often involves sensitive data that should be encrypted, especially when the database and query engine are outsourced to an untrusted cloud platform. The state-of-the-art designs (ICDE’17) for skylining over encrypted data, while relying on two non-colluding servers, are still slow, taking around three hours to compute the skyline of 9000 2-D points. This paper proposes a very efficient solution using a trusted processor such as SGX. A challenge is to support dynamic queries while keeping the memory footprint small and simultaneously preventing unintended leakage with only lightweight cryptographic primitives. Our proposed approach iteratively loads data into the memory-limited SGX enclave on demand and builds a binary-tree-like index for logarithmic query time. For millions of points, we gain a $$6000\times $$ to $$28000\times $$ improvement in query time over the prior design (ICDE’17).

Jiafan Wang, Minxin Du, Sherman S. M. Chow
Increasing the Efficiency of GPU Bitmap Index Query Processing

Once exotic, computational accelerators are now commonly available in many computing systems. Graphics processing units (GPUs) are perhaps the most frequently encountered computational accelerators. Recent work has shown that GPUs are beneficial when analyzing massive data sets. Specifically related to this study, it has been demonstrated that GPUs can significantly reduce the query processing time of database bitmap index queries. Bitmap indices are typically used for large, read-only data sets and are often compressed using some form of hybrid run-length compression. In this paper, we present three GPU algorithm enhancement strategies for executing queries over bitmap indices compressed using Word Aligned Hybrid (WAH) compression: (1) data structure reuse, (2) metadata creation with various type alignments, and (3) a preallocated memory pool. Data structure reuse greatly reduces the number of costly memory system calls. The metadata exploits the immutable nature of bitmaps to pre-calculate and store necessary intermediate processing results, reducing the number of required query-time processing steps. Preallocating a memory pool can reduce or entirely remove the overhead of memory operations during query processing. Our empirical study shows that combining these strategies can achieve a $$33\times $$ to $$113\times $$ speedup over the unenhanced implementation.

Brandon Tran, Brennan Schaffner, Jason Sawin, Joseph M. Myre, David Chiu
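
As background for the structure being queried, a bitmap index keeps one bitmap per distinct column value so that conjunctive predicates reduce to bitwise ANDs. The sketch below (uncompressed, with made-up data) shows only that query step; WAH compression and the paper's GPU strategies operate on top of it:

```python
def build_bitmaps(column):
    """Build an uncompressed bitmap index: one bitmap (stored as a Python
    int) per distinct value, with bit i set iff row i holds that value."""
    bitmaps = {}
    for i, v in enumerate(column):
        bitmaps[v] = bitmaps.get(v, 0) | (1 << i)
    return bitmaps

color = build_bitmaps(["red", "blue", "red", "green"])
size = build_bitmaps(["S", "S", "L", "S"])

# Conjunctive query: rows where color == "red" AND size == "S".
hits = color["red"] & size["S"]
rows = [i for i in range(4) if hits >> i & 1]
print(rows)  # → [0]
```

Because the bitmaps are immutable for read-only data, intermediate results of such bitwise operations can be precomputed, which is the intuition behind the metadata strategy described above.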
An Effective and Efficient Re-ranking Framework for Social Image Search

With the rapidly increasing popularity of social media websites, large numbers of images with user-annotated tags are uploaded by web users. Developing automatic techniques to retrieve such massive collections of social images has attracted much attention from researchers. A social image search method returns the top-k images for several keywords input by a user. However, the results returned by existing methods are often irrelevant or lacking in diversity, and thus cannot satisfy the user's true intent. In this paper, we propose an effective and efficient re-ranking framework for social image search, which can quickly and accurately return ranking results. We not only consider the consistency of the visual content of images and the semantic interpretations of tags, but also maximize the coverage of the user’s query demand. Specifically, we first build a social relationship graph by exploring the heterogeneous attribute information of social networks. For a given query, to ensure effectiveness, we execute an efficient keyword search algorithm over the social relationship graph and obtain the top-k relevant candidate results. Moreover, we propose a novel re-ranking optimization strategy to refine the candidate results. Meanwhile, we develop an index to accelerate the optimization process, which ensures the efficiency of our framework. Extensive experiments conducted on real-world datasets demonstrate the effectiveness and efficiency of the proposed re-ranking framework.

Bo Lu, Ye Yuan, Yurong Cheng, Guoren Wang, Xiaodong Duan
HEGJoin: Heterogeneous CPU-GPU Epsilon Grids for Accelerated Distance Similarity Join

The distance similarity join operation joins two datasets (or tables), A and B, based on a search distance $$\epsilon $$ ( $$A \ltimes _\epsilon B$$ ), and returns the pairs of points ( $$p_a$$ , $$p_b$$ ), where $$p_a \in A$$ and $$p_b \in B$$ , such that the distance between $$p_a$$ and $$p_b$$ is at most $$\epsilon $$ . In the case where $$A = B$$ , this operation is a similarity self-join ( $$A \bowtie _\epsilon A$$ ). In contrast to the majority of the literature, which focuses on either the CPU or the GPU, we propose in this paper Heterogeneous CPU-GPU Epsilon Grids Join (HEGJoin), an efficient algorithm that processes a distance similarity join using both the CPU and the GPU. We leverage two state-of-the-art algorithms: LBJoin for the GPU and Super-EGO for the CPU. We achieve good load balancing between the architectures by assigning points with larger workloads to the GPU and those with lighter workloads to the CPU through a shared work queue. We examine the performance of our heterogeneous algorithm against LBJoin and Super-EGO by comparing their performance to the upper-bound throughput. We observe that HEGJoin consistently achieves close to this upper bound.

Benoit Gallet, Michael Gowanlock
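The grid idea underlying epsilon joins can be illustrated with a minimal sequential sketch (not the authors’ HEGJoin implementation, which partitions the work queue across CPU and GPU): points of B are bucketed into cells of side ε, so each point of A only needs to be checked against the same and adjacent cells.

```python
import math
from collections import defaultdict

def epsilon_join(points_a, points_b, eps):
    """Grid-accelerated distance similarity join: return pairs (pa, pb)
    of 2-D points with Euclidean distance <= eps."""
    # Bucket B's points into grid cells of side eps, so any pair within
    # eps must lie in the same or an adjacent cell.
    grid = defaultdict(list)
    for pb in points_b:
        cell = (int(pb[0] // eps), int(pb[1] // eps))
        grid[cell].append(pb)
    result = []
    for pa in points_a:
        cx, cy = int(pa[0] // eps), int(pa[1] // eps)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for pb in grid.get((cx + dx, cy + dy), []):
                    if math.dist(pa, pb) <= eps:
                        result.append((pa, pb))
    return result
```

For the self-join case A = B, the same routine applies with `points_a = points_b`; each pair is then reported in both orders, and each point matches itself.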
String Joins with Synonyms

String matching is a fundamental operation in many applications such as data integration, information retrieval and text mining. Since users express the same meaning in a variety of ways that are not textually similar, existing works have proposed variants of Jaccard similarity by using synonyms to consider semantics beyond textual similarities. However, they may produce a non-negligible number of false positives in some applications by employing set semantics and miss some true positives due to approximations. In this paper, we define new match relationships between a pair of strings under synonym rules and develop an efficient algorithm to verify the match relationships for a pair of strings. In addition, we propose two filtering methods to prune non-matching string pairs. We also develop join algorithms with synonyms based on the filtering methods and the match relationships. Experimental results with real-life datasets confirm the effectiveness of our proposed algorithms.

Gwangho Song, Hongrae Lee, Kyuseok Shim, Yoonjae Park, Wooyeol Kim
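A set-semantics baseline of the kind this paper critiques can be sketched by normalizing tokens through synonym rules before computing Jaccard similarity; the rule table below is an illustrative assumption, not the paper’s match relationships.

```python
def expanded_jaccard(s1, s2, rules):
    """Jaccard similarity over token sets after applying synonym rules.
    `rules` maps a token to its canonical form (e.g. "st" -> "street").
    Set semantics can over-match, which is the source of the false
    positives the paper aims to avoid."""
    def normalize(s):
        return {rules.get(tok, tok) for tok in s.lower().split()}
    a, b = normalize(s1), normalize(s2)
    return len(a & b) / len(a | b) if a | b else 1.0
```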
Efficient Query Reverse Engineering Using Table Fragments

Given an output table T that is the result of some unknown query on a database D, Query Reverse Engineering (QRE) computes one or more target queries Q such that the result of Q on D is T. A fundamental challenge in QRE is how to compute target queries efficiently given the large search space. In this paper, we focus on the QRE problem for PJ$^+$ queries, a class more expressive than project-join queries in that it supports antijoins as well as inner joins. To enhance efficiency, we propose a novel query-centric approach consisting of table partitioning, precomputation, and indexing techniques. Our experimental study demonstrates that our approach significantly outperforms the state-of-the-art solution, with an average improvement factor of 120.

Meiying Li, Chee-Yong Chan

Embedding Analysis

Frontmatter
Decentralized Embedding Framework for Large-Scale Networks

Network embedding aims to learn vector representations of vertices that preserve both network structure and properties. However, most existing embedding methods fail to scale to large networks. A few frameworks have been proposed that extend existing methods to cope with embedding large-scale networks. These frameworks either update global parameters iteratively or compress the network while learning vector representations; such schemes inevitably incur either high communication overhead or sub-optimal embedding quality. In this paper, we propose a novel decentralized large-scale network embedding framework called DeLNE. As the name suggests, DeLNE divides a network into smaller partitions and learns vector representations in a distributed fashion, avoiding unnecessary communication overhead. First, our framework uses Variational Graph Convolution Auto-Encoders to embed the structure and properties of each sub-network. Second, we propose an embedding aggregation mechanism that captures the global properties of each node. Third, we propose an alignment function that reconciles all sub-network embeddings into the same vector space. Due to its parallel nature, DeLNE scales well on large clustered environments. Through extensive experimentation on realistic datasets, we show that DeLNE produces high-quality embeddings and outperforms existing large-scale network embedding frameworks in terms of both efficiency and effectiveness.

Mubashir Imran, Hongzhi Yin, Tong Chen, Yingxia Shao, Xiangliang Zhang, Xiaofang Zhou
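The alignment step — reconciling independently trained sub-network embeddings into one vector space — is commonly done with an orthogonal Procrustes fit over shared anchor nodes. The sketch below shows that standard construction as an illustration, not DeLNE’s exact alignment function.

```python
import numpy as np

def align_embeddings(source, target):
    """Orthogonal Procrustes alignment: find the rotation W minimizing
    ||source @ W - target||_F, where rows of `source` and `target` are
    embeddings of the same anchor nodes in two different spaces."""
    # The optimal W is U @ Vt from the SVD of source^T @ target.
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt  # apply to the whole partition as: embeddings @ W
```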
SOLAR: Fusing Node Embeddings and Attributes into an Arbitrary Space

Network embedding has attracted lots of attention in recent years. It learns low-dimensional representations for network nodes, which benefits many downstream tasks such as node classification and link prediction. However, most of the existing approaches are designed for a single network scenario. In the era of big data, the related information from different networks should be fused together to facilitate applications. In this paper, we study the problem of fusing the node embeddings and incomplete node attributes provided by different networks into an arbitrary space. Specifically, we first propose a simple but effective inductive method by learning the relationships among node embeddings and the given attributes. Then, we propose its transductive variant by jointly considering the node embeddings and incomplete attributes. Finally, we introduce its deep transductive variant based on deep AutoEncoder. Experimental results on four datasets demonstrate the superiority of our methods.

Zheng Wang, Jian Cui, Yingying Chen, Changjun Hu
Detection of Wrong Disease Information Using Knowledge-Based Embedding and Attention

The International Classification of Diseases (ICD) code has always been an important component of electronic health records (EHR). Coding errors in ICD have an extremely negative effect on subsequent analyses using EHR. Because some diseases are viewed as a stigma, doctors, despite having made the right diagnosis and prescribed the right drugs, may record a disease with similar symptoms instead of the real one to protect their patients, for example recording febrile convulsions instead of epilepsy. To detect such wrong disease information in EHR, we propose in this paper a method that uses the structured information of medications to correct the code assignments. This approach is novel and useful because patients’ medications must be carefully prescribed without any bias. Specifically, our model employs knowledge-based embedding to obtain better representations of medications and a self-attention mechanism to capture the relations between them. We conduct experiments on a real-world dataset comprising more than 300,000 medical records of over 40,000 patients. The experimental results achieve an AUC score of 0.972, which outperforms the baseline methods and offers good interpretability.

Wei Ge, Wei Guo, Lizhen Cui, Hui Li, Lijin Liu
Tackling MeSH Indexing Dataset Shift with Time-Aware Concept Embedding Learning

Medical Subject Headings (MeSH) is a controlled thesaurus developed by the National Library of Medicine (NLM). MeSH covers a wide variety of biomedical topics, such as diseases and drugs, which are used to classify PubMed articles. Human indexers at NLM have been annotating PubMed articles with MeSH for decades and have collected millions of MeSH-labeled articles. Recently, many deep learning algorithms have been developed to annotate MeSH terms automatically, utilizing this large-scale MeSH indexing dataset. However, most of these models are trained on all articles non-discriminatively, ignoring the temporal structure of the dataset. In this paper, we uncover and thoroughly characterize the problem of MeSH indexing dataset shift (MeSHIFT), meaning that the data distribution changes over time. MeSHIFT includes shifts in the input articles, the output MeSH labels, and the annotation rules. We found that machine learning models suffer a performance loss when they do not tackle MeSHIFT. To this end, we present a novel method, time-aware concept embedding learning (TaCEL), as an attempt to solve it. TaCEL is a plug-in module that can be easily incorporated into other automatic MeSH indexing models. Results show that TaCEL improves current state-of-the-art models with only minimal additional cost. We hope this work facilitates understanding of the MeSH indexing dataset, especially its temporal structure, and provides a solution that can be used to improve current models.

Qiao Jin, Haoyang Ding, Linfeng Li, Haitao Huang, Lei Wang, Jun Yan
Semantic Disambiguation of Embedded Drug-Disease Associations Using Semantically Enriched Deep-Learning Approaches

State-of-the-art approaches in the field of neural-embedding models (NEMs) enable progress in the automatic extraction and prediction of semantic relations between important entities like active substances, diseases, and genes. In particular, the prediction property makes them valuable for important research tasks such as hypothesis generation and drug repositioning. A core challenge in the biomedical domain is obtaining interpretable semantics from NEMs that can distinguish, for instance, between the following two situations: (a) drug $x$ induces disease $y$, and (b) drug $x$ treats disease $y$. However, NEMs alone cannot distinguish between associations such as treats or induces. Is it possible to develop a model that learns a latent representation from the NEMs capable of such disambiguation? To what extent do we need domain knowledge to succeed in the task? In this paper, we answer both questions and show that our proposed approach not only succeeds in the disambiguation task but also advances current growing research efforts to find real predictions using a sophisticated retrospective analysis.

Janus Wawrzinek, José María González Pinto, Oliver Wiehr, Wolf-Tilo Balke

Recommendation

Frontmatter
Heterogeneous Graph Embedding for Cross-Domain Recommendation Through Adversarial Learning

Cross-domain recommendation is critically important for constructing a practical recommender system. The challenges of building a cross-domain recommender system lie in both the data sparsity issue and the lack of sufficient semantic information. Traditional approaches focus on using the user-item rating matrix or other feedback information, but the contents associated with the objects, such as reviews, and the relationships among the objects are largely ignored. Although some works merge content information with the user-item rating network structure, they only use the attributes of items and ignore user-generated content such as reviews. In this paper, we propose a novel cross-domain recommender framework called ECHCDR (Embedding Content and Heterogeneous network for Cross-Domain Recommendation), which consists of two major steps: content embedding and heterogeneous network embedding. By considering the contents of objects and their relationships, ECHCDR can effectively alleviate the data sparsity issue. To enrich the semantic information, we construct a weighted heterogeneous network whose nodes are users and items from different domains. The weight of a link is defined by an adjacency matrix and represents the similarity between users, books, and movies. We also propose an adversarial training method to learn the embeddings of users and cross-domain items in the constructed heterogeneous graph. Experimental results on two real-world datasets collected from Amazon show the effectiveness of our approach compared with state-of-the-art recommender algorithms.

Jin Li, Zhaohui Peng, Senzhang Wang, Xiaokang Xu, Philip S. Yu, Zhenyun Hao
Hierarchical Variational Attention for Sequential Recommendation

Attention mechanisms have been successfully applied in many fields, including sequential recommendation. Existing recommendation methods often use a deterministic attention network to represent latent user preferences as fixed points in low-dimensional spaces. However, a fixed-point representation is not sufficient to characterize the uncertainty of user preferences that prevails in recommender systems. In this paper, we propose a new Hierarchical Variational Attention Model (HVAM), which employs variational inference to model this uncertainty in sequential recommendation. Specifically, the attention vector is represented as a density by imposing a Gaussian distribution rather than as a fixed point in the latent feature space. The variance of the attention vector measures the uncertainty associated with the user’s preference representation. Furthermore, the user’s long-term and short-term preferences are captured through a hierarchical variational attention network. Finally, we evaluate the proposed model HVAM on two public real-world datasets. The experimental results demonstrate the superior performance of our model compared to the state-of-the-art methods for sequential recommendation.

Jing Zhao, Pengpeng Zhao, Yanchi Liu, Victor S. Sheng, Zhixu Li, Lei Zhao
Mutual Self Attention Recommendation with Gated Fusion Between Ratings and Reviews

Product ratings and reviews provide rich, useful information about users and items and are widely used in recommender systems. However, it is nontrivial to infer user preferences from historical behaviors, since users have different and complicated interests in different target items. Hence, precisely modelling the contextual interaction of users and target items is important for learning their representations. In this paper, we propose a unified recommendation method with a mutual self-attention mechanism to capture the inherent interactions between users and items from reviews and ratings. In our method, we design a review encoder based on mutual self-attention to extract the semantic features of users and items from their reviews. The mutual self-attention automatically captures the interaction information between words in the user review and the item review. Another rating-based encoder is utilized to learn representations of users and items from rating patterns. Besides, we propose a neural gated network to effectively fuse the review-based and rating-based features into final comprehensive representations of users and items for rating prediction. Extensive experiments conducted on real-world recommendation datasets show that our method achieves better recommendation performance than many competitive baselines.

Qiyao Peng, Hongtao Liu, Yang Yu, Hongyan Xu, Weidi Dai, Pengfei Jiao
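A neural gate of the kind described — fusing review-based and rating-based features — typically computes a sigmoid gate from the concatenated features and interpolates between the two sources. The shapes and parameter names below are illustrative assumptions, not the paper’s exact architecture.

```python
import numpy as np

def gated_fusion(review_feat, rating_feat, W, b):
    """Gated fusion of two feature vectors of dimension d: a learned
    sigmoid gate (parameters W of shape (d, 2d) and b of shape (d,))
    decides, per dimension, how much of each source enters the fused
    representation."""
    concat = np.concatenate([review_feat, rating_feat])
    gate = 1.0 / (1.0 + np.exp(-(W @ concat + b)))  # sigmoid gate in (0, 1)
    return gate * review_feat + (1.0 - gate) * rating_feat
```

With zero-initialized parameters the gate is 0.5 everywhere, so the fusion starts as a plain average and learns to favor one source per dimension during training.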
Modeling Periodic Pattern with Self-Attention Network for Sequential Recommendation

Repeat consumption is a common phenomenon in sequential recommendation tasks, where a user revisits or repurchases items that (s)he has interacted with before. Previous research has paid attention to repeat recommendation and made great achievements in this field. However, existing studies rarely consider the phenomenon that consumers tend to show different behavior periodicities on different items, which is important for recommendation performance. In this paper, we propose a holistic model that integrates a Graph Convolutional Network with a Periodic-Attenuated Self-Attention Network (GPASAN) to model users’ different behavior patterns for better recommendation. Specifically, we first process all users’ action sequences to construct a graph structure, which captures complex item connections and yields item representations. Then, we employ a periodic channel and an attenuated channel that incorporate temporal information into the self-attention mechanism to model the user’s periodic and novel behaviors, respectively. Extensive experiments conducted on three public datasets show that our proposed model consistently outperforms the state-of-the-art methods.

Jun Ma, Pengpeng Zhao, Yanchi Liu, Victor S. Sheng, Jiajie Xu, Lei Zhao
Cross-Domain Recommendation with Adversarial Examples

Cross-domain recommendation leverages knowledge from relevant domains to alleviate the data sparsity issue. However, we find that state-of-the-art cross-domain models are vulnerable to adversarial examples, leading to possibly large generalization errors. This is because most methods rarely consider the robustness of the proposed models. In this paper, we propose a new Adversarial Cross-Domain Network (ACDN), in which adversarial examples are dynamically generated to train the cross-domain recommendation model. Specifically, we first combine two multilayer perceptrons that share the user embedding matrix as our base model. Then, we add small but intentionally worst-case perturbations to the model’s embedding representations to construct adversarial examples, which can cause the model to output an incorrect answer with high confidence. By training with these aggressive examples, we obtain a robust cross-domain model. Finally, we evaluate the proposed model on two large real-world datasets. Our experimental results show that our model significantly outperforms the state-of-the-art methods on cross-domain recommendation.

Haoran Yan, Pengpeng Zhao, Fuzhen Zhuang, Deqing Wang, Yanchi Liu, Victor S. Sheng
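The “small but intentionally worst-case perturbations” can be sketched as a fast-gradient step on the embedding: move in the direction of the loss gradient with a fixed norm budget. This is one common construction of adversarial examples on embeddings, not necessarily ACDN’s exact recipe.

```python
import numpy as np

def fgsm_perturb(embedding, grad, epsilon=0.05):
    """Fast-gradient-style adversarial perturbation: step in the
    direction that maximally increases the loss, with the step's
    L2 norm fixed to epsilon."""
    norm = np.linalg.norm(grad)
    if norm == 0:
        return embedding  # no gradient signal, nothing to perturb
    return embedding + epsilon * grad / norm
```

During training, the model is optimized on both the clean and the perturbed embeddings, which is what yields the robustness the abstract describes.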
DDFL: A Deep Dual Function Learning-Based Model for Recommender Systems

Over the last two decades, latent-based collaborative filtering (CF) has been extensively studied in recommender systems to match users with appropriate items. In general, CF can be categorized into two types: matching function learning-based CF and representation learning-based CF. Matching function-based CF uses a multi-layer perceptron to learn the complex matching function that maps user-item pairs to matching scores, while representation learning-based CF maps users and items into a common latent space and adopts the dot product to model their relationship. However, the dot product is prone to overfitting and does not satisfy the triangle inequality. Different from latent-based CF, metric learning represents users and items in a low-dimensional space, measures their closeness explicitly by Euclidean distance, and satisfies the triangle inequality. In this paper, inspired by the success of metric learning, we supercharge metric learning with non-linearities and propose a Metric Function Learning (MeFL) model to learn the function that maps user-item pairs to predictive scores in the metric space. Moreover, to learn the mapping more comprehensively, we further combine MeFL with a matching function learning model into a unified framework, named Deep Dual Function Learning (DDFL). Extensive experiments are conducted on four benchmark datasets, and the results verify the effectiveness of MeFL and DDFL over state-of-the-art models for implicit feedback prediction.

Syed Tauhid Ullah Shah, Jianjun Li, Zhiqiang Guo, Guohui Li, Quan Zhou
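The contrast the abstract draws between dot-product scoring and metric learning can be made concrete: the dot product is not a distance and obeys no triangle inequality, whereas the negative Euclidean distance does. A minimal sketch of the two scoring functions:

```python
import numpy as np

def dot_score(u, v):
    """Dot-product matching score used by representation learning-based CF."""
    return float(u @ v)

def metric_score(u, v):
    """Negative Euclidean distance: higher means closer. The underlying
    distance satisfies the triangle inequality, unlike the dot product."""
    return -float(np.linalg.norm(u - v))
```

Because Euclidean distance satisfies d(a, c) ≤ d(a, b) + d(b, c), items close to a common neighbor are guaranteed to be somewhat close to each other — a geometric property the dot product cannot provide.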
Zero-Injection Meets Deep Learning: Boosting the Accuracy of Collaborative Filtering in Top-N Recommendation

Zero-Injection has been known to be very effective in alleviating the data sparsity problem in collaborative filtering (CF), owing to its idea of finding and exploiting uninteresting items as users’ negative preferences. However, this idea has been only applied to the linear CF models such as SVD and SVD++, where the linear interactions among users and items may have a limitation in fully exploiting the additional negative preferences from uninteresting items. To overcome this limitation, we explore CF based on deep learning models which are highly flexible and thus expected to fully enjoy the benefits from uninteresting items. Empirically, our proposed models equipped with Zero-Injection achieve great improvements of recommendation accuracy under various situations such as basic top-N recommendation, long-tail item recommendation, and recommendation to cold-start users.

Dong-Kyu Chae, Jin-Soo Kang, Sang-Wook Kim
DEAMER: A Deep Exposure-Aware Multimodal Content-Based Recommendation System

Modern content-based recommendation systems have greatly benefited from deep neural networks, which can effectively learn feature representations from item descriptions and user profiles. However, the supervision signals to guide the representation learning are generally incomplete (i.e., the majority of ratings are missing) and/or implicit (i.e., only historical interactions showing implicit preferences are available). The learned representations will be biased in this case; and consequently, the recommendations are over-specified. To alleviate this problem, we present a Deep Exposure-Aware Multimodal contEnt-based Recommender (i.e., DEAMER) in this paper. DEAMER can jointly exploit rating and interaction signals via multi-task learning. DEAMER mimics the expose-evaluate process in recommender systems where an item is evaluated only if it is exposed to the user. DEAMER generates the exposure status by matching multi-modal user and item content features. Then the rating value is predicted based on the exposure status. To verify the effectiveness of DEAMER, we conduct comprehensive experiments on a variety of e-commerce data sets. We show that DEAMER outperforms state-of-the-art shallow and deep recommendation models on recommendation tasks such as rating prediction and top-k recommendation. Furthermore, DEAMER can be adapted to extract insightful patterns of both users and items.

Yunsen Hong, Hui Li, Xiaoli Wang, Chen Lin
Recurrent Convolution Basket Map for Diversity Next-Basket Recommendation

Next-basket recommendation plays an important role in both online and offline markets. Existing methods often suffer from three challenges: information loss in basket encoding, sequential pattern mining of the shopping history, and the diversity of recommendations. In this paper, we contribute a novel solution called Rec-BMap (“Recurrent Convolution Basket Map”) to address these three challenges. Specifically, we first propose the basket map, which encodes not only the items in a basket without information loss, but also the static and dynamic properties of those items. A convolutional neural network is then applied to the basket map to generate the basket embedding. Next, a Time-LSTM with a time gate is proposed to learn sequential patterns from the consumer’s historical transactions with different time intervals. Finally, a deconvolutional neural network is employed to generate diverse next-basket recommendations. Experiments on two real-world datasets demonstrate that the proposed model outperforms existing baselines.

Youfang Leng, Li Yu, Jie Xiong, Guanyu Xu
Modeling Long-Term and Short-Term Interests with Parallel Attentions for Session-Based Recommendation

The aim of session-based recommendation is to predict the user’s next clicked item, which is a challenging task due to the inherent uncertainty of user behaviors and anonymous implicit feedback. A powerful session-based recommender can typically explore a user’s evolving interests (i.e., a combination of his/her long-term and short-term interests). Recent advances in attention mechanisms have led to state-of-the-art methods for this task. However, there are two main drawbacks. First, most attention-based methods simply use the last clicked item to represent the user’s short-term interest, ignoring temporal information and behavioral context, and may thus fail to capture the user’s recent preferences comprehensively. Second, current studies typically treat long-term and short-term interests as equally important, but their relative importance should be user-specific. Therefore, we propose a novel Parallel Attention Network (PAN) for session-based recommendation. Specifically, we propose a novel time-aware attention mechanism to learn the user’s short-term interest by taking contextual information and temporal signals into account simultaneously. Besides, we introduce a gated fusion method that adaptively integrates the user’s long-term and short-term preferences to generate a hybrid interest representation. Experiments on three real-world datasets show that PAN achieves clear improvements over the state-of-the-art methods.

Jing Zhu, Yanan Xu, Yanmin Zhu

Industrial Papers

Frontmatter
Recommendation on Heterogeneous Information Network with Type-Sensitive Sampling

Most entities and relations for recommendation tasks in the real world are of multiple types, large-scale, and power-law distributed. Heterogeneous information network (HIN) based approaches are widely used in recommendation to model such heterogeneous data. However, most HIN-based approaches learn the latent representation of entities through meta-paths, which are predefined using prior knowledge and thus limit the combinatorial generalization of the HIN. Graph neural networks (GNNs) collect and generalize the information of nodes over a receptive field, but most works focus on homogeneous graphs and fail to scale to power-law graphs. In this paper, we propose a HIN-based framework for recommendation tasks in which we utilize GNNs with type-sensitive sampling to handle heterogeneous, power-law graphs. For each layer, we adopt schema-based attention to output a sampling distribution over types, and then use importance sampling within each type to output the sampled neighbors. We conduct extensive experiments on four public datasets and one private dataset, all carefully selected to cover different graph scales. In particular, on the largest heterogeneous graph, with 0.4 billion edges, we reduce the squared error by 2.5% while improving training convergence time by 26%, which verifies the effectiveness and scalability of our method for industrial recommendation tasks.

Jinze Bai, Jialin Wang, Zhao Li, Donghui Ding, Jiaming Huang, Pengrui Hui, Jun Gao, Ji Zhang, Zujie Ren
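The two-level sampling described — a schema-level attention distribution over types, then importance sampling within each type — can be sketched as follows, with uniform within-type sampling standing in for the learned importance weights (an assumption made purely for illustration).

```python
import random

def type_sensitive_sample(neighbors_by_type, type_weights, k):
    """Two-level neighbor sampling for a heterogeneous graph: first draw
    a node type from the (attention-style) type distribution, then draw
    a neighbor of that type uniformly."""
    # Only types that actually have neighbors are candidates.
    types = [t for t in neighbors_by_type if neighbors_by_type[t]]
    weights = [type_weights[t] for t in types]
    sampled = []
    for _ in range(k):
        t = random.choices(types, weights=weights)[0]
        sampled.append(random.choice(neighbors_by_type[t]))
    return sampled
```

Sampling a fixed number k of neighbors per layer is what bounds the receptive field on power-law graphs, where a hub node may otherwise have millions of neighbors.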
Adaptive Loading Plan Decision Based upon Limited Transport Capacity

Cargo distribution is one of the most critical issues in the steel logistics industry, whose core task is to determine the cargo loading plan for each truck. Because cargos far outnumber the available transport capacity in this industry, traditional policies treat all cargos equally and distribute them to each arriving truck with the aim of maximizing the load of each truck. However, they ignore the timely delivery of high-priority cargos, which causes a great loss to the profit of the steel enterprise. In this paper, we first bring forward a data-driven cargo loading plan decision framework, called ALPD, that aims to maximize high-priority cargo delivery. Specifically, by analyzing historical steel logistics data, we extract significant limiting rules related to the loading plan decision process. Then a two-step online decision mechanism is designed to achieve optimal cargo loading plan decisions in each time period; it consists of genetic-algorithm-based loading plan generation and breadth-first-traversal-based loading plan path searching. Furthermore, an adaptive time-window-based solution is introduced to address the low decision efficiency caused by the uneven distribution of the number of arriving trucks across time periods. Extensive experimental results on real steel logistics data generated from Rizhao Steel’s logistics platform validate the effectiveness and practicality of our proposal.

Jiaye Liu, Jiali Mao, Jiajun Liao, Yuanhang Ma, Ye Guo, Huiqi Hu, Aoying Zhou, Cheqing Jin
Intention-Based Destination Recommendation in Navigation Systems

Taking natural language as input has been widely adopted in applications. Since navigation, one of the most fundamental and frequently used applications, is usually used while driving, a navigation application that takes natural language as input can reduce driving risk. In reality, people express their moving intentions in different ways. For example, ‘I want to go to the bank’ and ‘I want to cash a check’ both reveal that the user wants to go to the bank. We therefore propose a new navigation system that takes natural language as input and recommends destinations to users by detecting their moving intentions. The navigation system first utilizes a wealth of check-in data to extract corresponding words, including objects and actions, for different types of locations. The extracted information is then used to detect users’ moving intentions and recommend suitable destinations. We formalize this task as the problem of building a model that classifies users’ input sentences into location types and finds proper destinations to recommend. For empirical study, we conduct extensive experiments based on real datasets to evaluate the performance and effectiveness of our navigation system.

Shuncheng Liu, Guanglin Cong, Bolong Zheng, Yan Zhao, Kai Zheng, Han Su
Towards Accurate Retail Demand Forecasting Using Deep Neural Networks

Accurate product sales forecasting, also known as demand forecasting, is important for retailers to avoid either insufficient or excess inventory in product warehouses. Traditional works adopt either univariate or multivariate time series models. Unfortunately, previous prediction methods frequently ignore the inherent structural information of product items, such as the relations between items and brands and the relations among various items, and thus cannot produce accurate forecasts. To this end, we propose in this paper a deep learning-based prediction model, namely the Structural Temporal Attention network (STANet), to adaptively capture the inherent inter-dependencies and temporal characteristics among product items. STANet uses a graph attention network and variable-wise temporal attention to extract inter-dependencies among product items and to discover dynamic temporal characteristics, respectively. Evaluation on two real-world datasets validates that our model achieves better results than state-of-the-art methods.

Shanhe Liao, Jiaming Yin, Weixiong Rao

Demo Papers

Frontmatter
AuthQX: Enabling Authenticated Query over Blockchain via Intel SGX

With the popularization of blockchain technology in traditional industries, the desire to support various authenticated queries becomes more urgent, yet current blockchain platforms cannot offer light clients sufficient means of achieving authenticated queries: Authenticated Data Structures (ADS) suffer from performance issues, and state-of-the-art Trusted Execution Environments (TEE) cannot conveniently handle large-scale applications due to limited secure memory. In this study, we present a new query authentication scheme, named AuthQX, leveraging the commonly available trusted environment of Intel SGX. AuthQX organizes data hierarchically in the trusted SGX enclave and untrusted memory to implement authenticated queries cheaply.

Shuaifeng Pang, Qifeng Shao, Zhao Zhang, Cheqing Jin
SuperQuery: Single Query Access Technique for Heterogeneous DBMS

With the increasing interest in machine learning, data management has been receiving significant attention. We propose SuperQuery, a big data virtualization technique that abstracts the physical elements of heterogeneous DBMS in various systems and integrates data into a single access channel. SuperQuery can integrate and collect data regardless of the location, shape, and structure of the data and enable rapid data analysis.

Philip Wootaek Shin, Kyujong Han, Gibeom Kil
MDSE: Searching Multi-source Heterogeneous Material Data via Semantic Information Extraction

In this paper, we demonstrate MDSE, which provides effective information extraction and searching for multi-source heterogeneous materials data that are collected as XML documents. The major features of MDSE are: (1) We propose a transfer-learning-based approach to extract material information from non-textual material data, including images, videos, etc. (2) We present a heterogeneous-graph-based method to extract the semantic relationships among material data. (3) We build a search engine with both Google-like and tabular searching UIs to provide functional searching on integrated material data. After a brief introduction to the architecture and key technologies of MDSE, we present a case study to demonstrate the working process and the effectiveness of MDSE.

Jialing Liang, Peiquan Jin, Lin Mu, Xin Hong, Linli Qi, Shouhong Wan
BigARM: A Big-Data-Driven Airport Resource Management Engine and Application Tools

Resource management becomes a critical issue in airport operation, since passenger throughput grows rapidly while fixed resources such as baggage carousels hardly increase. We propose a Big-data-driven Airport Resource Management (BigARM) engine and develop a suite of application tools for efficient resource utilization and customer service excellence. Specifically, we apply BigARM to manage baggage carousels, balancing overloaded carousels and reducing the planning and rescheduling workload for operators. With big data analytic techniques, BigARM accurately predicts flight arrival times using features extracted from cross-domain data. Together with a multi-variable reinforcement learning allocation algorithm, BigARM makes intelligent allocation decisions that achieve baggage load balance. We demonstrate BigARM generating full-day initial allocation plans and recommendations for dynamic allocation adjustments, and verify its effectiveness.

Ka Ho Wong, Jiannong Cao, Yu Yang, Wengen Li, Jia Wang, Zhongyu Yao, Suyan Xu, Esther Ahn Chian Ku, Chun On Wong, David Leung
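The load-balance objective described above can be illustrated with a greedy baseline (all names here are hypothetical; BigARM's actual allocator uses reinforcement learning over cross-domain features, which this sketch does not attempt):

```python
import heapq

def allocate_carousels(flight_bags, n_carousels):
    """Greedy baseline for baggage load balance: assign each flight to the
    currently least-loaded carousel, tracked with a min-heap of loads.
    Illustrative only -- not BigARM's reinforcement-learning allocator."""
    heap = [(0, c) for c in range(n_carousels)]   # (current load, carousel id)
    heapq.heapify(heap)
    plan = {}
    for flight, bags in flight_bags:
        load, c = heapq.heappop(heap)             # least-loaded carousel
        plan[flight] = c
        heapq.heappush(heap, (load + bags, c))    # account for the new bags
    return plan

plan = allocate_carousels([("CX100", 120), ("KA55", 80), ("CX7", 100)], 2)
print(plan)  # → {'CX100': 0, 'KA55': 1, 'CX7': 1}
```

The greedy rule already balances total load reasonably well (200 vs. 120 bags here); the point of a learned allocator is to also anticipate predicted arrival times rather than react to them.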
S2AP: Sequential Senti-Weibo Analysis Platform

Microblogging sentiment analysis explores people’s opinions on social networks such as Twitter and Weibo. Existing work mainly focuses on English corpora built with distant supervision, ignoring both the noisy data in such corpora and non-English settings. The field of Weibo sentiment analysis lacks a large-scale, complete corpus for application and evaluation. In this work, we formulate corpus construction as an information retrieval problem and construct a Weibo sentiment analysis corpus called Senti-weibo. We also release a Weibo pre-processing toolkit to unify the pre-processing rules for Weibo text. Finally, we build on these components to implement S2AP, a real-time Weibo sentiment analysis platform that analyzes and tracks the sequential sentiment of Weibo topics.

Shuo Wan, Bohan Li, Anman Zhang, Wenhuan Wang, Donghai Guan
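Unified pre-processing rules of the kind the toolkit above provides typically strip platform artifacts before sentiment analysis. A minimal sketch, assuming typical microblog cleaning rules (these are illustrative, not the actual Senti-weibo toolkit's rules):

```python
import re

def clean_weibo(text: str) -> str:
    """Apply typical microblog pre-processing rules: strip URLs,
    @-mentions, and retweet markers, unwrap hashtags, collapse whitespace."""
    text = re.sub(r"https?://\S+", "", text)   # URLs
    text = re.sub(r"@\S+", "", text)           # @-mentions
    text = re.sub(r"转发微博|//", "", text)     # common retweet markers
    text = re.sub(r"#([^#]+)#", r"\1", text)   # keep hashtag text, drop '#...#'
    return re.sub(r"\s+", " ", text).strip()

print(clean_weibo("#开心# 今天天气真好 @friend https://t.cn/abc"))
# → 开心 今天天气真好
```

Publishing such rules as a shared toolkit matters because sentiment models trained and served with different cleaning rules see systematically different token distributions.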
An Efficient Secondary Index for Spatial Data Based on LevelDB

Spatial data is characterized by spatial locations, unstructured forms, spatial relationships, and massive volume. General-purpose commercial databases struggle to meet these requirements, and adding a spatial extension is non-trivial because spatial data brings new challenges to key-value stores. First, a key-value database has no built-in way to query a key from its value. Second, we need to ensure both data consistency and the timeliness of spatial data. To this end, we propose a secondary index based on LevelDB and the R-tree that supports two-dimensional data indexing and K-nearest-neighbor (KNN) queries. Further, we optimize queries over the large volumes of spatial data generated by moving objects. Finally, we conduct extensive experiments on real-world datasets which show that our hierarchical index has a small footprint and excellent query performance.

Rui Xu, Zihao Liu, Huiqi Hu, Weining Qian, Aoying Zhou
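The core difficulty named above, that a key-value store cannot look up a key from its value, is what a secondary index solves: spatial coordinates are maintained in a separate structure alongside the primary store. A minimal sketch, with a dict standing in for LevelDB and brute-force distance scans standing in for the paper's R-tree:

```python
import heapq

class SpatialSecondaryIndex:
    """Sketch of a secondary index over a key-value store. A dict stands in
    for LevelDB, and brute force stands in for the R-tree; the consistency
    requirement is that both structures are updated together on every put."""
    def __init__(self):
        self.kv = {}      # primary store: key -> record
        self.index = {}   # secondary index: key -> (x, y)

    def put(self, key, record, x, y):
        self.kv[key] = record
        self.index[key] = (x, y)   # kept consistent with the primary store

    def knn(self, x, y, k):
        # K-nearest-neighbor over indexed coordinates (squared distance)
        return heapq.nsmallest(
            k, self.index,
            key=lambda kk: (self.index[kk][0] - x) ** 2
                         + (self.index[kk][1] - y) ** 2)

idx = SpatialSecondaryIndex()
idx.put("a", "cafe", 0, 0)
idx.put("b", "park", 3, 4)
idx.put("c", "shop", 1, 1)
print(idx.knn(0, 0, 2))  # → ['a', 'c']
```

An R-tree replaces the linear scan in `knn` with a tree traversal that prunes far-away regions, which is what makes KNN feasible at the data volumes the abstract targets.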
A Trustworthy Evaluation System Based on Blockchain

With the development of the Internet, online services such as shopping, movies, and music have greatly facilitated people’s lives. However, widespread false information on these websites threatens users’ interests and privacy. To address this problem, we designed TESB2, a trustworthy evaluation system based on blockchain. TESB2 uses blockchain technology to effectively protect users’ data and privacy. It guides users’ evaluation behavior and ensures the fairness and credibility of evaluation information by tying that behavior to reputation rewards and punishments. We also propose a novel malicious user detection method that combines blockchain technology with machine learning. Finally, we demonstrate TESB2 in a movie scenario, where its performance is satisfactory.

Haokai Ji, Chundong Wang, Xu Jiao, Xiuliang Mo, Wenjun Yang
An Interactive System for Knowledge Graph Search

In recent years, knowledge graphs (KGs) have experienced rapid growth: they contain an enormous volume of facts about the real world and have become a source of various kinds of knowledge. It is hence highly desirable that the query-processing engine of a KG can process queries posed directly in natural language, even though such queries bring various ambiguities. In this paper, we present $$\mathsf {KGBot}$$ , an interactive system for searching information in knowledge graphs with natural language. $$\mathsf {KGBot}$$ (1) understands queries issued in natural language; (2) resolves query ambiguity via human-computer interaction; and (3) provides a graphical interface for interacting with users.

Sinha Baivab, Xin Wang, Wei Jiang, Ju Ma, Huayi Zhan, Xueyan Zhong
STRATEGY: A Flexible Job-Shop Scheduling System for Large-Scale Complex Products

Production scheduling plays an important role in manufacturing. With the rapid growth in product quantity and diversity, manual scheduling becomes increasingly difficult and inefficient, which has led many researchers to develop systems and algorithms for automatic scheduling. However, existing solutions focus on the standard flexible job-shop scheduling problem (FJSP), which requires the operations of each job to be totally ordered, while in reality they are usually only partially ordered, resulting in a more general and complicated problem. To tackle this problem, we develop STRATEGY, a lightweight scheduling system with strong generality. In this paper, we describe the main features and key techniques of our system, and present the scenarios to be demonstrated.

Zhiyu Liang, Hongzhi Wang, Jijia Yang
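The distinction drawn above between totally and partially ordered operations can be made concrete: with partial orders, any topological order of the precedence DAG is a feasible operation sequence. A minimal sketch using Kahn's algorithm (illustrative only; STRATEGY's actual scheduler also assigns machines and optimizes makespan):

```python
from collections import deque

def valid_operation_order(ops, precedes):
    """Kahn's topological sort: one feasible sequence for partially ordered
    operations. ops: operation ids; precedes: (a, b) pairs meaning a must
    finish before b starts."""
    indeg = {o: 0 for o in ops}
    succ = {o: [] for o in ops}
    for a, b in precedes:
        succ[a].append(b)
        indeg[b] += 1
    ready = deque(o for o in ops if indeg[o] == 0)
    order = []
    while ready:
        o = ready.popleft()
        order.append(o)
        for nxt in succ[o]:          # releasing o may unblock successors
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                ready.append(nxt)
    return order

# 'cut' precedes both 'drill' and 'paint'; the latter two are unordered,
# so a scheduler is free to interleave them -- the extra freedom a total
# order would forbid.
print(valid_operation_order(["cut", "drill", "paint"],
                            [("cut", "drill"), ("cut", "paint")]))
# → ['cut', 'drill', 'paint']
```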
Federated Acoustic Model Optimization for Automatic Speech Recognition

Traditional Automatic Speech Recognition (ASR) systems are usually trained on speech records centralized on the ASR vendor’s machines. However, with data regulations such as the General Data Protection Regulation (GDPR) coming into force, sensitive data such as speech records can no longer be utilized in such a centralized approach. In this demonstration, we propose federated acoustic model optimization to solve this problem. The demonstration not only vividly shows the underlying working mechanisms of the proposed method but also provides an interface for the user to customize its hyperparameters. The audience can experience the effect of federated learning interactively, and we hope this demonstration will inspire more research on GDPR-compliant ASR technologies.

Conghui Tan, Di Jiang, Huaxiao Mo, Jinhua Peng, Yongxin Tong, Weiwei Zhao, Chaotao Chen, Rongzhong Lian, Yuanfeng Song, Qian Xu
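The federated setting above can be sketched with generic federated averaging (FedAvg): clients train locally and upload only model weights, never raw speech. This is a sketch of the general technique, not necessarily the authors' exact optimization scheme:

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """One server round of federated averaging: combine locally trained
    weight vectors, weighted by each client's local data size. No raw
    speech records ever leave the clients."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# two clients with locally trained (toy) acoustic-model weights
w1, w2 = np.array([1.0, 2.0]), np.array([3.0, 6.0])
avg = federated_average([w1, w2], [10, 30])
print(avg)  # weighted toward the larger client
```

With sizes 10 and 30, the result is 0.25·w1 + 0.75·w2 = [2.5, 5.0]; the server then broadcasts this average back for the next local training round.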
EvsJSON: An Efficient Validator for Split JSON Documents

JSON is one of the most popular formats for publishing and exchanging data. In real application scenarios, due to field-length limits imposed before data is stored in a database, a long JSON document may be split into multiple documents. In such cases, the integrity and accuracy of the documents must be validated, which existing methods cannot do. In this paper, we propose a novel method to validate JSON documents that is able to deal with split documents. Experiments demonstrate that the proposed method is efficient in validating large-scale JSON documents and outperforms the compared methods.

Bangjun He, Jie Zuo, Qiaoyan Feng, Guicai Xie, Ruiqi Qin, Zihao Chen, Lei Duan
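The validation problem above can be illustrated with the naive baseline: reassemble the fragments and check that the whole parses. (The paper's method is presumably smarter than full concatenation; this sketch only shows the integrity property being checked.)

```python
import json

def validate_split_json(fragments):
    """Baseline integrity check for a JSON document split into fragments:
    join the pieces in order and verify the result parses as JSON.
    A missing or corrupted fragment makes the whole document invalid."""
    try:
        json.loads("".join(fragments))
        return True
    except json.JSONDecodeError:
        return False

# a long record stored as two fragments due to a field-length limit
parts = ['{"sensor": "t1", "readings": [1, 2,', ' 3], "unit": "C"}']
print(validate_split_json(parts))       # → True
print(validate_split_json(parts[:1]))   # → False: a lost fragment is detected
```

The weakness of this baseline, and the motivation for a dedicated validator, is that it must buffer and re-parse the entire document even when only one fragment changed.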
GMDA: An Automatic Data Analysis System for Industrial Production

Data-driven methods have shown many advantages over experience- and mechanism-based approaches in optimizing production. In this paper, we propose an AI-driven automatic data analysis system developed for small and medium-sized industrial enterprises that lack data analysis expertise. To achieve this goal, we design a structured, understandable task description language for problem modeling, propose a supervised learning method for algorithm selection, and implement a random search algorithm for hyper-parameter optimization, which makes our system highly automated and generic. We choose the R language as the algorithm engine due to its powerful analysis capabilities. System reliability is ensured by an interactive analysis mechanism. Examples show how our system applies to representative analysis tasks in manufacturing.

Zhiyu Liang, Hongzhi Wang, Hao Zhang, Hengyu Guo
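Random search for hyper-parameter optimization, as named above, is simple to state: sample configurations from the search space and keep the best scorer. A generic sketch of the technique (the toy objective and parameter names are invented for illustration; GMDA runs its analysis in R, not Python):

```python
import random

def random_search(evaluate, space, n_iter=50, seed=0):
    """Random hyper-parameter search: draw n_iter configurations uniformly
    from the space and return the best-scoring one with its score."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_iter):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# toy objective: prefer depth near 8 and a small learning rate
space = {"depth": [2, 4, 8, 16], "lr": [0.001, 0.01, 0.1]}
objective = lambda c: -abs(c["depth"] - 8) - 10 * c["lr"]
best, score = random_search(objective, space)
print(best, score)
```

Despite its simplicity, random search is a strong default for automated systems because it needs no gradient, no model of the space, and parallelizes trivially.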
An Intelligent Online Judge System for Programming Training

Online judge (OJ) systems are becoming increasingly popular in applications such as programming training, competitive programming contests, and even employee recruitment, mainly due to their ability to evaluate code submissions automatically. In higher education, OJ systems have been used extensively in programming courses because automatic evaluation drastically reduces the grading workload of instructors and teaching assistants, making class sizes scalable. However, in our teaching we find that existing OJ systems should improve the feedback they give to students and teachers, especially on code errors and knowledge states. The lack of such automatic feedback increases teachers’ involvement and thus prevents college programming training from becoming more scalable. To tackle this challenge, we leverage historical student data from our OJ system and implement two automated functions, code error prediction and student knowledge tracing, using machine learning models. We demonstrate how students and teachers may benefit from these two functions during programming training.

Yu Dong, Jingyang Hou, Xuesong Lu
WTPST: Waiting Time Prediction for Steel Logistical Queuing Trucks

In the absence of reasonable queuing rules for trucks transporting steel raw materials, trucks have to wait in long queues inside and outside the steel mill. This necessitates an effective waiting time prediction method to help managers design better queuing rules and improve drivers’ satisfaction. However, due to the particularities of the steel logistics industry, little research has been conducted on this issue. As steel logistics becomes increasingly informatized, a huge amount of data has been generated on steel logistics platforms, offering an opportunity to address it. This paper presents a waiting time prediction framework called WTPST. By analyzing data from multiple sources, including in-plant and off-plant queuing information, in-plant trucks’ unloading logs, and cargo discharging capability data, WTPST extracts meaningful features related to queuing waiting time. On top of the extracted features, a game-based modeling mechanism is designed to improve prediction precision. We demonstrate that WTPST is capable of predicting the waiting time for each queuing truck, which enhances unloading efficiency in steel logistics. In addition, comparative experiments show that the prediction accuracy of WTPST outperforms the baseline approaches.

Wei Zhao, Jiali Mao, Shengcheng Cai, Peng Cai, Dai Sun, Cheqing Jin, Ye Guo
A System for Risk Assessment of Privacy Disclosure

The widespread sharing of large amounts of information promotes the development of many industries, such as health care. However, data owners should pay attention to privacy preservation when sharing data. This paper presents a risk assessment system that assesses the risk of privacy disclosure due to data sharing and helps data owners evaluate whether it is safe to share their data.

Zhihui Wang, Siqin Li, Xuchen Zhou, Yu Wang, Wenbiao Xing, Yun Zhu, Zijing Tan, Wei Wang
Backmatter
Title
Database Systems for Advanced Applications
Editors
Yunmook Nah
Bin Cui
Sang-Won Lee
Jeffrey Xu Yu
Yang-Sae Moon
Steven Euijong Whang
Copyright Year
2020
Electronic ISBN
978-3-030-59419-0
Print ISBN
978-3-030-59418-3
DOI
https://doi.org/10.1007/978-3-030-59419-0
