2019 | Book

Database Systems for Advanced Applications

24th International Conference, DASFAA 2019, Chiang Mai, Thailand, April 22–25, 2019, Proceedings, Part I

Edited by: Prof. Guoliang Li, Jun Yang, Prof. Dr. Joao Gama, Dr. Juggapong Natwichai, Yongxin Tong

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science

About this book

This two-volume set LNCS 11446 and LNCS 11447 constitutes the refereed proceedings of the 24th International Conference on Database Systems for Advanced Applications, DASFAA 2019, held in Chiang Mai, Thailand, in April 2019.

The 92 full papers and 64 short papers were carefully selected from a total of 501 submissions. In addition, 13 demo papers and 6 tutorial papers are included. The full papers are organized in the following topics: big data; clustering and classification; crowdsourcing; data integration; embedding; graphs; knowledge graph; machine learning; privacy and graph; recommendation; social network; spatial; and spatio-temporal. The short papers, demo papers, and tutorial papers can be found in the volume LNCS 11448, which also includes the workshops of DASFAA 2019.

Table of Contents

Frontmatter

Big Data

Frontmatter
Accelerating Real-Time Tracking Applications over Big Data Stream with Constrained Space

Existing approaches are insufficient to provide real-time results for tracking applications against big and fast data streams. In this paper, we leverage the freshness-sensitive properties of tracking applications and propose an approximate query answering approach, called FS-Sketch, to accelerate real-time temporal queries over big data streams. FS-Sketch constructs its sketch over high-speed data streams via a composition of online sampling strategies, including sliding-window sampling and space-constrained sampling. Furthermore, FS-Sketch can dynamically compress its sketch into constrained space via a time-decay mechanism. We evaluate the performance of FS-Sketch using real-world and synthetic datasets. FS-Sketch can answer temporal queries over 1.4 billion records within 2 ms with accurate estimates. Meanwhile, FS-Sketch outperforms the state-of-the-art big data analytical system (Spark) by 5 orders of magnitude in response time when querying TB-scale real-world datasets.

Guangjun Wu, Xiaochun Yun, Shupeng Wang, Ge Fu, Chao Li, Yong Liu, Binbin Li, Yong Wang, Zhihui Zhao
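
To make the sampling idea above concrete, here is a minimal Python sketch of a bounded sample that combines reservoir-style space-constrained sampling with a sliding-window decay step. All class and method names are hypothetical illustrations, not the authors' implementation, and the decay here simply evicts expired records rather than re-compressing them into coarser summaries as FS-Sketch does.

```python
import random
import time

class DecayedWindowSample:
    """Hypothetical sketch: space-constrained reservoir sampling
    plus a sliding-window time-decay step."""

    def __init__(self, capacity=1024, window_sec=60):
        self.capacity = capacity
        self.window_sec = window_sec
        self.seen = 0
        self.sample = []          # list of (timestamp, record)

    def offer(self, record, ts=None):
        ts = ts if ts is not None else time.time()
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append((ts, record))
        else:
            # classic reservoir step keeps the sample uniform over the stream
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = (ts, record)

    def decay(self, now=None):
        # drop records older than the window; a real time-decayed sketch
        # would instead down-sample old windows into coarser summaries
        now = now if now is not None else time.time()
        self.sample = [(t, r) for t, r in self.sample
                       if now - t <= self.window_sec]
```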
A Frequency Scaling Based Performance Indicator Framework for Big Data Systems

It is important for big data systems to identify their performance bottlenecks. However, popular indicators such as resource utilization are often misleading and incomparable with each other. In this paper, a novel indicator framework which can directly compare the impact of different indicators with each other is proposed to identify and analyze performance bottlenecks efficiently. A methodology which constructs the indicator from the performance change under CPU frequency scaling is described. Spark is used as an example of a big data system, and two typical SQL benchmarks are used as the workloads to evaluate the proposed method. Experimental results show that the proposed method is accurate compared with the resource utilization method and easy to implement compared with the white-box method. Meanwhile, the analysis with our indicators leads to some interesting findings and valuable performance optimization suggestions for big data systems.

Chen Yang, Zhihui Du, Xiaofeng Meng, Yongjie Du, Zhiqiang Duan
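
The abstract does not give the indicator's exact formula, but one plausible reading is to normalize the observed slowdown under CPU frequency scaling against the ideal slowdown of a fully CPU-bound workload. The function below is a hypothetical illustration of that idea, not the paper's definition.

```python
def frequency_scaling_indicator(t_high, t_low, f_high, f_low):
    """Hypothetical indicator: how closely runtime tracks CPU frequency.

    A fully CPU-bound workload slows down in proportion to frequency
    (indicator ~1); a fully I/O-bound workload is insensitive to
    frequency (indicator ~0).  Illustrative only.
    """
    observed_slowdown = t_low / t_high     # e.g. 1.6x when scaled down
    ideal_slowdown = f_high / f_low        # e.g. 2.0x for 2.4 -> 1.2 GHz
    return (observed_slowdown - 1.0) / (ideal_slowdown - 1.0)

# example: runtime grows from 100 s to 160 s when scaling 2.4 GHz -> 1.2 GHz
print(frequency_scaling_indicator(100.0, 160.0, 2.4, 1.2))  # ~0.6: mixed workload
```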
A Time-Series Sockpuppet Detection Method for Dynamic Social Relationships

Multiple identity deception, known as sockpuppetry, is commonly used in online social media to spread rumours, publish hate speech, or evade censorship. Existing works make continual efforts to detect sockpuppets based on verbal, non-verbal, or network-structure features. Network structure has attracted much attention, while the time-series dynamics of sockpuppet networks have not been considered. We observe that, after being blocked, a puppetmaster tends to recover previous social relationships as soon as possible to maintain propagation influence. The earlier a relationship is recovered, the more important it is. To take advantage of this dynamic nature, a time-series sockpuppet detection method is proposed. We first design a weight representation method to record the dynamic growth of a sockpuppet's social relationships and then transform sockpuppet detection into a similarity-based time-series analysis problem. Experiments on two real-world datasets from Sina Weibo demonstrate that our method obtains excellent detection performance, significantly outperforming previous methods.

Wei Zhou, Jingli Wang, Junyu Lin, Jiacheng Li, Jizhong Han, Songlin Hu
Accelerating Hybrid Transactional/Analytical Processing Using Consistent Dual-Snapshot

To efficiently deal with OLTP and OLAP workloads simultaneously, the conventional method is to deploy two separate processing systems, i.e., a transaction-friendly OLTP system and an analytics-optimized OLAP system. To maintain the freshness of the data, extra ETL tools are required to propagate data from the OLTP to the OLAP system. However, low-speed ETL processing is the bottleneck in those business decision support systems. As a result, there has been tremendous interest in developing hybrid transactional/analytical processing (HTAP) systems. This paper proposes a wait-free HTAP (WHTAP) architecture that can serve both OLTP and OLAP requests efficiently in a wait-free manner. We develop and evaluate a prototype WHTAP system. Our experiments show that the system achieves OLTP performance similar to the TicToc system while delivering a four- to six-fold acceleration in analytical processing.

Liang Li, Gang Wu, Guoren Wang, Ye Yuan
HSDS: An Abstractive Model for Automatic Survey Generation

Automatic survey generation for a specific research area can quickly give researchers an overview and help them recognize the technical development trend of that area. As far as we know, the study most relevant to automatic survey generation is the task of automatic related work generation. Almost all existing methods of automatic related work generation extract the important sentences from multiple relevant papers to assemble a related work section. However, the extractive methods are far from satisfactory because of poor coherence and readability. In this paper, we propose a novel abstractive method named Hierarchical Seq2seq model based on Dual Supervision (HSDS) to solve the problems above. Given multiple scientific papers in the same research area as input, the model aims to generate a corresponding survey. Furthermore, we build a large dataset to train and evaluate the HSDS model. Extensive experiments demonstrate that our proposed model performs better than the state-of-the-art baselines.

Xiao-Jian Jiang, Xian-Ling Mao, Bo-Si Feng, Xiaochi Wei, Bin-Bin Bian, Heyan Huang
PU-Shapelets: Towards Pattern-Based Positive Unlabeled Classification of Time Series

Real-world time series classification applications often involve positive unlabeled (PU) training data, where there are only a small set PL of positively labeled examples and a large set U of unlabeled ones. Most existing time series PU classification methods utilize all readings in the time series, making them sensitive to non-characteristic readings. Characteristic patterns named shapelets present a promising solution to this problem, yet discovering shapelets under PU settings is not easy. In this paper, we take on the challenging task of shapelet discovery with PU data. We propose a novel pattern ensemble technique utilizing both characteristic and non-characteristic patterns to rank U examples by their likelihood of being positive. We also present a novel stopping criterion to estimate the number of positive examples in U. These enable us to effectively label all U training examples and conduct supervised shapelet discovery. The shapelets are then used to build a one-nearest-neighbor classifier for online classification. Extensive experiments demonstrate the effectiveness of our method.

Shen Liang, Yanchun Zhang, Jiangang Ma
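
As an illustration of the ranking step described above, the sketch below scores unlabeled series by their minimum Euclidean distance to a single characteristic shapelet (the standard shapelet distance); the paper's ensemble of characteristic and non-characteristic patterns and its stopping criterion are omitted. Function names are hypothetical.

```python
import numpy as np

def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance between the shapelet and any
    equal-length subsequence of the series (standard shapelet distance)."""
    m, l = len(series), len(shapelet)
    return min(np.linalg.norm(series[i:i + l] - shapelet)
               for i in range(m - l + 1))

def rank_unlabeled(U, shapelet):
    """Rank unlabeled series by similarity to a characteristic pattern:
    smaller distance = more likely positive.  One ingredient of
    PU-Shapelets; the pattern ensemble is omitted in this sketch."""
    shapelet = np.asarray(shapelet, float)
    scored = [(shapelet_distance(np.asarray(s, float), shapelet), i)
              for i, s in enumerate(U)]
    return [i for _, i in sorted(scored)]

U = [[0, 0, 1, 2, 1, 0], [5, 5, 5, 5, 5, 5], [0, 1, 2, 1, 0, 0]]
print(rank_unlabeled(U, [1, 2, 1]))  # series containing the bump rank first
```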

Clustering and Classification

Frontmatter
Discovering Relationship Patterns Among Associated Temporal Event Sequences

Sequential data mining is prevalent in many real-world applications, such as gene sequence analysis, consumer shopping log analysis, social networking analysis, and banking transaction analysis. Contrast sequence data mining is useful in describing the differences between two sets (classes) of sequences. However, in prior studies, little work has been done on mining patterns from sequences formed by associated temporal events, where relationships in chronological order exist between any two events in a sequence. To fill this gap, we consider the problem of mining associated temporal relationship patterns (ATRPs) and propose a method, called ATTEND (AssociaTed Temporal rElationship patterN Discovery), to discover ATRPs with the top contrast measure from two sets of associated temporal event sequences. Moreover, we design several heuristic strategies to improve the efficiency of ATTEND. Experiments on both real and synthetic data demonstrate that ATTEND is effective and efficient.

Chao Han, Lei Duan, Zhangxi Lin, Ruiqi Qin, Peng Zhang, Jyrki Nummenmaa
Efficient Mining of Event Periodicity in Data Series

This paper investigates the problem of efficiently discovering the periodicity of a certain event in data series. To that end, this work first argues that the periodicity of an event in data series may be formalized as the distribution period, the structure period, or both. Along this line, a partition method, $\pi(n)$, is proposed to divide the data series into equal-length, position-continuous segments. Based on the results of applying $\pi(n)$ to a data series, we propose the two new concepts of distribution periodicity and structure periodicity. Then, a cross-entropy-based method, namely CEPD, is proposed to mine the periodicity of data series. The experimental results show that CEPD can be used to mine feasible event periodicity in data series, with very low time consumption and strong noise resilience.

Hua Yuan, Yu Qian, Mengna Bai
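
A minimal sketch of the two building blocks named above: the partition method $\pi(n)$ and a cross-entropy comparison of the resulting segments. The periodicity score at the end is a hypothetical way of combining them (low average cross-entropy between consecutive segment distributions suggests a distribution period of n segments); the paper's actual scoring may differ.

```python
import numpy as np

def partition(series, n):
    """pi(n): split the series into n equal-length, position-continuous
    segments (any trailing remainder is dropped)."""
    seg_len = len(series) // n
    return [series[i * seg_len:(i + 1) * seg_len] for i in range(n)]

def segment_histogram(segment, bins, rng):
    hist, _ = np.histogram(segment, bins=bins, range=rng)
    return hist / max(hist.sum(), 1)

def cross_entropy(p, q, eps=1e-12):
    return -float(np.sum(np.asarray(p) * np.log(np.asarray(q) + eps)))

def periodicity_score(series, n, bins=16):
    """Hypothetical score: average cross-entropy between consecutive
    segment distributions under pi(n); lower = more periodic."""
    rng = (min(series), max(series))
    hists = [segment_histogram(s, bins, rng) for s in partition(series, n)]
    return np.mean([cross_entropy(hists[i], hists[i + 1])
                    for i in range(len(hists) - 1)])

series = list(range(10)) * 8           # repeats every 10 values
print(periodicity_score(series, 8))    # low: segments look alike
print(periodicity_score(series, 3))    # higher: misaligned partition
```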
EPPADS: An Enhanced Phase-Based Performance-Aware Dynamic Scheduler for High Job Execution Performance in Large Scale Clusters

The way in which jobs are scheduled is critical to achieving high job processing performance in large-scale data clusters. Most existing scheduling mechanisms employ a First-In First-Out, serialized approach combined with straggler-hunting techniques that launch speculative tasks after detecting slow ones. This is often achieved through the instrumentation of processing nodes. Such node instrumentation incurs frequent communication overheads as the number of processing nodes increases. Moreover, the sequential scheduling of job tasks and the straggler-hunting approach fall short of optimal performance, as they respectively increase job waiting time in the queue and delay the speculative execution of straggling tasks. In this paper, we propose an Enhanced Phase-based Performance-Aware Dynamic Scheduler (EPPADS), which schedules job tasks without additional instrumentation modules. EPPADS uses a two-stage scheduling approach: the slow-start phase (SSP) and the accelerate phase (AccP). The SSP schedules the initial tasks in the queue in the normal FIFO way and records the initial execution times of the processing nodes. The AccP uses the initial execution times to compute each processing node's distribution ratio for the remaining tasks and schedules them using a single scheduling I/O. We implement the EPPADS scheduler in Hadoop's MapReduce framework. Our evaluation shows that EPPADS achieves a 30% performance improvement over the FIFO scheduler and a 22% improvement over an existing dynamic scheduling approach that uses node instrumentation.

Prince Hamandawana, Ronnie Mativenga, Se Jin Kwon, Tae-Sun Chung
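
The AccP stage can be illustrated with a small sketch: assuming each node's share of the remaining tasks is inversely proportional to its measured SSP execution time, the ratios and the resulting one-shot assignment look roughly as follows. The exact ratio formula is an assumption for illustration, not taken from the paper.

```python
def distribution_ratios(initial_times):
    """AccP-style ratios (sketch): a node that ran its probe task twice
    as fast receives roughly twice as many remaining tasks.
    `initial_times` maps node -> execution time of its first (SSP) task."""
    speeds = {node: 1.0 / t for node, t in initial_times.items()}
    total = sum(speeds.values())
    return {node: s / total for node, s in speeds.items()}

def assign_remaining(tasks, initial_times):
    ratios = distribution_ratios(initial_times)
    plan, start = {}, 0
    for node, r in ratios.items():
        count = round(r * len(tasks))
        plan[node] = tasks[start:start + count]
        start += count
    # any rounding leftovers go to the fastest node
    fastest = min(initial_times, key=initial_times.get)
    plan[fastest].extend(tasks[start:])
    return plan

# node n1 ran its probe twice as fast, so it gets about half the tasks
print(assign_remaining(list(range(10)), {"n1": 2.0, "n2": 4.0, "n3": 4.0}))
```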
Incremental Discovery of Order Dependencies on Tuple Insertions

Order dependencies (ODs) have recently been proposed to describe a relationship of ordering between lists of attributes. It is typically too costly to design ODs manually, since the number of possible ODs is factorial in the number of attributes. To this end, automatic discovery techniques for ODs have been developed. In practice, data is frequently updated, especially with tuple insertions. Existing techniques do not lend themselves well to these situations, since it is prohibitively expensive to recompute all ODs from scratch after every update. In this paper, we make a first effort to investigate incremental OD discovery techniques in response to tuple insertions. Given a relation D, a set $\varSigma$ of valid and minimal ODs on D, and a set $\triangle D$ of tuple insertions to D, the aim is to find changes $\triangle\varSigma$ to $\varSigma$ that make $\varSigma \oplus \triangle\varSigma$ a set of valid and minimal ODs on $D + \triangle D$. Note that $\triangle\varSigma$ contains both new ODs to be added to $\varSigma$ and outdated ODs to be removed from $\varSigma$. Specifically, (1) we formalize the incremental OD discovery problem. Although the incremental discovery problem has the same complexity as its batch (non-incremental) counterpart in terms of traditional complexity, we show that it has good data locality: it is linear in the size of $\triangle D$ to validate on $D + \triangle D$ any OD $\varphi$ that is valid on D. (2) We present effective incremental OD discovery techniques, leveraging an intelligent traversal strategy for finding $\triangle\varSigma$ and carefully chosen indexes to minimize access to D. Our approach computes $\triangle\varSigma$ based on the ODs in $\varSigma$ and is independent of the size of D. (3) Using real-life data, we experimentally verify that our approach outperforms its batch counterpart by orders of magnitude.

Lin Zhu, Xu Sun, Zijing Tan, Kejia Yang, Weidong Yang, Xiangdong Zhou, Yingjie Tian
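
For intuition, a minimal check of the kind of order dependency being discovered: sorting the relation by attribute list X must also sort it by Y. The toy example shows how a single tuple insertion can invalidate an OD, which is why re-validating only the neighborhood of inserted tuples suffices. This is a simplified lexicographic variant for illustration, not the paper's exact OD semantics.

```python
def od_holds(rows, X, Y):
    """Check the order dependency X ~> Y: ordering the relation by the
    attribute list X also orders it by Y (simplified lexicographic check)."""
    ordered = sorted(rows, key=lambda r: tuple(r[a] for a in X))
    ys = [tuple(r[a] for a in Y) for r in ordered]
    return all(ys[i] <= ys[i + 1] for i in range(len(ys) - 1))

rows = [{"price": 10, "tax": 1}, {"price": 20, "tax": 2},
        {"price": 30, "tax": 3}]
print(od_holds(rows, ["price"], ["tax"]))   # True: price orders tax
rows.append({"price": 25, "tax": 1})        # a tuple insertion...
print(od_holds(rows, ["price"], ["tax"]))   # False: the OD is invalidated
```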
Multi-view Spectral Clustering via Multi-view Weighted Consensus and Matrix-Decomposition Based Discretization

In recent years, multi-view clustering has been widely used in many areas. As an important category of multi-view clustering, multi-view spectral clustering has recently shown promising advantages in partitioning clusters of arbitrary shapes. Despite significant success, there are still two challenging issues in multi-view spectral clustering, i.e., (i) how to learn a similarity matrix for multiple weighted views and (ii) how to learn a robust discrete clustering result from the (continuous) eigenvector domain. To simultaneously tackle these two issues, this paper proposes a unified spectral clustering approach based on multi-view weighted consensus and matrix-decomposition based discretization. In particular, a multi-view consensus similarity matrix is first learned with the different views weighted w.r.t. their confidence. Then eigen-decomposition is performed on the similarity matrix and a set of c eigenvectors is obtained. From the eigenvectors, we first learn a continuous cluster label and then discretize it to build the final clustering label, which avoids the potential instability of the conventional k-means discretization. Extensive experiments have been conducted on multiple multi-view datasets to validate the superiority of our proposed approach.

Man-Sheng Chen, Ling Huang, Chang-Dong Wang, Dong Huang
SIRCS: Slope-intercept-residual Compression by Correlation Sequencing for Multi-stream High Variation Data

Multi-stream data with high variation is ubiquitous in modern network systems. With the development of telecommunication technologies, robust data compression techniques are urgently needed. In this paper, we introduce a novel technique specifically for high-variation signal data: SIRCS, which applies a linear regression model to decompose multiple data streams into slope, intercept, and residual components, and combines it with advanced tree-mapping techniques. SIRCS inherits the advantages of existing grouping compression algorithms, like GAMPS. With the newly designed correlation-sorting technique, correlation tree mapping, SIRCS improves the compression ratio by 13% over the traditional clustering mapping scheme. The linear model decomposition further improves performance over state-of-the-art algorithms, with RMSE decreased by 4% and compression time dramatically reduced compared to GAMPS. Across a wide range of error tolerances, from 1% to 27%, SIRCS performs consistently better than all evaluated state-of-the-art algorithms regarding compression efficiency and accuracy.

Zixin Ye, Wen Hua, Liwei Wang, Xiaofang Zhou
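
The slope-intercept-residual decomposition at the heart of SIRCS can be sketched in a few lines: regress each stream against a correlated reference stream and keep the slope, intercept, and (small, highly compressible) residual. The correlation-tree sequencing that picks reference streams is omitted; names here are illustrative.

```python
import numpy as np

def sir_decompose(stream, reference):
    """Slope-intercept-residual decomposition (sketch): fit
    stream ~ a * reference + b and keep (a, b, residual).  Correlated
    streams leave small residuals, which compress well under an
    error bound."""
    x, y = np.asarray(reference, float), np.asarray(stream, float)
    a, b = np.polyfit(x, y, 1)        # least-squares line fit
    residual = y - (a * x + b)
    return a, b, residual

ref = np.linspace(0.0, 10.0, 100)
sig = 3.0 * ref + 2.0 + np.random.normal(0.0, 0.05, 100)
a, b, res = sir_decompose(sig, ref)
print(round(a, 2), round(b, 2), float(np.abs(res).max()))  # ~3.0 2.0, tiny residual
```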

Crowdsourcing

Frontmatter
Fast Quorum-Based Log Replication and Replay for Fast Databases

Modern In-Memory Databases (IMDBs) can support highly concurrent OLTP workloads and generate massive transactional logs per second. Quorum-based replication protocols such as Paxos or Raft have been widely used in distributed databases. However, it is non-trivial to replicate an IMDB because the high transaction rate brings new challenges. First, the leader node in quorum replication should adapt to various transaction arrival rates and the processing capability of follower nodes. Second, followers are required to replay logs to catch up with the state of the leader in highly concurrent settings to reduce the visibility gap. To this end, we built QuorumX, an efficient quorum-based replication framework for IMDBs under heavy OLTP workloads. QuorumX combines critical-path-based batching and pipeline batching to provide an adaptive log propagation scheme that achieves stable, high performance under various settings. Further, we propose a safe and coordination-free log replay scheme to minimize the visibility gap between the leader and follower IMDBs. Our evaluation results with the YCSB and TPC-C benchmarks demonstrate that QuorumX achieves performance close to asynchronous primary-backup replication without sacrificing data consistency and availability.

Donghui Wang, Peng Cai, Weining Qian, Aoying Zhou
PDCS: A Privacy-Preserving Distinct Counting Scheme for Mobile Sensing

Mobile sensing mines group information through sensing and aggregating users' data. Among major mobile sensing applications, the distinct counting problem, which aims to find the number of distinct elements in a data stream with repeated elements, is extremely important for avoiding waste of resources. Besides, the privacy protection of users is also a critical issue for aggregation security. However, it is a challenge to meet these two requirements simultaneously, since normal privacy-preserving methods would negatively influence the accuracy and efficiency of distinct counting. In this paper, we propose a Privacy-preserving Distinct Counting Scheme (PDCS) for mobile sensing. By integrating the basic idea of homomorphic encryption into the Flajolet-Martin (FM) sketch, PDCS allows an aggregator to conduct distinct counting over large-scale data sets without learning users' private data. Moreover, PDCS supports various forms of sensing data, including camera images, location data, etc. PDCS expands each bit of the hash values of users' original data, so the FM sketch can be encrypted to protect users' privacy. We prove the security of PDCS under the known-plaintext model. The theoretical and experimental results show that PDCS achieves high counting accuracy and practical efficiency, with scalability over large-scale data sets.

Xiaochen Yang, Ming Xu, Shaojing Fu, Yuchuan Luo
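
For readers unfamiliar with the Flajolet-Martin sketch that PDCS builds on, here is its plaintext core in Python. The homomorphic-encryption layer that PDCS adds over each bit is omitted; note also that a single FM sketch has high variance, which practical systems reduce by averaging many independent sketches.

```python
import hashlib

class FMSketch:
    """Flajolet-Martin sketch for distinct counting (plaintext core of
    the scheme; the per-bit encryption layer of PDCS is omitted)."""

    def __init__(self):
        self.bitmap = 0

    @staticmethod
    def _rho(h):
        # index of the least-significant set bit of the hash value
        return (h & -h).bit_length() - 1 if h else 0

    def add(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        self.bitmap |= 1 << self._rho(h)

    def estimate(self):
        # R = index of lowest unset bit; E[distinct] ~ 2^R / 0.77351
        r = 0
        while (self.bitmap >> r) & 1:
            r += 1
        return 2 ** r / 0.77351

fm = FMSketch()
for i in range(1000):
    fm.add(i % 300)            # 300 distinct elements with repetition
print(fm.estimate())           # rough estimate near 300 (high variance)
```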
Reinforced Reliable Worker Selection for Spatial Crowdsensing Networks

Spatial crowdsensing networks restrict sensing tasks to specific locations where workers must collect data. Due to the lack of a priori information about the quality of workers, guaranteeing the quality of sensing tasks remains a key challenge. In this paper, we model the quality of workers through two factors, namely bias and variance, which capture the continuous-valued nature of sensing tasks. After calibrating the bias, we iteratively estimate worker variances with increasing accuracy; meanwhile, we must select more reliable workers with low variances to finish the sensing tasks. This is a classic exploration-exploitation dilemma. To overcome it, we design a novel Multi-Armed Bandit (MAB) algorithm based on the Upper Confidence Bound (UCB) scheme, combined with a weighted data aggregation scheme to calculate a better ground truth for each sensing task. We then prove that the expected sensing error can be bounded according to the regret bound of the MAB in our setting. In simulation experiments, we use a real-world dataset to validate the theoretical results of our algorithm, and it significantly outperforms two baselines in different settings.

Yang Wang, Junwei Lu, Jingxiao Chen, Xiaofeng Gao, Guihai Chen
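
A minimal sketch of UCB-style worker selection under the variance model described above: workers with low estimated variance are preferred, with an exploration bonus for rarely observed workers. The scoring formula is a generic UCB construction for illustration, not the paper's exact algorithm.

```python
import math

def select_workers(history, workers, k, t):
    """UCB-flavoured selection (sketch): prefer workers whose estimated
    variance is low, with an exploration bonus that shrinks as a worker
    accumulates observations.  `history[w]` lists squared errors seen
    for worker w; `t` is the current round."""
    scores = {}
    for w in workers:
        obs = history.get(w, [])
        if not obs:
            scores[w] = float("-inf")     # never tried: explore first
        else:
            var_hat = sum(obs) / len(obs)
            bonus = math.sqrt(2.0 * math.log(t) / len(obs))
            scores[w] = var_hat - bonus   # lower score = more attractive
    return sorted(workers, key=lambda w: scores[w])[:k]

history = {"w1": [0.1, 0.2], "w2": [0.9, 1.1], "w3": []}
print(select_workers(history, ["w1", "w2", "w3"], k=2, t=10))  # w3 explored, w1 exploited
```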
SeqST-ResNet: A Sequential Spatial Temporal ResNet for Task Prediction in Spatial Crowdsourcing

Task appearance prediction has great potential to improve task assignment in spatial crowdsourcing platforms. The main challenge of this prediction problem is to model the spatial dependency among neighboring regions and the temporal dependency at different time scales (e.g., hourly, daily, and weekly). A recent model, ST-ResNet, predicts traffic flow by capturing the spatial and temporal dependencies in historical data. However, its data fragments are concatenated as one tensor fed to the deep neural networks, rather than learning the temporal dependencies in a sequential manner. We propose a novel deep learning model, called SeqST-ResNet, which captures the temporal dependencies of historical task appearance in sequences at several time scales. We validate the effectiveness of our model via experiments on a real-world dataset. The experimental results show that our SeqST-ResNet model significantly outperforms ST-ResNet when predicting tasks at hourly intervals, on weekdays and weekends, and, more importantly, in regions with intensive task requests.

Dongjun Zhai, An Liu, Shicheng Chen, Zhixu Li, Xiangliang Zhang
Towards Robust Arbitrarily Oriented Subspace Clustering

Clustering high-dimensional data is challenging since meaningful clusters usually hide in arbitrarily oriented subspaces, and classical clustering algorithms like k-means tend to fail in such cases. Subspace clustering has thus attracted growing attention in the last decade, and many algorithms have been proposed, such as ORCLUS and 4C. However, existing approaches are usually sensitive to global and/or local noisy points, and overlapping subspace clusters have been little explored. Moreover, these approaches usually involve an exhaustive local search for correlated points or subspaces, which is infeasible in some cases. To deal with these problems, in this paper, we introduce a new subspace clustering algorithm called RAOSC, which formulates Robust Arbitrarily Oriented Subspace Clustering as a group-structure low-rank optimization problem. RAOSC is able to recover subspace clusters from a sea of noise, while noise and overlapping points can be naturally identified during the optimization process. Unlike existing low-rank based subspace clustering methods, RAOSC can explicitly produce the subspaces of clusters without any prior knowledge of subspace dimensionality. Furthermore, RAOSC does not need a post-processing procedure to obtain the clustering result. Extensive experiments on both synthetic and real-world data sets have demonstrated that RAOSC yields high-quality clusterings and outperforms many state-of-the-art algorithms.

Zhong Zhang, Chongming Gao, Chongzhi Liu, Qinli Yang, Junming Shao
Truthful Crowdsensed Data Trading Based on Reverse Auction and Blockchain

Crowdsensed Data Trading (CDT) is a novel data trading paradigm in which each data consumer can publicize its demand as crowdsensing tasks, and mobile users can compete for these tasks, collect the corresponding data, and sell the results to the consumers. Existing CDT systems either depend on a trusted data trading broker or cannot ensure that sellers report costs honestly. To address this problem, we propose a Reverse-Auction-and-blockchain-based crowdsensed Data Trading (RADT) system, mainly containing a smart contract called RADToken. We adopt a greedy strategy to determine winners and prove the truthfulness and individual rationality of the whole reverse auction process. Moreover, we exploit the smart contract with a series of designs to force mutually untrusted parties to participate in the data trading honestly. Additionally, we deploy RADToken on an Ethereum test network to demonstrate its performance. To the best of our knowledge, this is the first CDT work that exploits both auctions and blockchain to ensure the truthfulness of the whole data trading process.

Baoyi An, Mingjun Xiao, An Liu, Guoju Gao, Hui Zhao

Data Integration

Frontmatter
Selective Matrix Factorization for Multi-relational Data Fusion

Matrix factorization based data fusion solutions can account for the intrinsic structures of multi-relational data sources, but most solutions treat these sources equally or prefer sparse ones, which may be irrelevant for the target task. In this paper, we introduce a Selective Matrix Factorization based Data Fusion approach (SelMFDF) to collaboratively factorize multiple inter-relational data matrices into low-rank representation matrices of the respective object types and optimize their weights. To avoid a preference for sparse data matrices, it additionally regularizes these low-rank matrices by approximating them to multiple intra-relational data matrices, and optimizes their weights as well. Both sets of weights contribute to automatically integrating relevant data sources. Finally, it reconstructs the target relational data matrix using the optimized low-rank matrices. We applied SelMFDF to predicting inter-relations (lncRNA-miRNA interactions, functional annotations of proteins) and intra-relations (protein-protein interactions). SelMFDF achieves an AUROC (area under the receiver operating characteristic curve) higher by at least 5.88% and an AUPRC (area under the precision-recall curve) larger by at least 18.23% than other related and competitive approaches. The empirical study also confirms that SelMFDF can not only differentially integrate these relational data matrices but also has no preference toward sparse ones.

Yuehui Wang, Guoxian Yu, Carlotta Domeniconi, Jun Wang, Xiangliang Zhang, Maozu Guo
Selectivity Estimation on Set Containment Search

In this paper, we study the problem of selectivity estimation on set containment search. Given a query record Q and a record dataset $\mathcal{S}$, we aim to accurately and efficiently estimate the selectivity of the set containment search of query Q over $\mathcal{S}$. The problem has many important applications in commercial fields and scientific studies. To the best of our knowledge, this is the first work to study this important problem. We first extend existing distinct-value estimating techniques to solve this problem and develop an inverted list and G-KMV sketch based approach, IL-GKMV. We show analytically that the performance of IL-GKMV degrades with the increase of vocabulary size. Motivated by the limitations of existing techniques and the inherent challenges of the problem, we resort to developing effective and efficient sampling approaches and propose an ordered trie structure based sampling approach named OT-Sampling. OT-Sampling partitions records based on element frequency and occurrence patterns and is significantly more accurate compared with the simple random sampling method and IL-GKMV. To further enhance performance, a divide-and-conquer based sampling approach, DC-Sampling, is presented with an inclusion/exclusion prefix to explore pruning opportunities. We theoretically analyse the proposed techniques regarding various accuracy estimators. Our comprehensive experiments on 6 real datasets verify the effectiveness and efficiency of our proposed techniques.

Yang Yang, Wenjie Zhang, Ying Zhang, Xuemin Lin, Liping Wang
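
As a baseline for the sampling approaches discussed above, the sketch below estimates containment selectivity by uniform random sampling: draw records and report the fraction that contain the query set. OT-Sampling's frequency-based partitioning, which reduces the variance of this estimator, is omitted.

```python
import random

def estimate_containment_selectivity(query, dataset, sample_size, seed=0):
    """Baseline estimator (sketch): sample records uniformly and report
    the fraction that contain the query set.  The paper's OT-Sampling
    stratifies records by element frequency to cut variance; plain
    random sampling is shown here as the reference point."""
    q = set(query)
    rng = random.Random(seed)
    sample = rng.sample(dataset, min(sample_size, len(dataset)))
    hits = sum(1 for rec in sample if q <= set(rec))
    return hits / len(sample)

data = [{1, 2, 3}, {2, 3}, {1, 3, 4}, {2, 3, 5}] * 250   # 1000 records
print(estimate_containment_selectivity({2, 3}, data, sample_size=100))  # ~0.75
```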
Typicality-Based Across-Time Mapping of Entity Sets in Document Archives

News archives constitute a rich source of knowledge about past societies. In order to effectively utilize such large and diverse accounts of the past, novel approaches need to be proposed. One of them is the comparison of past and present entities, which can lay the grounds for better comprehending both the past and the present, and can support forecasting techniques. In this paper, we propose a novel research task of automatically generating across-time comparable entity pairs given two sets of entities, and we introduce an effective method to solve this task. The proposed model first applies the idea of typicality analysis to measure the representativeness of each entity. Then, it learns an orthogonal transformation between temporally distant entity collections. Finally, it generates a set of typical comparables based on a concise integer linear programming framework. We experimentally demonstrate the effectiveness of our method on the New York Times corpora through both qualitative and quantitative tests.

Yijun Duan, Adam Jatowt, Sourav S. Bhowmick, Masatoshi Yoshikawa
Unsupervised Entity Alignment Using Attribute Triples and Relation Triples

Entity alignment aims to find entities referring to the same real-world object across different knowledge graphs (KGs). Most existing works utilize the relations between entities contained in relation triples with embedding-based approaches, but require a large amount of training data. Some recent attempts use the types of attributes in attribute triples to measure the similarity between entities across KGs. However, due to the diverse expressions of attribute names and non-standard attribute values across different KGs, the information contained in attribute triples cannot be fully used. To tackle the drawbacks of the existing efforts, we propose a novel unsupervised entity alignment approach using both attribute triples and relation triples of KGs. Initially, we propose an interactive model that uses attribute triples by performing entity alignment and attribute alignment alternately, which generates many high-quality aligned entity pairs. We then use these aligned entity pairs to train a relation embedding model so that we can use relation triples to further align the remaining entities. Lastly, we utilize a bivariate regression model to learn the respective weights of the similarities measured from the two aspects and combine the results. Our empirical study on several real-world datasets shows that our proposed method achieves significant improvements in entity alignment compared with state-of-the-art methods.

Fuzhen He, Zhixu Li, Yang Qiang, An Liu, Guanfeng Liu, Pengpeng Zhao, Lei Zhao, Min Zhang, Zhigang Chen
Combining Meta-Graph and Attention for Recommendation over Heterogenous Information Network

Recently, heterogeneous information networks (HINs) have gained wide attention in recommender systems due to their flexibility in modeling rich objects and complex relationships. It is still challenging for HIN-based recommenders to capture high-level structure and to effectively fuse the mined features of users and items. In this paper, we propose an approach for recommendation over HINs, called MGAR, which combines Meta-Graphs and Attention to address this challenge. Informally speaking, meta-graphs are applied to feature extraction so as to capture more semantic information, while the attention mechanism is used to fuse the features arising from different meta-graphs. MGAR can be divided into two stages. In the first stage, we apply the matrix factorization technique to generate latent factors based on predefined meta-graphs. In the second stage, the embeddings of users and items are fused with the neural attention mechanism, and then a deep neural network is employed to make recommendations by modeling complicated interactions. Experiments on two real datasets indicate that MGAR achieves state-of-the-art performance.

Chenfei Zhao, Hengliang Wang, Yuan Li, Kedian Mu
Efficient Search of the Most Cohesive Co-located Community in Attributed Networks

Attributed networks are used to model various networks, such as social networks, knowledge graphs, and protein-protein interactions. Such networks are associated with rich attributes such as spatial locations (e.g., check-ins from social network users and positions of proteins). Community search in attributed networks has been intensively studied recently due to its wide applications in recommendation, marketing, biology, etc. In this paper, we study the problem of searching for the most cohesive co-located community ($\mathrm{MC}^3$), which returns communities that satisfy the following two properties: (i) structural cohesiveness: members in the community are connected the most intensively; (ii) spatial co-location: members are close to each other. The problem can be used for social network user behavior analysis, recommendation, disease prediction, etc. We first propose an index structure called the DkQ-tree to integrate the spatial information and the local structure information together to accelerate query processing. Then, based on this index structure, we develop two efficient algorithms. Extensive experiments conducted on both real and synthetic datasets demonstrate the efficiency and effectiveness of the proposed methods.

Jiehuan Luo, Xin Cao, Qiang Qu, Yaqiong Liu

Embedding

Frontmatter
A Weighted Word Embedding Model for Text Classification

Neural bag-of-words (NBOW) models have achieved great success in text classification. They compute a sentence or document representation by mathematical operations such as simply adding and averaging the word embeddings of each sequence element. Thus, NBOW models have few parameters and require low computation cost. Intuitively, considering the importance of each word and the word-order information is beneficial for obtaining an informative sentence or document representation. However, NBOW models hardly consider these two factors when generating a sentence or document representation. Meanwhile, term weighting schemes that assign relatively high weights to important words have exhibited successful performance in traditional bag-of-words models, yet they are still seldom used in neural models. In addition, n-grams capture word-order information in short contexts. In this paper, we propose a weighted word embedding model (WWEM), a variant of the NBOW model that introduces term weighting schemes and n-grams. Our model generates informative sentence or document representations that consider both the importance of words and word-order information. We compare our proposed model with other popular neural models on five datasets in text classification. The experimental results show that our proposed model exhibits comparable or even superior performance.

Haopeng Ren, ZeQuan Zeng, Yi Cai, Qing Du, Qing Li, Haoran Xie
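
The weighted-averaging idea can be sketched as follows, using TF-IDF as the term weighting scheme over a toy embedding table. The embedding and IDF dictionaries are hypothetical inputs, and the paper's n-gram component for word order is omitted.

```python
import numpy as np

def weighted_embedding(tokens, embeddings, idf):
    """WWEM-style document vector (sketch): a term-weighted average of
    word embeddings, here weighted by TF-IDF.  `embeddings` maps word
    -> vector and `idf` maps word -> inverse document frequency."""
    tf = {w: tokens.count(w) / len(tokens) for w in set(tokens)}
    vecs, weights = [], []
    for w in tf:
        if w in embeddings:
            vecs.append(embeddings[w])
            weights.append(tf[w] * idf.get(w, 1.0))
    if not vecs:
        return None
    return np.average(np.stack(vecs), axis=0, weights=np.asarray(weights))

emb = {"good": np.array([1.0, 0.0]), "movie": np.array([0.0, 1.0])}
idf = {"good": 2.0, "movie": 0.5}
print(weighted_embedding(["good", "good", "movie"], emb, idf))
# the frequent, high-IDF word "good" dominates the document vector
```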
Bipartite Network Embedding via Effective Integration of Explicit and Implicit Relations

Network representation learning, or network embedding, aims at mapping the nodes of a network to a low-dimensional vector space, in which the learned node representations can be used for a variety of tasks, such as node classification, link prediction, and visualization. As a special class of complex networks, a bipartite network is composed of two different types of nodes, with links existing only between nodes of different types, and has important applications in recommender systems, link prediction, and disease diagnosis. However, most existing methods for network representation learning are aimed at homogeneous networks in general, and the special properties of bipartite networks are not taken into account, such as the implicit relations (i.e., unobserved links) between nodes of the same type. In this paper, we propose a novel deep learning framework for bipartite networks, which integrates the explicit and implicit relations while preserving the local and global structure, to learn highly non-linear representations of nodes. Extensive experiments conducted on several real-world datasets, covering link prediction, recommendation, and visualization, demonstrate the effectiveness of our proposed method compared with state-of-the-art network representation learning methods.

Yaping Wang, Pengfei Jiao, Wenjun Wang, Chunyu Lu, Hongtao Liu, Bo Wang
Enhancing Network Embedding with Implicit Clustering

Network embedding aims at learning low-dimensional representations of nodes. These representations can be widely used for network mining tasks, such as link prediction, anomaly detection, and classification. Recently, a great deal of meaningful research has been carried out on this emerging network analysis paradigm. Real-world networks contain clusters of different sizes because their edges carry different relationship types. These clusters also reflect features of nodes, which can contribute to optimizing the feature representation of nodes. However, existing network embedding methods do not distinguish these relationship types. In this paper, we propose an unsupervised network representation learning model that can encode edge relationship information. Firstly, an objective function is defined, which learns the edge vectors by implicit clustering. Then, a biased random walk is designed to generate a series of node sequences, which are fed into Skip-Gram to learn the low-dimensional node representations. Extensive experiments are conducted on several network datasets. Compared with the state-of-the-art baselines, the proposed method achieves favorable and stable results in multi-label classification and link prediction tasks.

Qi Li, Jiang Zhong, Qing Li, Zehong Cao, Chen Wang
MDAL: Multi-task Dual Attention LSTM Model for Semi-supervised Network Embedding

In recent years, both the academic and commercial communities have paid great attention to embedding methods for analyzing all kinds of network data. Despite the great successes of DeepWalk and the neural models that followed, only a few of them can incorporate contents and labels into the low-dimensional representation vectors of nodes. Besides, most network embedding methods only consider universal representations, so optimal representations cannot be learned for specific tasks. In this paper, we propose a Multi-task Dual Attention LSTM model (dubbed MDAL), which can capture the structure, content, and label information of a network and simultaneously adjust representation vectors according to the concrete downstream task. For the target node, MDAL leverages a Tree-LSTM structure to extract structure, text, and label information from its neighborhood. With the help of a dual attention mechanism, the content-related and label-related neighbor nodes are emphasized during embedding. MDAL utilizes a multi-task learning framework that considers both network embedding and downstream tasks. Appropriate loss functions are proposed for task adaptation, and a joint optimization process is conducted for task-specific network embedding. We compare MDAL with state-of-the-art and strong baselines on node classification, network visualization, and link prediction tasks. Experimental results show the effectiveness and superiority of our proposed MDAL model.

Longcan Wu, Daling Wang, Shi Feng, Yifei Zhang, Ge Yu
Net2Text: An Edge Labelling Language Model for Personalized Review Generation

Writing an item review for online shopping or sharing the dining experience of a restaurant has become a major Internet activity of young people. This kind of review system can not only help users express and exchange experience but also prompt businesses to improve service quality. Instead of taking time to type in the review, we would like to make the review process more automated. In this work, we study an edge labelling language model for personalized review generation, i.e., the problem of generating text (e.g., a review) on the edges of a network (e.g., an online shopping network). It relates to both network structure and rich text semantic information. Previously, link prediction models have been applied to recommender systems and event prediction. However, they cannot be directly migrated to text generation on the edges of networks, since most of them target numerical prediction or tag labelling tasks. To bridge the gap between link prediction and natural language generation, in this paper, we propose a model called Net2Text, which can simultaneously learn the structural information in the network and build a language model over the text on the edges. The performance of Net2Text is demonstrated in our experiments, showing that our model performs better than other baselines and is able to produce reasonable reviews between users and items.

Shaofeng Xu, Yun Xiong, Xiangnan Kong, Yangyong Zhu
Understanding Information Diffusion via Heterogeneous Information Network Embeddings

Predicting information diffusion in social networks has attracted substantial research efforts. For a specific user in a social network, whether to forward a contagion is impacted by complex interactions from both her neighboring users and the recent contagions she has been involved in, which is difficult to capture in a unified model. To address this problem, we investigate contagion adoption behavior under a set of interactions among users and contagions, which are learned as latent representations. Instead of learning each type of representation separately, we jointly encode the users and contagions into the same latent space, where their complex interaction relationships can be properly incorporated. To this end, we construct a heterogeneous information network consisting of users and contagions as two types of objects, and propose a novel random walk algorithm that uses meta-path-based proximity as a guide to learn the representations of heterogeneous objects. In the end, to predict contagion adoption, we judiciously design an effective neural network model to capture the interactions based on the representations. The evaluation results on a large-scale Sina Weibo dataset demonstrate that our proposal outperforms the competing baselines. Moreover, the latent representations are also suitable for multi-class classification of contagions.

Yuan Su, Xi Zhang, Senzhang Wang, Binxing Fang, Tianle Zhang, Philip S. Yu

Graphs

Frontmatter
Distributed Parallel Structural Hole Detection on Big Graphs

Structural holes in social networks are vertices that serve as gateways for information exchange between communities. Although many algorithms have been proposed to detect structural holes, they are not scalable to big graphs. This paper proposes a structural hole detection algorithm ESH based on distributed parallel graph processing frameworks. Instead of using substructures in social networks, the algorithm exploits a factor diffusion process in structural hole detection. The algorithm naturally fits the vertex-centric programming models and can be easily implemented on the graph-parallel processing frameworks. Extensive experiments show that ESH can handle social networks with billions of links and produce structural holes of higher quality than the existing algorithms.

Faming Li, Zhaonian Zou, Jianzhong Li, Yingshu Li, Yubiao Chen
DynGraphGAN: Dynamic Graph Embedding via Generative Adversarial Networks

Graphs have become widely adopted as a means of representing relationships between entities in many applications, and these graphs often evolve over time. Learning effective representations preserving graph topology, as well as latent patterns in temporal dynamics, has drawn increasing interest. In this paper, we investigate the problem of dynamic graph embedding, which maps a time series of graphs to a low-dimensional feature space. Most existing works in the field of dynamic representation learning either consider the temporal evolution of low-order proximity or treat high-order proximity and temporal dynamics separately. It is challenging to learn a single embedding that can preserve high-order proximity with long-term temporal dependencies. We propose a Generative Adversarial Network (GAN) based model, named DynGraphGAN, to learn robust feature representations. It consists of a generator and a discriminator trained in an adversarial process. The generator generates connections between nodes, represented by a series of adjacency matrices. The discriminator integrates a graph convolutional network for high-order proximity and a convolutional neural network for temporal dependency to distinguish real samples from fake samples produced by the generator. With iterative boosting of the generator and discriminator, node embeddings are learned to represent dynamic evolution over time. By jointly considering high-order proximity and temporal evolution, our model can preserve spatial structure with temporal dependency. DynGraphGAN is optimized on subgraphs produced by random walks to capture more complex structural and temporal patterns in the dynamic graphs. We also leverage sparsity and temporal smoothness properties to further improve model efficiency. Our model demonstrates substantial gains over several baseline models in link prediction and reconstruction tasks on real-world datasets.

Yun Xiong, Yao Zhang, Hanjie Fu, Wei Wang, Yangyong Zhu, Philip S. Yu
Evaluating Mixed Patterns on Large Data Graphs Using Bitmap Views

Developing efficient and scalable techniques for pattern queries over large graphs is crucial for modern applications such as social networks, Web analysis, and bioinformatics. In this paper, we address the problem of efficiently finding the homomorphic matches for tree pattern queries with child and descendant edges (mixed pattern queries) over a large data graph. We propose a novel type of materialized views to accelerate the evaluation. Our materialized views are the sets of occurrence lists of the nodes of the pattern in the data graph. They are stored as compressed bitmaps on the inverted lists of the node labels in the data graph. Reachability information between occurrence list nodes is provided by a node reachability index. This technique not only minimizes the materialization space but also reduces CPU and I/O costs by translating view materialization processing into bitwise operations. We provide conditions for view usability using the concept of pattern node coverage. We design a holistic bottom-up algorithm which efficiently computes pattern query matches in the data graph using bitmap views. An extensive experimental evaluation shows that our method evaluates mixed patterns up to several orders of magnitude faster than existing algorithms.

Xiaoying Wu, Dimitri Theodoratos, Dimitrios Skoutas, Michael Lan
Heterogeneous Information Network Hashing for Fast Nearest Neighbor Search

Heterogeneous information networks (HINs) are widely used to model real-world information systems due to their strong capability of capturing complex and diverse relations between multiple entities in real situations. For most analytical tasks in HINs (e.g., link prediction and node recommendation), network embedding techniques are prevalently used to project the nodes into real-valued feature vectors, based on which we can calculate the proximity between node pairs with nearest neighbor search (NNS) algorithms. However, the extensive usage of real-valued vector representations in existing network embedding methods imposes overwhelming computational and storage challenges, especially when the scale of the network is large. To tackle this issue, in this paper, we conduct an initial investigation of learning binary hash codes for nodes in HINs to obtain remarkable acceleration of NNS algorithms. Specifically, we propose a novel heterogeneous information network hashing algorithm based on collective matrix factorization. By fully characterizing various types of relations among nodes and designing a principled optimization procedure, we successfully project the nodes in a HIN into a unified Hamming space, with which the computational and storage burden of NNS can be significantly alleviated. The experimental results demonstrate that the proposed algorithm indeed leads to faster NNS and requires lower memory usage than several state-of-the-art network embedding methods, while showing comparable performance in typical learning tasks on HINs, including link prediction and cross-type node similarity search.

Zhen Peng, Minnan Luo, Jundong Li, Chen Chen, Qinghua Zheng
Learning Fine-Grained Patient Similarity with Dynamic Bayesian Network Embedded RNNs

The adoption of Electronic Health Records (EHRs) enables comprehensive analysis for robust clinical decision-making in a rapidly changing environment. Therefore, we investigate how to utilize the EHRs of historical, similar patients to provide effective and timely treatments and diagnoses under circumstances where patients are likely to respond to the therapy. In this paper, we propose a novel framework that embeds the Markov decision process into multivariate time series analysis to study meaningful distances among patients in Intensive Care Units (ICUs). Specifically, we develop a novel deep learning model, TDBNN, that employs a Triplet architecture, a Dynamic Bayesian Network (DBN), and a Recurrent Neural Network (RNN). Causal correlations among medical events are first obtained from the conditional dependencies in the DBN, and to transmit these correlations over time as temporal features, the conditional dependencies are used to construct extra connections among RNN units. With these specially designed connections, the RNN is further utilized as the fundamental component of the Triplet architecture to study fine-grained similarities among patients. The proposed method has been applied to the real-world ICU dataset MIMIC-III. Experimental results comparing our approach with several existing baselines demonstrate that it outperforms those methods and provides a promising direction for research on clinical decision support.

Yanda Wang, Weitong Chen, Bohan Li, Robert Boots
Towards Efficient k-TriPeak Decomposition on Large Graphs

Analyzing the structure of real-world networks has attracted much attention over the years, and cohesive subgraph models are commonly used to characterize the structure of a network. Recently, a model named k-Peak was proposed to address the failure of cohesive subgraph models to detect sparser regions when the network contains distinct regions of different densities. However, k-Peak only considers edge connections (i.e., degree) in the network, and this loose structure restricts its effectiveness. On the other hand, triangles are fundamental building blocks of a network and are widely used in the literature. Motivated by this, in this paper, we propose the k-TriPeak model based on triangles and study the problem of k-TriPeak decomposition, which computes the k-TriPeak for all possible k values to understand the structure of a network. Through investigating the drawbacks of the baseline algorithm that follows the idea of k-Peak decomposition, we devise a new efficient algorithm to perform the k-TriPeak decomposition. Our new algorithm adopts a top-down decomposition paradigm and integrates two novel upper bounds with which much unnecessary computation can be pruned. We conduct extensive experiments on several large real-world datasets, and the experimental results demonstrate the efficiency and effectiveness of our proposed algorithm.

Xudong Wu, Long Yuan, Xuemin Lin, Shiyu Yang, Wenjie Zhang

Knowledge Graph

Frontmatter
Evaluating the Choice of Tags in CQA Sites

Tags play a crucial role in CQA sites by facilitating the organization, indexing, and categorization of an entire post in a few words. The choice of tags determines the audience that is called upon to respond to any particular post. This could either lead to receiving an accurate response to the question or result in receiving no answers. The choice of tags thus directly determines the quality of the post as well as, to a large extent, the success of the CQA site itself. In this paper, we present a novel approach to evaluate the choice of tags in any post. We perform tag network analysis to find relationships between tags. We then find anomalous combinations of tags by performing anomaly detection. We demonstrate the robustness of our approach by showing high AUC, in the range of 0.95 to 0.98, on four datasets from Stack Exchange, namely Ask Ubuntu, Server Fault, Super User, and Software Engineering.

Rohan Banerjee, Sailaja Rajanala, Manish Singh
Fast Maximal Clique Enumeration for Real-World Graphs

Maximal Clique Enumeration (MCE) is one of the most fundamental problems in graph theory, and it has extensive applications in graph data analysis. The state-of-the-art approach (called $MCE_{degeneracy}$ in this paper) that solves the MCE problem in real-world graphs first computes the degeneracy ordering of the vertices in a given graph, and then, for each vertex, conducts the $BK_{pivot}$ algorithm in its neighborhood (called the degeneracy neighborhood in this paper). In real-world graphs, the process of degeneracy ordering produces a large number of dense degeneracy neighborhoods. But the $BK_{pivot}$ algorithm, with its bottom-up nature, adds just one vertex to the result set at each level of recursive calls and cannot efficiently solve the MCE problem in these dense degeneracy neighborhoods. In this paper, we propose a new MCE algorithm, called $BK_{rcd}$, to improve the efficiency of MCE in a dense degeneracy neighborhood by recursively conducting core decomposition in it. Contrary to $BK_{pivot}$, $BK_{rcd}$ is a top-down approach that repeatedly chooses and "removes" the vertex with the smallest degree until a clique is reached. We further integrate $BK_{rcd}$ into $MCE_{degeneracy}$ to form a hybrid approach named $MCE_{degeneracy}^{hybrid}$, which chooses $BK_{rcd}$ or $BK_{pivot}$ adaptively according to the structural properties of the degeneracy neighborhoods. Experimental results on real-world graphs show that $MCE_{degeneracy}^{hybrid}$ achieves high overall performance improvements; for example, it achieves $1.34\times$ to $2.97\times$ speedups over $MCE_{degeneracy}$ on the web graphs used in our experiments.

Yinuo Li, Zhiyuan Shao, Dongxiao Yu, Xiaofei Liao, Hai Jin
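
The top-down flavor of $BK_{rcd}$ can be conveyed with a toy peeling routine: repeatedly "remove" the vertex of smallest degree until the remainder is a clique. The real algorithm recurses on core decompositions to enumerate all maximal cliques; this sketch finds only one dense core.

```python
def is_clique(adj, nodes):
    """True if every pair of nodes in the set is adjacent."""
    return all(v in adj[u] for u in nodes for v in nodes if u != v)

def peel_to_clique(adj):
    """Top-down peeling in the spirit of BK_rcd (sketch): repeatedly
    remove the vertex of smallest remaining degree until the surviving
    vertices form a clique."""
    nodes = set(adj)
    while nodes and not is_clique(adj, nodes):
        u = min(nodes, key=lambda v: len(adj[v] & nodes))
        nodes.discard(u)
    return nodes

# adjacency sets: vertices 1-4 form a clique, vertex 5 dangles off 4
adj = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3, 5}, 5: {4}}
print(peel_to_clique(adj))  # {1, 2, 3, 4}
```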
Leveraging Knowledge Graph Embeddings for Natural Language Question Answering

A promising pathway for natural language question answering over knowledge graphs (KG-QA) is to translate natural language questions into graph-structured queries. During the translation, a vital process is to map entity/relation phrases of natural language questions to the vertices/edges of underlying knowledge graphs which can be used to construct target graph-structured queries. However, due to linguistic flexibility and ambiguity of natural language, the mapping process is challenging and has been a bottleneck of KG-QA models. In this paper, we propose a novel framework, called KemQA, which stands on recent advances in relation phrase dictionaries and knowledge graph embedding techniques to address the mapping problem and construct graph-structured queries of natural language questions. Extensive experiments were conducted on question answering benchmark datasets. The results demonstrate that our framework outperforms state-of-the-art baseline models in terms of effectiveness and efficiency.

Ruijie Wang, Meng Wang, Jun Liu, Weitong Chen, Michael Cochez, Stefan Decker
Measuring Semantic Relatedness with Knowledge Association Network

Measuring semantic relatedness between two words is a fundamental task for many applications in both the database and natural language processing domains. Conventional methods mainly utilize the latent semantic information hidden in lexical databases (WordNet) or text corpora (Wikipedia). They have made great achievements based on distance computation in the lexical tree or the co-occurrence principle in Wikipedia. However, these methods suffer from low coverage and low precision because (1) lexical databases contain abundant lexical information but lack semantic information, and (2) in Wikipedia, two related words (e.g., synonyms) may not appear within a window size or a sentence, while unrelated ones may be mentioned together by chance. To compute semantic relatedness more accurately, other approaches have made great efforts based on free association networks and achieved a significant improvement in relatedness measurement. Nevertheless, they need complex preprocessing of Wikipedia. Besides, the fixed score functions they adopt limit the flexibility and expressiveness of the models. In this paper, we leverage DBPedia and Wikipedia to construct a Knowledge Association Network (KAN), which avoids the information extraction step over Wikipedia. We propose a flexible and expressive model to represent the entities behind the words, in which attribute and topological structure information of entities are embedded in a vector space simultaneously. Experimental results on standard datasets show the better effectiveness of our model compared to previous models.

Jiapeng Li, Wei Chen, Binbin Gu, Junhua Fang, Zhixu Li, Lei Zhao
SINE: Side Information Network Embedding

Network embedding learns low-dimensional features for the nodes in a network, which benefits downstream tasks such as link prediction and node classification. Real-world networks are often accompanied by rich side information, such as attributes and labels, yet most efforts on network embedding are devoted to preserving the pure network structure. Integrating side information is challenging, since the effects of different attributes vary across nodes and unlabeled nodes can be influenced by diverse labels from their neighbors, not to mention the heterogeneity and incompleteness of the data. To overcome these issues, we propose Side Information Network Embedding (SINE), a novel and flexible framework that uses multiple kinds of side information to learn node representations. SINE defines a flexible and semantic neighborhood to model the inscape of each node and designs a random walk scheme to explore this neighborhood. It can incorporate different attribute information with particular emphasis depending on the characteristics of each node, and label information can be integrated into the representation both explicitly and implicitly. We evaluate our method and existing state-of-the-art methods on multi-class classification tasks. The experimental results on 5 real-world datasets demonstrate that our method outperforms the others on networks with side information.
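The random walk scheme can be pictured as in the following hedged sketch, where the walk alternates between structural edges and attribute-mediated jumps (the probability parameter p and the jump rule are illustrative assumptions, not SINE's exact scheme):

```python
import random

def attributed_walk(start, adj, node_attrs, attr_nodes, length=10, p=0.7):
    """With probability p follow a structural edge; otherwise hop through
    a shared attribute to another node carrying that attribute.
    adj: node -> set of neighbor nodes
    node_attrs: node -> set of attributes; attr_nodes: attribute -> set of nodes"""
    walk, cur = [start], start
    for _ in range(length - 1):
        if random.random() < p and adj[cur]:
            cur = random.choice(list(adj[cur]))        # structural step
        elif node_attrs[cur]:
            a = random.choice(list(node_attrs[cur]))   # pick an attribute
            cur = random.choice(list(attr_nodes[a]))   # jump via the attribute
        walk.append(cur)
    return walk
```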

Zitai Chen, Tongzhao Cai, Chuan Chen, Zibin Zheng, Guohui Ling
A Knowledge Graph Enhanced Topic Modeling Approach for Herb Recommendation

Traditional Chinese Medicine (TCM) plays an important role in Chinese society and is an increasingly popular therapy around the world. A data-driven herb recommendation method can help TCM doctors make scientific treatment prescriptions more precisely and intelligently in real clinical practice, which can advance TCM diagnosis and treatment. Previous works analyze only short medical case documents and ignore the rich information about symptoms and herbs as well as their relations. In this paper, we propose a novel model called Knowledge Graph Embedding Enhanced Topic Model (KGETM) for TCM herb recommendation. Our modeling strategy takes into consideration not only co-occurrence information in TCM medical cases but also the comprehensive semantic relatedness of symptoms and herbs in a TCM knowledge graph. The knowledge graph embeddings are obtained by TransE, a popular representation learning method for knowledge graphs, on our constructed TCM knowledge graph. The embeddings are then integrated into the topic model by a mixture of a Dirichlet multinomial component and a latent vector component. In addition, we propose HC-KGETM, which incorporates herb compatibility based on TCM theory to better characterize the diagnosis and treatment process. Experimental results on a TCM benchmark dataset demonstrate that the proposed method outperforms state-of-the-art approaches and show the promise of TCM knowledge graph embeddings for herb recommendation.
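For reference, TransE, the embedding method the paper builds on, scores a fact $$(h, r, t)$$ by the distance $$\Vert \mathbf{h} + \mathbf{r} - \mathbf{t} \Vert$$, so that plausible facts have low scores. A minimal sketch:

```python
import numpy as np

def transe_score(h, r, t, norm=2):
    """TransE plausibility of a fact (h, r, t): the head embedding plus
    the relation embedding should land near the tail embedding, so a
    lower distance means a more plausible fact."""
    return float(np.linalg.norm(h + r - t, ord=norm))
```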

Xinyu Wang, Ying Zhang, Xiaoling Wang, Jin Chen
Knowledge Base Error Detection with Relation Sensitive Embedding

Recently, knowledge bases (KBs) have become an increasingly essential and helpful data source for various applications and research. Although modern KBs contain billions of facts, they still suffer from incompleteness compared with the total number of facts in the real world. Furthermore, KBs may contain many inaccurate and outdated facts. Although there have been many studies dealing with the incompleteness of KBs, very few works have addressed detecting the errors in them. Broadly speaking, there are three main challenges in detecting errors in KBs: (1) symbolic and logical knowledge representations cannot detect inconsistencies well on large-scale KBs; (2) it is hard to capture the correlations between relations; and (3) there is no gold standard from which to learn or observe the patterns of inaccurate facts. In this work, we propose a Relation Sensitive Embedding Approach (RSEA) to detect inconsistencies in KBs. We first design two correlation functions to measure the relatedness between two relations. Then, a dynamic clustering algorithm is presented to aggregate highly correlated relations into the same clusters. Finally, we encode discrete knowledge facts, together with the effects of correlated relations, into a continuous vector space, which can effectively detect errors in KBs. We perform extensive experiments on two benchmark datasets, and the results show that our approach achieves high performance in detecting incorrect knowledge facts in these KBs.
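As a toy illustration of measuring relatedness between two relations, the sketch below uses the Jaccard overlap of their subject entities (a simplification we introduce for illustration; the paper's two correlation functions are more elaborate):

```python
def relation_correlation(facts, r1, r2):
    """Toy correlation between two relations: how often they share
    subject entities across a set of (head, relation, tail) facts."""
    def heads(r):
        return {h for (h, rel, t) in facts if rel == r}
    h1, h2 = heads(r1), heads(r2)
    return len(h1 & h2) / len(h1 | h2) if h1 | h2 else 0.0
```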

San Kim, Xiuxing Li, Kaiyu Li, Jianhua Feng, Yan Huang, Songfan Yang
Leon: A Distributed RDF Engine for Multi-query Processing

Similar queries appear frequently in real query logs, yet few RDF systems address this problem. In this paper, we propose Leon, a distributed RDF system that also handles the multi-query problem. First, we apply a characteristic-set-based partitioning scheme. This scheme (i) supports the fully parallel processing of joins within characteristic sets; and (ii) minimizes data communication by directly transmitting intermediate results instead of broadcasting them. Then, Leon revisits the classical problem of multi-query optimization in the context of RDF/SPARQL. In light of the NP-hardness of multi-query optimization for SPARQL, we propose a heuristic algorithm that partitions the input batch of queries into groups and discovers the common sub-queries of multiple SPARQL queries. Our MQO algorithm incorporates a cost model to generate execution plans. Our experiments with synthetic and real datasets verify that: (i) Leon's startup overhead is low; (ii) Leon consistently outperforms centralized RDF engines by 1–2 orders of magnitude, and it is competitive with state-of-the-art distributed RDF engines; and (iii) our MQO approach consistently demonstrates a $$10\times$$ speedup over the baseline method.
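Characteristic sets, the basis of Leon's partitioning scheme, group each subject by the exact set of predicates it occurs with. A minimal sketch (the triple layout and function name are assumed):

```python
from collections import defaultdict

def characteristic_set_partition(triples):
    """Partition (subject, predicate, object) triples so that subjects
    with the same predicate signature land in the same partition."""
    preds = defaultdict(set)
    for s, p, o in triples:
        preds[s].add(p)                 # collect each subject's predicates
    partitions = defaultdict(list)
    for s, p, o in triples:
        partitions[frozenset(preds[s])].append((s, p, o))
    return partitions
```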

Xintong Guo, Hong Gao, Zhaonian Zou
MathGraph: A Knowledge Graph for Automatically Solving Mathematical Exercises

Knowledge graphs are widely used in many applications. Automatically solving mathematical exercises is an interesting task that can also be enhanced by knowledge reasoning. In this paper, we design MathGraph, a knowledge graph aimed at solving high school mathematical exercises. Since solving exercises requires fine-grained mathematical derivation and calculation over different mathematical objects, the design of MathGraph differs substantially from existing knowledge graphs. MathGraph supports a wide variety of mathematical objects, operations, and constraints that may be involved in exercises. Furthermore, we propose an algorithm to align a semantically parsed exercise to MathGraph and figure out the answer automatically. Extensive experiments on real-world datasets verify the effectiveness of MathGraph.
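A toy sketch of the alignment idea, with entirely hypothetical names: a semantically parsed exercise is mapped to a MathGraph node type, whose attached solver derives the answer from the constraints.

```python
def solve_quadratic(c):
    """Roots of a*x^2 + b*x + c = 0 (assumes real roots for brevity)."""
    disc = c["b"] ** 2 - 4 * c["a"] * c["c"]
    return sorted((-c["b"] + s * disc ** 0.5) / (2 * c["a"]) for s in (1, -1))

# Hypothetical registry: exercise type -> solver attached to a graph node.
SOLVERS = {"quadratic_equation": solve_quadratic}

def solve(parsed_exercise):
    """Dispatch a parsed exercise to the solver of its aligned node type."""
    return SOLVERS[parsed_exercise["type"]](parsed_exercise["constraints"])

# e.g. solve({"type": "quadratic_equation",
#             "constraints": {"a": 1, "b": -3, "c": 2}})  ->  [1.0, 2.0]
```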

Tianyu Zhao, Yan Huang, Songfan Yang, Yuyu Luo, Jianhua Feng, Yong Wang, Haitao Yuan, Kang Pan, Kaiyu Li, Haoda Li, Fu Zhu
Multi-hop Path Queries over Knowledge Graphs with Neural Memory Networks

There has been increasing research interest in inferring missing information from existing knowledge graphs (KGs), due to the emergence of a wide range of downstream applications such as question answering systems and search engines. Reasoning over knowledge graphs, which queries the correct entities through a path consisting of multiple consecutive relations/hops from the starting entity, is an effective approach to this task, but it has rarely been studied. One attempt, the compositional training method, treats constructed multi-hop paths and one-hop relations equally when building training data, and then trains conventional knowledge graph completion models such as TransE in a compositional manner on that data. However, it does not incorporate additional information along the paths during training, such as the intermediate entities and their types, which can help guide the reasoning toward the correct destination answers. Moreover, compositional training can only extend existing models that are composable, which significantly limits its applicability. Therefore, we design a novel model based on the recently proposed neural memory networks, which have large external memories and flexible writing/reading schemes, to address these problems. Specifically, we first introduce a single network layer, which is then used as the building block for a multi-layer neural network called TravNM, and we develop a flexible memory updating method that writes intermediate entity information into the memories during multi-hop reasoning. Finally, we conducted extensive experiments on large datasets, and the results show the superiority of TravNM for reasoning over knowledge graphs with multiple hops.
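A hedged sketch of one memory write step (the attention-gated blend below is a generic memory-network pattern, not necessarily the paper's exact updating scheme):

```python
import numpy as np

def write_memory(memory, entity_vec, query_vec):
    """Write an intermediate entity into external memory during one hop.
    memory: (slots, dim) matrix; entity_vec, query_vec: (dim,) vectors."""
    attn = memory @ query_vec                  # relevance of each slot
    weights = np.exp(attn - attn.max())
    weights /= weights.sum()                   # softmax over slots
    # blend the intermediate entity into memory, weighted by attention
    return memory + weights[:, None] * (entity_vec - memory)
```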

Qinyong Wang, Hongzhi Yin, Weiqing Wang, Zi Huang, Guibing Guo, Quoc Viet Hung Nguyen
Sentiment Classification by Leveraging the Shared Knowledge from a Sequence of Domains

This paper studies sentiment classification in a setting where a sequence of classification tasks is performed over time. The goal is to leverage the knowledge gained from previous tasks to perform better on the new task than would be possible without that knowledge. This is a lifelong learning setting. This paper proposes a novel deep learning model for lifelong sentiment classification. The key novelty of the proposed model is that it uses two networks: a knowledge retention network for retaining domain-specific knowledge learned in the past, and a feature learning network for learning classification features. The two networks work together to perform the classification task. Our experimental results show that the proposed deep learning model outperforms the state-of-the-art baselines.
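Conceptually, the two-network split might look like the following sketch (layer shapes and the fusion by concatenation are our assumptions, not the paper's architecture):

```python
import numpy as np

def predict(doc_vec, past_knowledge_vec, W_f, W_k, W_out):
    """Classify a document using two cooperating networks: one encodes
    the current document, the other encodes knowledge retained from
    previously seen domains; a shared classifier consumes both."""
    features = np.tanh(W_f @ doc_vec)              # feature learning network
    retained = np.tanh(W_k @ past_knowledge_vec)   # knowledge retention network
    logits = W_out @ np.concatenate([features, retained])
    return np.exp(logits) / np.exp(logits).sum()   # softmax over classes
```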

Guangyi Lv, Shuai Wang, Bing Liu, Enhong Chen, Kun Zhang
Backmatter
Metadata
Title
Database Systems for Advanced Applications
edited by
Prof. Guoliang Li
Jun Yang
Prof. Dr. Joao Gama
Dr. Juggapong Natwichai
Yongxin Tong
Copyright Year
2019
Electronic ISBN
978-3-030-18576-3
Print ISBN
978-3-030-18575-6
DOI
https://doi.org/10.1007/978-3-030-18576-3