Skip to main content

2018 | Buch

Database Systems for Advanced Applications

23rd International Conference, DASFAA 2018, Gold Coast, QLD, Australia, May 21-24, 2018, Proceedings, Part I

insite
SUCHEN

Über dieses Buch

This two-volume set LNCS 10827 and LNCS 10828 constitutes the refereed proceedings of the 23rd International Conference on Database Systems for Advanced Applications, DASFAA 2018, held in Gold Coast, QLD, Australia, in May 2018.
The 83 full papers, 21 short papers, 6 industry papers, and 8 demo papers were carefully selected from a total of 360 submissions. The papers are organized around the following topics: network embedding; recommendation; graph and network processing; social network analytics; sequence and temporal data processing; trajectory and streaming data; RDF and knowledge graphs; text and data mining; medical data mining; security and privacy; search and information retrieval; query processing and optimizations; data quality and crowdsourcing; learning models; multimedia data processing; and distributed computing.

Inhaltsverzeichnis

Frontmatter
Erratum to: Free-Rider Episode Screening via Dual Partition Model
Jian Pei, Yannis Manolopoulos, Shazia Sadiq, Jianxin Li

Network Embedding

Frontmatter
Enhancing Network Embedding with Auxiliary Information: An Explicit Matrix Factorization Perspective

Recent advances in the field of network embedding have shown the low-dimensional network representation is playing a critical role in network analysis. However, most of the existing principles of network embedding do not incorporate auxiliary information such as content and labels of nodes flexibly. In this paper, we take a matrix factorization perspective of network embedding, and incorporate structure, content and label information of the network simultaneously. For structure, we validate that the matrix we construct preserves high-order proximities of the network. Label information can be further integrated into the matrix via the process of random walk sampling to enhance the quality of embedding in an unsupervised manner, i.e., without leveraging downstream classifiers. In addition, we generalize the Skip-Gram Negative Sampling model to integrate the content of the network in a matrix factorization framework. As a consequence, network embedding can be learned in a unified framework integrating network structure and node content as well as label information simultaneously. We demonstrate the efficacy of the proposed model with the tasks of semi-supervised node classification and link prediction on a variety of real-world benchmark network datasets.

Junliang Guo, Linli Xu, Xunpeng Huang, Enhong Chen
Attributed Network Embedding with Micro-meso Structure

Recently, network embedding has received a large amount of attention in network analysis. Although some network embedding methods have been developed from different perspectives, on one hand, most of the existing methods only focus on leveraging the plain network structure, ignoring the abundant attribute information of nodes. On the other hand, for some methods integrating the attribute information, only the lower-order proximities (e.g. microscopic proximity structure) are taken into account, which may suffer if there exists the sparsity issue and the attribute information is noisy. To overcome this problem, the attribute information and mesoscopic community structure are utilized. In this paper, we propose a novel network embedding method termed Attributed Network Embedding with Micro-Meso structure (ANEM), which is capable of preserving both the attribute information and the structural information including the microscopic proximity structure and mesoscopic community structure. In particular, both the microscopic proximity structure and node attributes are factorized by Nonnegative Matrix Factorization (NMF), from which the low-dimensional node representations can be obtained. For the mesoscopic community structure, a community membership strength matrix is inferred by a generative model from the linkage structure, which is then factorized by NMF to obtain the low-dimensional node representations. The three components are jointly correlated by the low-dimensional node representations, from which an objective function can be defined. An efficient alternating optimization scheme is proposed to solve the optimization problem. Extensive experiments have been conducted to confirm the superior performance of the proposed model over the state-of-the-art network embedding methods.

Juan-Hui Li, Chang-Dong Wang, Ling Huang, Dong Huang, Jian-Huang Lai, Pei Chen
An Efficient Exact Nearest Neighbor Search by Compounded Embedding

Nearest neighbor search (NNS) in high dimensional space is a fundamental and essential operation in applications from many domains, such as machine learning, databases, multimedia and computer vision. In this paper, we first propose a novel and effective distance lower bound computation technique for Euclidean distance by using the combination of linear and non-linear embedding methods. As such, each point in a high dimensional space can be embedded into a low dimensional space such that the distance between two embedded points lower bounds their distance in the original space. Following the filter-and-verify paradigm, we develop an efficient exact NNS algorithm by pruning candidates using the new lower bounding technique and hence reducing the cost of expensive distance computation in high dimensional space. Our comprehensive experiments on 10 real-life and diverse datasets, including image, video, audio and text data, demonstrate that our new algorithm can significantly outperform the state-of-the-art exact NNS techniques.

Mingjie Li, Ying Zhang, Yifang Sun, Wei Wang, Ivor W. Tsang, Xuemin Lin
BASSI: Balance and Status Combined Signed Network Embedding

Signed social networks have both positive and negative links which convey rich information such as trust or distrust, like or dislike. However, existing network embedding methods mostly focus on unsigned networks and ignore the negative interactions between users. In this paper, we investigate the problem of learning representations for signed networks and present a novel deep network structure to incorporate both the balance and status theory in signed networks. With the proposed framework, we can simultaneously learn the node embedding encoding the status of a node and the edge embedding denoting the sign of an edge. Furthermore, the learnt node and edge embeddings can be directly applied to the sign prediction and node ranking tasks. Experiments on real-world social networks demonstrate that our model significantly outperforms the state-of-the-art baselines.

Yiqi Chen, Tieyun Qian, Ming Zhong, Xuhui Li

Recommendation

Frontmatter
Geographical Relevance Model for Long Tail Point-of-Interest Recommendation

Point-of-Interest (POI) recommendation plays a key role in people’s daily life, and has been widely studied in recent years, due to its increasingly applications (e.g., recommending new restaurants for users). One of important phenomena in the POI recommendation community is the data sparsity, which makes deep impact on the quality of recommendation. Existing works have proposed various models to alleviate the bottleneck of the data sparsity, and most of these works addressed this issue from the user perspective. To the best of our knowledge, few attention has been made to address this issue from the POI perspective. In this paper, we observe that the “long tail” POIs, which have few check-ins and have less opportunity to be exposed, take up a great proportion among all the POIs. It is interesting and meaningful to investigate the long tail POI recommendation from the POI perspective. To this end, this paper proposes a new model, named GRM (geographical relevance model), that expands POI profiles via relevant POIs and employs the geographical information, addressing the limitations of existing models. Experimental results based on two public datasets demonstrate that our model is effective and competitive. It outperforms state-of-the-art models for the long tail POI recommendation problem.

Wei Liu, Zhi-Jie Wang, Bin Yao, Mengdie Nie, Jing Wang, Rui Mao, Jian Yin
Exploiting Context Graph Attention for POI Recommendation in Location-Based Social Networks

The prevalence of mobile devices and the increasing popularity of location-based social networks (LBSNs) generate a large volume of user mobility data. As a result, POI recommendation systems, which play a vital role in connecting users and POIs, have received extensive attention from both research and industry communities in the past few years. The challenges of POI recommendation come from the very sparse user check-in records with only positive feedback and how to integrate heterogeneous information of users and POIs. The state-of-the-art methods usually exploit the social influence from friends and geographical influence from neighboring POIs for recommendation. However, there are two drawbacks that hinder their performance. First, they cannot model the different degree of influence from different friends to a user. Second, they ignore the user check-ins as context information for preference modeling in the collaborative filtering framework.To address the limitations of existing methods, we propose a Context Graph Attention (CGA) model, which can integrate context information encoded in different context graphs with the attention mechanism for POI recommendation. CGA first uses two context-aware attention networks to learn the influence weights of different friends and neighboring POIs respectively. At the same time, it applies a dual attention network, which considers the mutual influence of context POIs for a user and the context users for a POI, to learn the influence weights of different context vertices in the user-POI context graph. A multi-layer perceptron integrates the context vectors of users and POIs for estimating the visiting probability of a user to a POI. To the best of our knowledge, this is the first work that applies the attention mechanism for POI recommendation. Extensive experiments on two public check-in data sets show that CGA can outperform the state-of-the-art methods as well as other attentive collaborative filtering methods substantially.

Siyuan Zhang, Hong Cheng
Restricted Boltzmann Machine Based Active Learning for Sparse Recommendation

In recommender systems, users’ preferences are expressed as ratings (either explicit or implicit) for items. In general, more ratings associated with users or items are elicited, more effective the recommendations are. However, almost all user rating datasets are sparse in the real-world applications. To acquire more ratings, the active learning based methods have been used to selectively choose the items (called interview items) to ask users for rating, inspired by that the usefulness of each rating may vary significantly, i.e., different ratings may bring a different amount of information about the user’s tastes. Nevertheless, existing active learning based methods, including both static methods and decision-tree based methods, encounter the following limitations. First, the interview item set is predefined in the static methods, and they do not consider the user’s responses when asking the next question in the interview process. Second, the interview item set in the decision tree based methods is very small (i.e., usually less than 50 items), which leads to that the interview items cannot fully reflect or capture the diverse user interests, and most items do not have the opportunity to obtain additional ratings. Moreover, these decision tree based methods tend to choose popular items as the interview items instead of items with sparse ratings (i.e., sparse items), resulting in “Harry Potter Effect” (http://bickson.blogspot.com.au/2012/09/harry-potter-effect-on-recommendations.html). To address these limitations, we propose a new active learning framework based on RBM (Restricted Boltzmann Machines) to add ratings for sparse recommendation in this paper. The superiority of this method is demonstrated on two publicly available real-life datasets.

Weiqing Wang, Hongzhi Yin, Zi Huang, Xiaoshuai Sun, Nguyen Quoc Viet Hung
Discrete Binary Hashing Towards Efficient Fashion Recommendation

How to match clothing well is always a troublesome problem in our daily life, especially when we are shopping online to select a pair of matched pieces of clothing from tens of thousands available selections. To help common customers overcome selection difficulties, recent studies in the recommender system area have started to infer the fashion matching results automatically. The conventional fashion recommendation is normally achieved by considering visual similarity of clothing items or/and item co-purchase history from existing shopping transactions. Due to the high complexity of visual features and the lack of historical item purchase records, most of the existing work is unlikely to make an efficient and accurate recommendation. To address the problem, in this paper we propose a new model called Discrete Supervised Fashion Coordinates Hashing (DSFCH). Its main objective is to learn meaningful yet compact high level features of clothing items, which are represented as binary hash codes. In detail, this learning process is supervised by a clothing matching matrix, which is initially constructed based on limited known matching pairs and subsequently on the self-augmented ones. The proposed model jointly learns the intrinsic matching patterns from the matching matrix and the binary representations from the clothing items’ images, where the visual feature of each clothing item is discretized into a fixed-length binary vector. The binary representation learning significantly reduces the memory cost and accelerates the recommendation speed. The experiments compared with several state-of-the-art approaches have evidenced the superior performance of the proposed approach on efficient fashion recommendation.

Luyao Liu, Xingzhong Du, Lei Zhu, Fumin Shen, Zi Huang
Learning Dual Preferences with Non-negative Matrix Tri-Factorization for Top-N Recommender System

In recommender systems, personal characteristic is possessed by not only users but also displaying products. Users have their personal rating patterns while products have different characteristics that attract users. This information can be explicitly exploited from the review text. However, most existing methods only model the review text as a topic preference of products, without considering the perspectives of users and products simultaneously. In this paper, we propose a user-product topic model to capture both user preferences and attractive characteristics of products. Different from conventional collaborative filtering in conjunction with topic models, we use non-negative matrix tri-factorization to jointly reveal the characteristic of users and products. Experiments on two real-world data sets validate the effectiveness of our method in Top-N recommendations.

Xiangsheng Li, Yanghui Rao, Haoran Xie, Yufu Chen, Raymond Y. K. Lau, Fu Lee Wang, Jian Yin
Low-Rank and Sparse Cross-Domain Recommendation Algorithm

In this paper, we propose a novel Cross-Domain Collaborative Filtering (CDCF) algorithm termed Low-rank and Sparse Cross-Domain (LSCD) recommendation algorithm. Different from most of the CDCF algorithms which tri-factorize the rating matrix of each domain into three low dimensional matrices, LSCD extracts a user and an item latent feature matrix for each domain respectively. Besides, in order to improve the performance of recommendations among correlated domains by transferring knowledge and uncorrelated domains by differentiating features in different domains, the features of users are separated into shared and domain-specific parts adaptively. Specifically, a low-rank matrix is used to capture the shared feature subspace of users and a sparse matrix is used to characterize the discriminative features in each specific domain. Extensive experiments on two real-world datasets have been conducted to confirm that the proposed algorithm transfers knowledge in a better way to improve the quality of recommendation and outperforms the state-of-the-art recommendation algorithms.

Zhi-Lin Zhao, Ling Huang, Chang-Dong Wang, Dong Huang
Cross-Domain Recommendation for Cold-Start Users via Neighborhood Based Feature Mapping

Traditional Collaborative Filtering (CF) models mainly focus on predicting a user’s preference to the items in a single domain such as the movie domain or the music domain. A major challenge for such models is the data sparsity problem, and especially, CF cannot make accurate predictions for the cold-start users who have no ratings at all. Although Cross-Domain Collaborative Filtering (CDCF) is proposed for effectively transferring users’ rating preference across different domains, it is still difficult for existing CDCF models to tackle the cold-start users in the target domain due to the extreme data sparsity. In this paper, we propose a Cross-Domain Latent Feature Mapping (CDLFM) model for cold-start users in the target domain. Firstly, the user rating behavior is taken into consideration in the matrix factorization for alleviating the data sparsity. Secondly, neighborhood based latent feature mapping is proposed to transfer the latent features of a cold-start user from the auxiliary domain to the target domain. Extensive experiments on two real datasets extracted from Amazon transaction data demonstrate the superiority of our proposed model against other state-of-the-art methods.

Xinghua Wang, Zhaohui Peng, Senzhang Wang, Philip S. Yu, Wenjing Fu, Xiaoguang Hong

Graph and Network Data Processing

Frontmatter
K-Connected Cores Computation in Large Dual Networks

Computing $$k\text {-}core$$s is a fundamental and important graph problem, which can be applied in many areas, such as community detection, network visualization, and network topology analysis. Due to the complex relationship between different entities, dual graph widely exists in the applications. A dual graph contains a physical graph and a conceptual graph, both of which have the same vertex set. Given that there exist no previous studies on the $$k\text {-}core$$ in dual graphs, we formulate a k-connected core ($$k\text {-}CCO$$) model in dual graphs. A $$k\text {-}CCO$$ is a $$k\text {-}core$$ in the conceptual graph, and also connected in the physical graph. Given a dual graph and an integer k, we propose a polynomial time algorithm for computing all $$k\text {-}CCO$$s. We also propose three algorithms for computing all maximum-connected cores ($$MCCO$$), which are the existing $$k\text {-}CCO$$s such that a $$(k+1)$$-$$CCO$$ does not exist. We conduct extensive experiments on six real-world datasets and several synthetic datasets. The experimental results demonstrate the effectiveness and efficiency of our proposed algorithms.

Lingxi Yue, Dong Wen, Lizhen Cui, Lu Qin, Yongqing Zheng
Graph Clustering with Local Density-Cut

In this paper, we introduce a new graph clustering algorithm, called Dcut. The basic idea is to envision the graph clustering as a local density-cut problem. To identify meaningful communities in a graph, a density-connected tree is first constructed in a local fashion. Building upon the local intuitive density-connected tree, Dcut allows partitioning a graph into multiple densely tight-knit clusters effectively and efficiently. We have demonstrated that our method has several attractive benefits: (a) Dcut provides an intuitive criterion to evaluate the goodness of a graph clustering in a more precise way; (b) Building upon the density-connected tree, Dcut allows identifying high-quality clusters; (c) The density-connected tree also provides a connectivity map of vertices in a graph from a local density perspective. We systematically evaluate our new clustering approach on synthetic and real-world data sets to demonstrate its good performance.

Junming Shao, Qinli Yang, Zhong Zhang, Jinhu Liu, Stefan Kramer
External Topological Sorting in Large Graphs

Topological sorting is a fundamental problem in graph analysis. Given the fact that real world graphs grow rapidly so that they cannot entirely reside in main memory, in this paper, we study external memory algorithms for the topological sorting problem. We propose a contraction-expansion paradigm and devise an external memory algorithm based on the paradigm for the topological sorting problem. Our new algorithm is efficient due to the introduction of the new paradigm and can be implemented easily by using the fundamental external memory primitives. We conduct extensive experiments on real and synthesis graphs and the results demonstrate the efficiency of our proposed algorithm.

Zhu Qing, Long Yuan, Fan Zhang, Lu Qin, Xuemin Lin, Wenjie Zhang
Finding All Nearest Neighbors with a Single Graph Traversal

Finding the nearest neighbor is a key operation in data analysis and mining. An important variant of nearest neighbor query is the all nearest neighbor (ANN) query, which reports all nearest neighbors for a given set of query objects. Existing studies on ANN queries have focused on Euclidean space. Given the widespread occurrence of spatial networks in urban environments, we study the ANN query in spatial network settings. An example of an ANN query on spatial networks is finding the nearest car parks for all cars currently on the road. We propose VIVET, an index-based algorithm to efficiently process ANN queries. VIVET performs a single traversal on a spatial network to precompute the nearest data object for every vertex in the network, which enables us to answer an ANN query through a simple lookup on the precomputed nearest neighbors. We analyze the cost of the proposed algorithm both theoretically and empirically. Our results show that the algorithm is highly efficient and scalable. It outperforms adapted state-of-the-art nearest neighbor algorithms in both precomputation and query processing costs by more than one order of magnitude.

Yixin Xu, Jianzhong Qi, Renata Borovica-Gajic, Lars Kulik
Towards Efficient Path Skyline Computation in Bicriteria Networks

Path skyline query is a fundamental problem in bicriteria network analysis and is widely applied in a variety of applications. Given a source s and a destination t in a bicriteria network G, path skyline query aims to identify all the skyline paths from s to t in G. In the literature, $$\mathsf {PSQ}$$ is a fundamental algorithm for path skyline query and is also used as a building block for the afterwards proposed algorithms. In $$\mathsf {PSQ}$$, a key operation is to record the skyline paths from s to v for each node v that is possible on the skyline paths from s to t. However, to obtain the skyline paths for v, $$\mathsf {PSQ}$$ has to maintain other paths that are not skyline paths for v, which makes $$\mathsf {PSQ}$$ inefficient. Motivated by this, in this paper, we propose a new algorithm $$\mathsf {PSQ^+}$$ for the path skyline query. By adopting an ordered path exploring strategy, our algorithm can totally avoid the fruitless path maintenance problem in $$\mathsf {PSQ}$$. We evaluate our proposed algorithm on real networks and the experimental results demonstrate the efficiency of our proposed algorithm. Besides, the experimental results also demonstrate the algorithm that uses $$\mathsf {PSQ}$$ as a building block for the path skyline query can achieve a significant performance improvement after we substitute $$\mathsf {PSQ^+}$$ for $$\mathsf {PSQ}$$.

Dian Ouyang, Long Yuan, Fan Zhang, Lu Qin, Xuemin Lin
Answering Why-Not Questions on Structural Graph Clustering

Structuralgraphclustering is one fundamental problem in managing and analyzing graph data. As a fast and exact density based graph clustering algorithm, pSCAN is widely used to discover meaningful clusters in many different graph applications. The problem of explaining why-not questions on pSCAN is to find why an expected vertex is not included in the specified cluster of the pSCAN results. Obviously, the pSCAN results are sensitive to two parameters: (i) the similarity threshold $$\epsilon $$; and (ii) the density constraint $$\mu $$, when them are not set good enough, some expected vertices would be missing in the specified clusters. To tackle this problem, we firstly analyze that how the parameters affect the results of pSCAN, then we propose two novel explanation algorithms to explain why-not questions on pSCAN by offering some advices on how to refine the initial pSCAN with minimum penalty from two perspectives: (i) modifying the parameter $$\epsilon $$; and (ii) modifying the parameter $$\mu $$. Moreover, we present some constraints to ensure the original pSCAN results are retained as much as possible in the results of refined pSCAN. Finally, we conduct comprehensive experimental studies, which show that our approaches can efficiently return high-quality explanations for why-not questions on pSCAN.

Chuanyu Zong, Xiufeng Xia, Bin Wang, Xiaochun Yang, Jiajia Li, Xiangyu Liu, Rui Zhu
SSRW: A Scalable Algorithm for Estimating Graphlet Statistics Based on Random Walk

Mining graphlet statistics is very meaningful due to its wide applications in social networks, bioinformatics and information security, etc. However, it is a big challenge to exactly count graphlet statistics as the number of subgraphs exponentially increases with the graph size, so sampling algorithms are widely used to estimate graphlet statistics within reasonable time. However, existing sampling algorithms are not scalable for large graphlets, e.g., they may get stuck when estimating graphlets with more than five nodes. To address this issue, we propose a highly scalable algorithm, Scalable subgraph Sampling via Random Walk (SSRW), for graphlet counts and concentrations. SSRW samples graphlets by generating new nodes from the neighbors of previously visited nodes instead of fixed ones. Thanks to this flexibility, we can generate anyk-graphlets in a unified way and estimate statistics of k-graphlet efficiently even for large k. Our extensive experiments on estimating counts and concentrations of $$\{4,5,6,7\}$$-graphlets show that SSRW algorithm is scalable, accurate and fast.

Chen Yang, Min Lyu, Yongkun Li, Qianqian Zhao, Yinlong Xu
Multi-metric Graph Query Performance Prediction

We propose a general framework for predicting graph query performance with respect to three performance metrics: execution time, query answer quality, and memory consumption. The learning framework generates and makes use of informative statistics from data and query structure and employs a multi-label regression model to predict the multi-metric query performance. We apply the framework to study two common graph query classes—reachability and graph pattern matching; the two classes differ significantly in their query complexity. For both query classes, we develop suitable performance models and learning algorithms to predict the performance. We demonstrate the efficacy of our framework via experiments on real-world information and social networks. Furthermore, by leveraging the framework, we propose a novel workload optimization algorithm and show that it improves the efficiency of workload management by 54% on average.

Keyvan Sasani, Mohammad Hossein Namaki, Yinghui Wu, Assefaw H. Gebremedhin
A Privacy-Preserving Framework for Subgraph Pattern Matching in Cloud

The growing popularity of storing large data graphs in cloud has inspired the emergence of subgraph pattern matching on a remote cloud, which is usually defined in terms of subgraph isomorphism. However, it is an NP-complete problem and too strict to find useful matches in certain applications. In addition, there exists another important concern, i.e., how to protect the privacy of data graphs in subgraph pattern matching without undermining matching results. To tackle these problems, we propose a novel framework to achieve the privacy-preserving subgraph pattern matching via strong simulation in cloud. Firstly, we develop a k-automorphism model based method to protect structural privacy in data graphs. Additionally, we use a cost-model based label generalization method to protect label privacy in both data graphs and pattern graphs. Owing to the symmetry in a k-automorphic graph, the subgraph pattern matching can be answered using the outsourced graph, which is only a subset of a k-automorphic graph. The efficiency of subgraph pattern matching can be greatly improved by this way. Extensive experiments on real-world datasets demonstrate the high efficiency and effectiveness of our framework.

Jiuru Gao, Jiajie Xu, Guanfeng Liu, Wei Chen, Hongzhi Yin, Lei Zhao
Adaptive and Parallel Data Acquisition from Online Big Graphs

Acquisition of contents from online big graphs (OBGs) like linked Web pages, social networks and knowledge graphs, is critical as data infrastructure for Web applications and massive data analysis. However, effective data acquisition is challenging due to the massive, heterogeneous, dynamically evolving properties of OBGs with unknown global topological structures. In this paper, we give an adaptive and parallel approach for effective data acquisition from OBGs. We adopt the ideas of Quasi Monte Carlo (QMC) and branch & bound methods to propose an adaptive Web-scale sampling algorithm for parallel data collection implemented upon Spark. Experimental results show the effectiveness and efficiency of our method.

Zidu Yin, Kun Yue, Hao Wu, Yingjie Su
Answering the Why-Not Questions of Graph Query Autocompletion

Graph query autocompletion (gQAC) helps users formulate graph queries in a visual environment (a.k.a GUI). It takes a graph query that the user is formulating as input and generates a ranked list of query suggestions. Since it is impossible to accurately predict the user’s target query, the current state-of-the-art of gQAC sometimes fails to produce useful suggestions. In such scenarios, it is natural for the user to ask why are useful suggestions not returned. In this paper, we address the why-not questions of gQAC. Specifically, given an intermediate query q, a target query $$q_t$$, and a gQAC system X, the why-not questions of gQAC seek for the minimal refinement of the configuration of X, with respect to a penalty model, such that at least one useful suggestion towards $$q_t$$ appears in the returned suggestions. We propose a generic ranking function for existing gQAC systems. We propose a search algorithm for the why-not questions.

Guozhong Li, Nathan Ng, Peipei Yi, Zhiwei Zhang, Byron Choi
Exploiting Reshaping Subgraphs from Bilateral Propagation Graphs

Given a graph over which defects, viruses, or contagions spread, leveraging a set of highly correlated subgraphs is an appealing research area with many applications. However, the challenges abound. Firstly, an initial defect in one node can cause different defects in other nodes. Second, while the time is the most significant medium to understand diffusion processes, it is not clear when the members of a subgraph may change. Third, given a pair of nodes, a contagion can spread in both directions. Previous works only consider the sequential time-window and suppose that the contagion may spread from one node to the other during a predefined time span. But the propagation can differ in various temporal dimensions (e.g. hours and days). Therefore, we propose a framework that takes both sequential and multi-aspect attributes of the time into consideration. Moreover, we devise an empirical model to estimate how frequently the subgraphs may reshape. Experiment show that our framework can effectively leverage the reshaping subgraphs.

Saeid Hosseini, Hongzhi Yin, Ngai-Man Cheung, Kan Pak Leng, Yuval Elovici, Xiaofang Zhou

Social Network Analytics

Frontmatter
Sample Location Selection for Efficient Distance-Aware Influence Maximization in Geo-Social Networks

In geo-social networks, the distances of users to a location play an important role in populating the business or campaign at the location. Thereby, the problem of Distance-Aware Influence Maximization (DAIM) has been investigated recently. The efficiency of DAIM computation heavily relies on the sample location selection, because the online seeding performance is sensitive to the distance between sample location and promoted location, and the offline precomputation performance is sensitive to the number of samples. However, there is no work to fully study the problem of sample location selection w.r.t. DAIM in geo-social networks. To do this, we first formalize the problem under a reasonable assumption that a promoted location always adheres to the distribution of users. Then, we propose an efficient location sampling approach based on the heuristic anchor point selection and facility allocation techniques. Our experimental results on two real datasets demonstrate that our approach can improve the online and offline efficiency of DAIM approach like [9] by orders of magnitude.

Ming Zhong, Qian Zeng, Yuanyuan Zhu, Jianxin Li, Tieyun Qian
Identifying Topical Opinion Leaders in Social Community Question Answering

Social community question answering (SCQA) sites not only provide regular question answering (QA) service but also form a social network where users can follow each other. Identifying topical opinion leaders who are both expert and influential in SCQA becomes a hot research topic. However, existing works focus on either using knowledge expertise to find experts for improving the quality of answers, or measuring user influence to identify influential ones. In this paper, we propose QALeaderRank, a topical opinion leader identification framework, incorporating both the topic-sensitive influence and the topical knowledge expertise. To measure a user’s topic-sensitive influence, we design a novel ranking algorithm that exploits both the social and QA features of SCQA, taking account of the network structure, topical similarity and knowledge authority. Besides, we incorporate three topic-relevant metrics to infer the topical expertise. Extensive experiments along with a user study demonstrate that QALeaderRank outperforms the compared state-of-the-art methods. QALeaderRank can also be used to identify multi-topic opinion leaders.

Tao Zhao, Hong Huang, Xiaoming Fu
Personalized Geo-Social Group Queries in Location-Based Social Networks

Geo-social group query, one of the most important issues in LBSNs, combines both location and social factors to generate useful computational results, which is attracting increasing interests from both industrial and academic communities. In this paper, we propose a new type of queries, personalized geo-social group (PGSG) queries, which aim to retrieve both a user group and a venue. Specifically, a PGSG query intends to find a group-venue pattern (consisting of a venue and a group of users with size h), where each user in the group is socially connected with at least c other users in the group and the maximum distance of all the users in the group to the venue is minimized. To tackle the problem of the PGSG query, we propose GVPS, a novel search algorithm to find the optimal user group and venue simultaneously. Moreover, we extend the PGSG query to top-k personalized geo-social group (TkPGSG) query. Instead of finding the optimal solution in the PGSG query, the TkPGSG query is to return multiple feasibility solutions to guarantee the diversity. We propose an advanced search algorithm TkPH to address the TkPGSG query. Comprehensive experimental results demonstrate the efficiency and effectiveness of our proposed approaches in processing the PGSG query and the TkPGSG query on large real-world datasets.

Yuliang Ma, Ye Yuan, Guoren Wang, Xin Bi, Yishu Wang
Tracking Dynamic Magnet Communities: Insights from a Network Perspective

Communities, such as user groups, companies and countries, are important objects in social systems. Recently, researchers have proposed numerous quantitative indicators to measure the attractiveness of a community. However, most of these indicators are mainly under static settings and lack the predictive power of future impact/attractiveness. Meanwhile, in many real-world applications, especially in finance, it is of great interest for the stakeholders to identify the communities, not necessarily the most influential ones at the moment, but the future leaders for years to come. Given the increasing availability of entity-community interaction evolution records, it’s natural to exploit them to model the network changes of communities. We refer the change of community interaction as attention flow and define communities that will sustainably attract more entities’ attentions than others in a future time interval as dynamic magnet communities. We study the problem of dynamic magnet community identification based on entity-community interaction evolution records. Two major challenges are identified as follows: (1) temporal dynamics, it’s difficult to model the rising-declining trend of interactions; (2) sustainability constraints, the effect of attention flow on community prosperity is complex, where too rapid attention growth increases the corruption risks. In response, we propose to model the interaction network evolution of different communities over time by lasso based growth curve fitting. Taking sustainable attention flow into account, we measure attention flow utility from benefit and risk perspectives, and further present a hybrid approach of local and global ranking to track dynamic magnet communities. Due to the lack of dataset for testing, we collected a dataset of international business merger and acquisition network among different countries in the world. The experimental results demonstrate the effectiveness of our proposed model.

Chang Liao, Yun Xiong, Xiangnan Kong, Yangyong Zhu
Discovering Strong Communities with User Engagement and Tie Strength

In this paper, we propose and study a novel cohesive subgraph model, named ($$k$$,$$s$$)-core, which requires each user to have at least k familiars or friends (not just acquaintances) in the subgraph. The model considers both user engagement and tie strength to discover strong communities. We compare the ($$k$$,$$s$$)-core model with $$k$$-core and $$k$$-truss theoretically and experimentally. We propose efficient algorithms to compute the ($$k$$,$$s$$)-core and decompose the graph by a particular sub-model $$k$$-fami. Extensive experiments show (1) our ($$k$$,$$s$$)-core and $$k$$-fami are effective cohesive subgraph models and (2) the ($$k$$,$$s$$)-core computation and $$k$$-fami decomposition are efficient on various real-life social networks.

Fan Zhang, Long Yuan, Ying Zhang, Lu Qin, Xuemin Lin, Alexander Zhou
Functional-Oriented Relationship Strength Estimation: From Online Events to Offline Interactions

Link mining/analysis over network has received widespread attention from researchers. Recently, there has been growing interest in measuring relationship strength between entities based on attribute similarity. However, limited work has assessed the competitive advantage of functional elements in relationship strength quantification. The functional elements embody the growth/development nature of the relationship. Motivated by the availability of large volumes of online event records that can potentially reveal underlying functional socio-economic characteristics, we study the problem of offline relationship strength estimation with functional elements awareness from online events. Two major challenges are identified as follows: (1) informal information, online events are of high dimensions, and not all the learnt functions of online events are predictive to offline interactions; (2) heterogeneous dependency, it’s hard to measure the relationship strength by modeling functional elements with network effects jointly. To handle these challenges, we propose generalized relationship strength estimation model (gStrength), a novel approach for relationship strength estimation. First, we define the combination of latent roles and observed groups as generalized roles, and present generalized role constrained latent topic model to make the extracted latent functions compatible with offline interactions. Second, we model the functional elements and further extend them to structural dependency settings to quantify relationship strength. We apply this approach to the political and economic application scenario of measuring international investment relations. The experimental results demonstrate the effectiveness of the proposed method.

Chang Liao, Yun Xiong, Xiangnan Kong, Yangyong Zhu, Shimin Zhao, Shanshan Li
Incremental and Adaptive Topic Detection over Social Media

Social media like Twitter and Facebook are very popular nowadays for sharing users’ interests. However, the existing solutions on topic detection over social media overlook time and location factors, which are quite important and useful. Moreover, social media are frequently updated. Thus, the proposed detection model should handle the dynamic updates. In this paper, we introduce a topic model for topic detection that combines time and location. Our model is equipped with incremental estimation of the parameters of the topic model and adaptive window length according to the correlation of consecutive windows and their density. We have conducted extensive experiments to verify the effectiveness and efficiency of our proposed Incremental Adaptive Time Location (IncrAdapTL) model.

Konstantinos Giannakopoulos, Lei Chen
Incorporating User Grouping into Retweeting Behavior Modeling

The variety among massive users makes it difficult to model their retweeting activities. Obviously, it is not suitable to cover the overall users by a single model. Meanwhile, building one model per user is not practical. To this end, this paper presents a novel solution, of which the principle is to model the retweeting behavior over user groups. Our system, GruBa, consists of three key components for extracting user based features, clustering users into groups, and modeling upon each group. Particularly, we look into the user interest from different perspectives including long-term/short-term interests and explicit/implicit interests. We have evaluated the performance of GruBa using datasets of real-world social networking applications, showcasing its benefits.

Jinhai Zhu, Shuai Ma, Hui Zhang, Chunming Hu, Xiong Li
Maximizing Social Influence for the Awareness Threshold Model

Given a social network G, the Influence Maximization (IM) problem aims to find a seed set$$S \subseteq G$$ of k users. These users are advertised, or activated, through marketing campaigns, with the hope that they will continue to influence others in G (e.g., by spreading messages about a new book). The goal of IM is to find the set S that achieves an optimal advertising effect or expected spread (e.g., make the largest number of users in G know about the book).Existing IM solutions make extensive use of propagation models, such as Linear Threshold (LT) or the Independent Cascade (IC). These models define the activation probability, or the chance that a user successfully gets activated by his/her neighbors in G. Although these models are well-studied, they overlook the fact that a user’s influence on others decreases with time. This can lead to an over-estimation of activation probabilities, as well as the expected spread.To address the drawbacks of LT and IC, we develop a new propagation model, called Awareness Threshold (or AT), which considers the fact that a user’s influence decays with time. We further study the Scheduled Influence Maximization (or SIM), to find out the set S of users to activate, as well as when they should be activated. The SIM problem considers the time-decaying nature of influence based on the AT model. We show that the problem is NP-hard, and we develop three approximation solutions with accuracy guarantees. Extensive experiments on real social networks show that (1) AT yields a more accurate estimation of activation probability; and (2) Solutions to the SIM gives a better expected spread than IM algorithms on the AT model.

Haiqi Sun, Reynold Cheng, Xiaokui Xiao, Jing Yan, Yudian Zheng, Yuqiu Qian
A Time-Aware Path-Based Publish/Subscribe Framework

Nowadays, massive geo-tagged records are generated on the social media. These records are useful when the users intend to plan a trip and are interested in some specific topics along the trip. With such redundant records, a publish/subscribe system has been designed to allow the users who are interested in certain information (i.e., the subscribers) to receive messages from some message generators (i.e., the publishers). Existing efforts on publish/subscribe mainly focus on the textual content or the spatial location of the subscribers, while leaving the consideration of incorporating the subscribers’ moving behaviors and temporal information. Therefore, in this paper, we propose a Time-aware Path-based Publish/Subscribe (TPPS) model, where we propose a filtering-verification framework that contains two kinds of filters, i.e., time-aware location-based filter and time-aware region-based filter, with considering both temporal information and moving behaviors, and filtering unrelated subscriptions for each message. We evaluate the efficiency of our approach on a real-world dataset and the experimental results demonstrate the superiority of our method in both efficiency and effectiveness.

Mengdi Jia, Yan Zhao, Bolong Zheng, Guanfeng Liu, Kai Zheng
Direction Recovery in Undirected Social Networks Based on Community Structure and Popularity

Directionality is a significant property of social networks, which enables us to improve our analytical tasks and have a deeper understanding about social networks. Unfortunately, the potential directionality is hidden in undirected social networks. The previous studies on recovering directionality in undirected social networks mostly focus on the microscopic patterns discovered in the existing directed social networks. In this paper, we attempt to recover the directionality based on the macroscopic community structure. To this end, a variant of the existing modularity model, called behavioural modularity, is designed for discovering community membership of nodes. Assuming that members in the same community have higher behavioural similarity, we introduce the concept of the intra-community popularity, and then estimate directionality of undirected ties based on the community structure and the intra-community popularity. Accordingly, we propose a novel Community and Popularity based Direction Recovering (CPDR) approach to recover the directionality of undirected social networks. Experimental results conducted on three real-world social networks have confirmed the effectiveness of the proposed approach on direction recovery.

Yi-Ming Wen, Chang-Dong Wang, Kun-Yu Lin
Detecting Top-k Active Inter-Community Jumpers in Dynamic Information Networks

Dynamic information networks, containing evolving objects and links, exist in various applications. Mining such networks is more challenging than mining static ones. In this paper, we propose a novel concept of ActiveInter-Community Jumpers (AICJumpers) for dynamic information networks, which are objects changing communities frequently over time. Given communities of several snapshots in a dynamic network, we devise a time-efficiency top-kAICJumpers detection algorithm with a sliding window model. After denoting the jump score which captures how frequently an object changes communities over time, we encode the community changing trajectory of each object as bit vectors and transform jump scores computation into bitwise and, or and xor operations between bit vectors. We further propose a slide-based strategy for space and time saving. Experiments on both real and synthetic datasets show high effectiveness and efficiency of our methods as well as the significance of the AICJumper concept.

Xinrui Wang, Hong Gao, Jinbao Wang, Tianbai Yue, Jianzhong Li

Sequence and Temporal Data Processing

Frontmatter
Distributed In-Memory Analytics for Big Temporal Data

The temporal data is ubiquitous, and massive amount of temporal data is generated nowadays. Management of big temporal data is important yet challenging. Processing big temporal data using a distributed system is a desired choice. However, existing distributed systems/methods either cannot support native queries, or are disk-based solutions, which could not well satisfy the requirements of high throughput and low latency. To alleviate this issue, this paper proposes an In-memory based Two-level Index Solution in Spark (ITISS) for processing big temporal data. The framework of our system is easy to understand and implement, but without loss of efficiency. We conduct extensive experiments to verify the performance of our solution. Experimental results based on both real and synthetic datasets consistently demonstrate that our solution is efficient and competitive.

Bin Yao, Wei Zhang, Zhi-Jie Wang, Zhongpu Chen, Shuo Shang, Kai Zheng, Minyi Guo
Scalable Active Constrained Clustering for Temporal Data

In this paper, we introduce a novel interactive framework to handle both instance-level and temporal smoothness constraints for clustering large temporal data. It consists of a constrained clustering algorithm, called CVQE+, which optimizes the clustering quality, constraint violation and the historical cost between consecutive data snapshots. At the center of our framework is a simple yet effective active learning technique, named Border, for iteratively selecting the most informative pairs of objects to query users about, and updating the clustering with new constraints. Those constraints are then propagated inside each data snapshot and between snapshots via two schemes, called constraint inheritance and constraint propagation, to further enhance the results. Experiments show better or comparable clustering results than state-of-the-art techniques as well as high scalability for large datasets.

Son T. Mai, Sihem Amer-Yahia, Ahlame Douzal Chouakria, Ky T. Nguyen, Anh-Duong Nguyen
Nearest Subspace with Discriminative Regularization for Time Series Classification

For time series classification (TSC) problem, many studies focus on elastic distance measures for comparing time series and complete the task with the help of Nearest Neighbour (NN) classifier. This is mainly due to the fact that the order of variables is a crucial factor for time series. Unlike the NN classifier only considers one training sample, in this paper, we propose an improved Nearest Subspace (NS) classifier to classify new time series. By adding a discriminative regularization item, the improved NS classifier takes full advantage of all training time series of one class. Two kinds of discriminative regularization items are employed in our method. One is directly calculated based on Euclidean distance of time series. For the other, we obtain the regularization items from a lower-dimensional subspace. Two well-known dimensional reduction methods, Generalized Eigenvector Method (GEM) and Local Fisher Discriminant Analysis (LFDA), are employed to complete this task. Furthermore, we combine these improved NS classifiers through ensemble schemes to accommodate different time series datasets. Through extensive experiments on all UCR and UEA datasets, we demonstrate that the proposed method can gain better performance than NN classifiers with different elastic distance measures and other classifiers.

Zhenguo Zhang, Yanlong Wen, Ying Zhang, Xiaojie Yuan
Efficient Approximate Subsequence Matching Using Hybrid Signatures

In this paper, we focus on the problem of approximate subsequence matching, also called the read mapping problem in genomics, which is finding similar subsequences (A subsequence refers to a substring which has consecutive characters) of a query (DNA subsequence) from a reference genome under a user-specified similarity threshold k. Existing methods first extract subsequences from a query to generate signatures, then produce candidate positions using the generated signatures, and finally verify these candidate positions to obtain the true mapping positions. However, there exist two main issues in these works: (1) producing many candidate positions; and (2) generating large numbers of signatures, among which many signatures are redundant. To address the above two issues, we propose a novel filtering technique, called hybrid signatures, which can achieve a better balance between the filtering ability of signatures and the overhead of producing candidate positions. Accordingly, we devise an adaptive algorithm to produce candidate positions using hybrid signatures. Finally, the experimental results on real-world genomic sequences show that our method outperforms state-of-the-art methods in query efficiency.

Tao Qiu, Xiaochun Yang, Bin Wang, Yutong Han, Siyao Wang

Trajectory and Streaming Data

Frontmatter
MDTK: Bandwidth-Saving Framework for Distributed Top-k Similar Trajectory Query

During the past decade, with the popularity of smartphones and other mobile devices, big trajectory data is generated and stored in a distributed way. In this work, we focus on the DTW distance based top-k query over the distributed trajectory data. Processing such a query is challenging due to the limited network bandwidth and the computation overhead. To overcome these challenges, we propose a communication-saving framework MDTK (Multi-resolution based Distributed Top-K). MDTK sends the bounding envelopes of the reference trajectory from coarse to finer-grained resolutions and devises a level-increasing communication strategy to gradually tighten the proposed upper and lower bound. Then, distance bound based pruning strategies are imported to reduce both the computation and communication cost. Besides, we embed techniques including: indexing, early-stopping and cascade pruning, to improve the query efficiency. Extensive experiments on real datasets show that MDTK outperforms the state-of-the-art method.

Zhigang Zhang, Jiali Mao, Cheqing Jin, Aoying Zhou
Modeling Travel Behavior Similarity with Trajectory Embedding

The prevalence of GPS-enabled devices and wireless communication technologies has led to myriads of spatial trajectories describing the movement history of moving objects. While a substantial research effort has been undertaken on the spatio-temporal features of trajectory data, recent years have witnessed the flourish of location-based web applications (i.e., Foursquare, Facebook), enriching the traditional trajectory data by associating locations with activity information, called activity trajectory. These trajectory data contain a wealth of activity information and offer unprecedented opportunities for heightening our understanding about human behaviors. In this paper, we propose a novel framework, called TEH (Trajectory Embedding and Hashing), to mine the similarity among users based on their activity trajectories. Such user similarity is of great importance for individuals to effectively retrieve the information with high relevance. With the time being separated into several slots according to the activity-based temporal distribution, we utilize trajectory embedding technique to mine the sequence property of the activity trajectories by treating them as paragraphs. Then a hash-based method is presented to reduce the dimensions for improving the efficiency of users’ similarity calculation. Finally, extensive experiments on a real activity trajectory dataset demonstrate the effectiveness and efficiency of the proposed methods.

Wenyan Yang, Yan Zhao, Bolong Zheng, Guanfeng Liu, Kai Zheng
MaxBRkNN Queries for Streaming Geo-Data

The problem of maximizing bichromatic reverse k nearest neighbor queries (MaxBR$$k$$NN) has been extensively studied in spatial databases, where given a set of facilities and a set of customers, a MaxBR$$k$$NN query returns a region to establish a new facility p such that p is a $$k$$NN of the maximum number of customers. In the literature, current solutions for MaxBR$$k$$NN queries are predominantly static. However, there are numerous applications for dynamic variations of these queries, including advertisements and resource reallocation based on streaming customer locations via social media check-ins, or GPS location updates from mobile devices. In this paper, we address the problem of continuous MaxBR$$k$$NN queries for streaming objects (customers). As customer data can arrive at a very high rate, we adopt two different models for recency information (sliding windows and micro-batching). We propose an efficient solution where results are incrementally updated by reusing computations from the previous result. We present a safe interval to reduce the number of computations for the new objects, and prune the objects that cannot affect the result. We perform extensive experiments on datasets integrated from four different real-life data sources, and demonstrate the efficiency of our solution by rigorously comparing how different properties of the datasets can affect the performance.

Hui Luo, Farhana M. Choudhury, Zhifeng Bao, J. Shane Culpepper, Bang Zhang
Free-Rider Episode Screening via Dual Partition Model

One of the drawbacks of frequent episode mining is that overwhelmingly many of the discovered patterns are redundant. Free-rider episode, as a typical example, consists of a real pattern doped with some additional noise events. Because of the possible high support of the inside noise events, such free-rider episodes may have abnormally high support that they cannot be filtered by frequency based framework. An effective technique for filtering free-rider episodes is using a partition model to divide an episode into two consecutive subepisodes and comparing the observed support of such episode with its expected support under the assumption that these two subepisodes occur independently. In this paper, we take more complex subepisodes into consideration and develop a novel partition model named EDP for free-rider episode filtering from a given set of episodes. It combines (1) a dual partition strategy which divides an episode to an underlying real pattern and potential noises; (2) a novel definition of the expected support of a free-rider episode based on the proposed partition strategy. We can deem the episode interesting if the observed support is substantially higher than the expected support estimated by our model. The experiments on synthetic and real-world datasets demonstrate EDP can effectively filter free-rider episodes compared with existing state-of-the-arts.

Xiang Ao, Yang Liu, Zhen Huang, Luo Zuo, Qing He
Maximize Spatial Influence of Facility Bundle Considering Reverse k Nearest Neighbors

Consider a two dimensional Euclidean space, let F be a set of points representing facilities and U be a set of points representing users. Spatial influence of a facility is the number of users who have this facility as one of their k nearest facilities. This is because that users normally prefer to go to their nearby facilities and are naturally influenced the most by their nearest facilities. Given a facility bundle of size t, spatial influence of the bundle is the number of distinct users influenced by any one of them. Existing works on facility selection problem find out top-t facilities with the highest spatial influence. However, the literature lacks study on this problem when the t facilities have the highest spatial influence as a bundle. We are the first to study the problem of Maximizing Bundled Reverse k Nearest Neighbors (MB-RkNN), where the spatial influence of a facility bundle of size t is maximized. We prove its NP-hardness, and propose a branch-and-bound best first search algorithm that greedily select the currently best facility until we get t facilities. We introduce the concept of kNN region such that a group of users have their kNN facilities all belong to the same kNN region. This sharing property of kNN region allows us to avoid redundant calculation with dynamic programming technique. We conduct experiments on real data sets and show that our algorithm is orders of magnitudes better than our baseline algorithm both in terms of CPU time and IO cost.

Shenlu Wang, Ying Zhang, Xuemin Lin, Muhammad Aamir Cheema
A Road-Aware Neural Network for Multi-step Vehicle Trajectory Prediction

Multi-step vehicle trajectory prediction has been of great significance for location-based services, e.g., actionable advertising. Prior works focused on adopting pattern-matching techniques or HMM-based models, where the ability of accurate prediction is limited since patterns and features are mostly extracted from historical trajectories. However, these methods may become weak to multi-step trajectory prediction when new patterns appear or the previous trajectory is incomplete.In this paper, we propose a neural network model combining road-aware features to solve multi-step vehicle trajectory prediction task. We introduce a novel way of extracting road-aware features for vehicle trajectory, which consist of intra-road feature and inter-road feature extracted from road networks. The utilization of road-aware features helps to draw the latent patterns more accurately and enhances the prediction performances. Then we leverage LSTM units to build temporal dependencies on previous trajectory path and generate future trajectory. We conducted extensive experiments on two real-world datasets and demonstrated that our model achieved higher prediction accuracy compared with competitive trajectory prediction methods.

Jingze Cui, Xian Zhou, Yanmin Zhu, Yanyan Shen
Secure Data Aggregation with Integrity Verification in Wireless Sensor Networks

In recent years, wireless sensor networks (WSNs) have become a useful tool for environmental monitoring and information collection due to their strong sensory ability. Whereas WSNs utilize wireless communication and is usually deployed in an outdoors environment, which make them vulnerable to be attacked and then lead to the privacy disclosure of the monitored environment. SUM, as one common query among the queries of WSNs, is important to acquire a high-level understanding of the monitored environment and establish the basis for other advanced queries. In this paper, we present a secure hash-based privacy preservation mechanism called HP2M, which not only preserves the privacy of the monitored environment during SUM aggregation query, but also could achieve exact SUM aggregation. Furthermore, an integrity verification mechanism is proposed to verify the integrity of SUM aggregation result, which could alarm the system once data packets transmitted through the networks are modified. One main characteristic of HP2M and the proposed integrity verification mechanism is that they are lightweight with a small bandwidth consumption. Finally, some numerical experiments are performed to demonstrate the efficiency of our proposed approach.

Ying Liu, Hui Peng, Yuncheng Wu, Juru Zeng, Hong Chen, Ke Wang, Weiling Lai, Cuiping Li
A Parallel Spatial Co-location Pattern Mining Approach Based on Ordered Clique Growth

Co-location patterns or subsets of spatial features, whose instances are frequently located together, are particularly valuable for discovering spatial dependencies. Although lots of spatial co-location pattern mining approaches have been proposed, the computational cost is still expensive. In this paper, we propose an iterative mining framework based on MapReduce to mine co-location patterns efficiently from massive spatial data. Our approach searches for co-location patterns in parallel through expanding ordered cliques and there is no candidate set generated. A large number of experimental results on synthetic and real-world datasets show that the proposed method is efficient and scalable for massive spatial data, and is faster than other parallel methods.

Peizhong Yang, Lizhen Wang, Xiaoxuan Wang

RDF and Knowledge Graphs

Frontmatter
Multi-query Optimization in Federated RDF Systems

This paper revisits the classical problem of multiple query optimization in federated RDF systems. We propose a heuristic query rewriting-based approach to share the common computation during evaluation of multiple queries while considering the cost of both query evaluation and data shipment. Furthermore, we propose an efficient method to use the interconnection topology between RDF sources to filter out irrelevant sources and share the common computation of intermediate results joining. The experiments over both real and synthetic RDF datasets show that our techniques are efficient.

Peng Peng, Lei Zou, M. Tamer Özsu, Dongyan Zhao
Distributed Efficient Provenance-Aware Regular Path Queries on Large RDF Graphs

With the proliferation of knowledge graphs, massive RDF graphs have been published on the Web. As an essential type of queries for RDF graphs, Regular Path Queries (RPQs) have been attracting increasing research efforts. However, the existing query processing approaches mainly focus on the standard semantics of RPQs, which cannot provide provenance of the answer sets. We propose dProvRPQ that is a distributed approach to evaluating provenance-aware RPQs over big RDF graphs. Our Pregel-based method employs Glushkov automata to keep track of matching processes of RPQs in parallel. Meanwhile, four optimization strategies are devised, including edge filtering, candidate states, message compression, and message selection, which can reduce the intermediate results of the basic dProvRPQ algorithm dramatically and overcome the counting-paths problem to some extent. The proposed algorithms are verified by extensive experiments on both synthetic and real-world datasets, which show that our approach can efficiently answer the provenance-aware RPQs over large RDF graphs.

Yueqi Xin, Xin Wang, Di Jin, Simiao Wang
Discovering Graph Patterns for Fact Checking in Knowledge Graphs

Given a knowledge graph and a fact (a triple statement), fact checking is to decide whether the fact belongs to the missing part of the graph. This paper proposes a new fact checking method based on supervised graph pattern mining. Our method discovers discriminant graph patterns associated with the training facts. These patterns can then be used to construct classifiers based on either rules or latent features. (1) We propose a class of graph fact checking rules ($$\mathsf {GFCs}$$). A $$\mathsf {GFC}$$ incorporates graph patterns that best distinguish true and false facts of generalized fact statements. We provide quality measures to characterize useful patterns that are both discriminant and diversified. (2) We show that it is feasible to discover $$\mathsf {GFCs}$$ in large graphs, by developing a supervised pattern discovery algorithm. To find useful $$\mathsf {GFCs}$$ as early as possible, it generates graph patterns relevant to training facts, and dynamically selects patterns from a pattern stream with small update cost per pattern. We further construct two $$\mathsf {GFC}$$-based models, which make use of ordered $$\mathsf {GFCs}$$ as predictive rules and latent features from the pattern matches of $$\mathsf {GFCs}$$, respectively. Using real-world knowledge bases, we experimentally verify the efficiency and the effectiveness of $$\mathsf {GFC}$$-based techniques for fact checking.

Peng Lin, Qi Song, Jialiang Shen, Yinghui Wu
KAT: Keywords-to-SPARQL Translation Over RDF Graphs

In this paper, we focus on the problem of translating keywords into SPARQL query effectively and propose a novel approach called KAT. KAT takes into account the context of each input keyword and reduces the ambiguity of input keywords by building a keyword index which contains the class information of keywords in RDF data. To explore RDF data graph efficiently, KAT builds a graph index as well. Moreover, a context aware ranking method is proposed to find the most relevant SPARQL query. Extensive experiments are conducted to show that KAT is both effective and efficient.

Yanlong Wen, Yudong Jin, Xiaojie Yuan

Text and Data Mining

Frontmatter
A Scalable Framework for Stylometric Analysis of Multi-author Documents

Stylometry is a statistical technique used to analyze the variations in the author’s writing styles and is typically applied to authorship attribution problems. In this investigation, we apply stylometry to authorship identification of multi-author documents (AIMD) task. We propose an AIMD technique called Co-Authorship Graph (CAG) which can be used to collaboratively attribute different portions of documents to different authors belonging to the same community. Based on CAG, we propose a novel AIMD solution which (i) significantly outperforms the existing state-of-the-art solution; (ii) can effectively handle a larger number of co-authors; and (iii) is capable of handling the case when some of the listed co-authors have not contributed to the document as a writer. We conducted an extensive experimental study to compare the proposed solution and the best existing AIMD method using real and synthetic datasets. We show that the proposed solution significantly outperforms existing state-of-the-art method.

Raheem Sarwar, Chenyun Yu, Sarana Nutanong, Norawit Urailertprasert, Nattapol Vannaboot, Thanawin Rakthanmanon
Is a Common Phrase an Entity Mention or Not? Dual Representations for Domain-Specific Named Entity Recognition

Named Entity Recognition (NER) for specific domains is critical for building and managing domain-specific knowledge bases, but conventional NER methods cannot be applied to specific domains effectively. We found that one of reasons is the problem of common-phrase-like entity mention prevalent in many domains. That is, many common phrases frequently occurring in general corpora may or may not be treated as named entities in specific domains. Therefore, determining whether a common phrase is an entity mention or not is a challenge. To address this issue, we present a novel BLSTM based NER model tailored for specific domains by learning dual representations for each word. It learns not only general domain knowledge derived from an external large scale general corpus via a word embedding model, but also the specific domain knowledge by training a stacked deep neural network (SDNN) integrating the results of a low-cost pre-entity-linking process. Extensive experiments on a real-world dataset of movie comments demonstrate the superiority of our model over existing state-of-the-art methods.

Jiangtao Zhang, Juanzi Li, Xiao-Li Li, Yixin Cao, Lei Hou, Shuai Wang
Recognizing Textual Entailment with Attentive Reading and Writing Operations

Inferencing the entailment relations between natural language sentence pairs is fundamental to artificial intelligence. Recently, there is a rising interest in modeling the task with neural attentive models. However, those existing models have a major limitation to keep track of the attention history because usually only one single vector is utilized to memorize the past attention information. We argue its importance based on our observation that the potential alignment clues are not always centralized. Instead, they may diverge substantially, which could cause the problem of long-range dependency. In this paper, we propose to facilitate the conventional attentive reading operations with two sophisticated writing operations - forget and update. Instead of utilizing a single vector that accommodates the attention history, we write the past attention information directly into the sentence representations. Therefore, higher memory capacity of attention history could be achieved. Experiments on Stanford Natural Language Inference corpus (SNLI) demonstrate the superior efficacy of our proposed architecture.

Liang Liu, Huan Huo, Xiufeng Liu, Vasile Palade, Dunlu Peng, Qingkui Chen
Interpreting Fine-Grained Categories from Natural Language Queries of Entity Search

The fine-grained target categories/types are very critical for improving the performance of entity search because they can be used for retrieving relevant entities by filtering irrelevant entities with a high confidence. However, most solutions of entity search face an urgent problem, i.e., the lack of fine-grained target categories of queries, which are hard for users to explicitly specify. In this paper, we try to interpret fine-grained categories from natural language based queries of entity search. We observe that entity search queries often contain terms specifying the contexts of the desired entities, as well as a topic of the desired entities. Accordingly, we propose to interpret fine-grained categories of entity search queries from the context perspective and the topic perspective. Therefore, we propose an approach by formalizing both context-based category model and topic-based category model, to tackle the category interpreting task. Extensive experiments on two widely-used test sets: INEX-XER 2009 and SemSearch-LS, indicate significant performance improvement achieved by our proposed method over the state-of-the-art baselines.

Denghao Ma, Yueguo Chen, Xiaoyong Du, Yuanzhe Hao
Improving Short Text Modeling by Two-Level Attention Networks for Sentiment Classification

Understanding short texts is crucial to many applications, but it has always been challenging, due to the sparsity and ambiguity of information in short texts. In addition, sentiments expressed in those user-generated short texts are often implicit and context dependent. To address this, we propose a novel model based on two-level attention networks to identify the sentiment of short text. Our model first adopts attention mechanism to capture both local features and long-distance dependent features simultaneously, so that it is more robust against irrelevant information. Then the attention-based features are non-linearly combined with a bidirectional recurrent attention network, which enhances the expressive power of our model and automatically captures more relevant feature combinations. We evaluate the performance of our model on MR, SST-1 and SST-2 datasets. The experimental results show that our model can outperform the previous methods.

Yulong Li, Yi Cai, Ho-fung Leung, Qing Li
Efficient and Scalable Mining of Frequent Subgraphs Using Distributed Graph Processing Systems

Mining frequent subgraphs in large scale graph data sets helps reveal underlying knowledge. Since the mining approaches in centralized systems are often bottlenecked on calculation capacity, many parallelized solutions based on the MapReduce framework are proposed to scale out the mining process, which usually extracts frequent subgraphs in an iterative way. Nonetheless, the efficiency and scalability of these MapReduce based approaches are still bounded by the communication cost for passing the intermediate results and the unbalanced workload after a few iterations. In this paper, we propose an efficient and scalable framework for frequent subgraph mining by using distributed graph processing systems. It adopts a message-passing-free scheme among workers to reduce the communication cost, and utilizes a task scheduler to dynamically balance the workload. Experimental results on both synthetic and real-world data sets verify the efficacy of our proposed framework.

Tongtong Wang, Hao Huang, Wei Lu, Zhe Peng, Xiaoyong Du
Efficient Infrequent Itemset Mining Using Depth-First and Top-Down Lattice Traversal

Frequent itemset mining is substantially studied in the past decades. In varies practical applications, frequent patterns are obvious and expected, while really interesting information might hide in obscure rarity. However, existing rare pattern mining approaches are time and memory consuming due to their apriori based candidate generation step. In this paper, we propose an efficient rare pattern extraction algorithm, which is capable of extracting the complete set of rare patterns using a top-down traversal strategy. A negative item tree is employed to accelerate the mining process. Pattern growth paradigm is used and therefore avoids expensive candidate generation.

Yifeng Lu, Florian Richter, Thomas Seidl
Online Subset Topic Modeling for Interactive Documents Exploration

Data exploration over text databases is an important problem. In an exploration scenario, users would find something useful without previously knowing what exactly they are looking for, until the time they identify them. Therefore, labor-intensive efforts are often required, since users have to review the overview (or detail) results of ad-hoc queries and adjust the queries (e.g., zoom or filter) continuously. Probabilistic topic models are often adopted as a solution to provide the overview for a given text collection, since it could discover the underlying thematic structures of unstructured text data. However, training a topic model for a selected document collection is time consuming. Moreover, frequent model retraining would be introduced by continuous query-adjusting, which leads to large amount of time wasting and therefore is unsuitable for online exploration. To remedy this problem, this paper presents STMS, an algorithm for constructing topic structures in document subsets efficiently. STMS accelerates the process of subset modeling by leveraging global precomputation and applying an efficient sampling-based inference algorithm. The experiments on real world datasets show that STMS achieves orders of magnitude speed-ups than standard topic model, while remaining comparable in terms of modeling quality.

Linwei Li, Yaobo Wu, Yixiong Ke, Chaoying Liu, Yinan Jing, Zhenying He, Xiaoyang Sean Wang
Main Point Generator: Summarizing with a Focus

Text summarization is attracting more and more attention while deep neural network has had many successful application in NLP. One problem of such models is its inability to focus on the essentials of documents, thus generating summaries that may not be important, especially during multi-sentence summarization. In this paper, we propose Main Pointer Generator (MPG) to address the problem, where at each decoder step the whole document is taken into consideration when calculating the probability of next generated token. We experiment with CNN/Daily news corpus and results show that summaries our MPG generated follow the main theme while outperforming the original pointer generator network by about 0.5 ROUGE point.

Tong Lee Chung, Bin Xu, Yongbin Liu, Chunping Ouyang
Backmatter
Metadaten
Titel
Database Systems for Advanced Applications
herausgegeben von
Jian Pei
Yannis Manolopoulos
Shazia Sadiq
Jianxin Li
Copyright-Jahr
2018
Electronic ISBN
978-3-319-91452-7
Print ISBN
978-3-319-91451-0
DOI
https://doi.org/10.1007/978-3-319-91452-7