About this Book

This two-volume set, LNCS 10177 and 10178, constitutes the refereed proceedings of the 22nd International Conference on Database Systems for Advanced Applications, DASFAA 2017, held in Suzhou, China, in March 2017. The 73 full papers, 9 industry papers, 4 demo papers and 3 tutorials were carefully selected from a total of 300 submissions. The papers are organized around the following topics: semantic web and knowledge management; indexing and distributed systems; network embedding; trajectory and time series data processing; data mining; query processing and optimization; text mining; recommendation; security, privacy, sensor and cloud; social network analytics; map matching and spatial keywords; query processing and optimization; search and information retrieval; string and sequence processing; stream data processing; graph and network data processing; spatial databases; real time data processing; big data; social networks and graphs.



Map Matching and Spatial Keywords


HIMM: An HMM-Based Interactive Map-Matching System

Due to the inaccuracy of GPS devices, the location error of raw GPS points can be up to several hundred meters. Many applications using GPS-based vehicle location data require map-matching to pre-process GPS points by aligning them to a road network. However, existing map-matching algorithms can be limited in accuracy by various factors, including low sampling rates, abnormal GPS points, and dense road networks. In this paper, we present the design and implementation of HIMM, an HMM-based Interactive Map-Matching system that produces accurate map-matching results through human interaction. The main idea is to involve human annotations in the matching of a few carefully selected error-prone points and to let the system automatically adjust the matching of the remaining points. We use both real-world and synthetic datasets to evaluate the system. The results show that HIMM can significantly reduce human annotation costs compared to the baseline methods.

Xibo Zhou, Ye Ding, Haoyu Tan, Qiong Luo, Lionel M. Ni
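HMM-based map matching of the kind HIMM builds on is conventionally decoded with the Viterbi algorithm: candidate road segments are the hidden states, GPS points are the observations, and the most probable segment sequence is recovered. A minimal sketch in Python, where `emission` and `transition` are placeholder probability functions rather than the paper's actual models:

```python
import math

def viterbi_map_match(points, candidates, emission, transition):
    """Most probable road segment per GPS point (standard Viterbi).

    candidates: dict point -> list of candidate road segments
    emission(point, seg): probability of observing the point on seg
    transition(a, b): probability of moving from segment a to segment b
    """
    # best log-probability of any path ending at each candidate of point 0
    prev = {s: math.log(emission(points[0], s)) for s in candidates[points[0]]}
    back = []
    for p in points[1:]:
        cur, ptr = {}, {}
        for s in candidates[p]:
            best_t, best = max(
                ((t, prev[t] + math.log(transition(t, s))) for t in prev),
                key=lambda x: x[1])
            cur[s] = best + math.log(emission(p, s))
            ptr[s] = best_t
        prev, back = cur, back + [ptr]
    # backtrack from the best final segment
    path = [max(prev, key=prev.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

An interactive system like HIMM would re-run such a decoding pass after each human annotation pins an error-prone point to a segment.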

HyMU: A Hybrid Map Updating Framework

An accurate digital map plays an important role in mobile navigation. Due to ineffective updating mechanisms, existing map updating methods cannot guarantee the completeness and validity of the map; their common problems are heavy computation and low precision. More importantly, they scarcely consider inferring new roads from sparse unmatched trajectories. In this paper, we first address the issue of finding new roads in sparse trajectory areas. Based on the sliding window model, we propose a two-phase hybrid framework, called HyMU, that updates the digital map with inferred roads and takes full advantage of both line-based and point-based strategies. By inferring road candidates over consecutive time windows and merging the candidates to form missing roads, HyMU can discover new roads even in sparse trajectory areas. As a result, HyMU achieves high recall and precision on trajectory data of varying density and sampling rate. Experimental results on real data sets show that our proposal is both effective and efficient compared to other congeneric approaches.

Tao Wang, Jiali Mao, Cheqing Jin

Multi-objective Spatial Keyword Query with Semantics

Multi-objective spatial keyword queries find broad applications in map services nowadays. Such a query aims to find a set of objects that together cover all query objectives and are reasonably distributed in space. However, existing approaches mainly take the coverage of query keywords into account, while largely ignoring the semantics behind the textual data. This prevents them from returning rational results that are synonymous with the query keywords but morphologically different. To address this problem, this paper studies the multi-objective spatial keyword query with semantics, which aims to return the object set that is optimal with respect to both spatial proximity and semantic relevance. We propose an indexing structure called the LIR-tree, as well as two advanced query processing approaches for efficient query processing. An empirical study on a real dataset demonstrates the effectiveness and efficiency of our proposed algorithms.

Jing Chen, Jiajie Xu, Chengfei Liu, Zhixu Li, An Liu, Zhiming Ding

Query Processing and Optimization (II)


RSkycube: Efficient Skycube Computation by Reusing Principle

Over the past years, the skyline query has attracted wide attention in the database community. To meet users' different preferences, skycube computation has been proposed to compute skylines, or cuboids, on all possible non-empty dimension subsets. The key issue in computing a skycube is how to share computation among multiple related cuboids, which can be classified into sharing strict space dominance and sharing space incomparability. However, the state-of-the-art algorithm only leverages sharing strict space dominance. This paper aims to design a more efficient skycube algorithm that shares computation among multiple related cuboids. We first propose a set of rules named identical partitioning (IP) for constructing a novel structure, VSkyTree. Moreover, we present the reusing principle, which exploits both sharing strict space dominance and sharing space incomparability by reusing the VSkyTree of parent cuboids to compute child cuboids. Then, in a top-down fashion, we design an efficient skycube computation algorithm, RSkycube, based on the reusing principle. Our experimental results indicate that RSkycube significantly outperforms the state-of-the-art skycube computation algorithm on both synthetic and real datasets.

Kaiqi Zhang, Hong Gao, Xixian Han, Donghua Yang, Zhipeng Cai, Jianzhong Li
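For reference, the skycube itself (one skyline cuboid per non-empty dimension subset, here minimizing every dimension) can be computed naively as below; this baseline has none of the computation sharing that RSkycube contributes:

```python
from itertools import combinations

def dominates(a, b, dims):
    # a dominates b on dims: no worse on all of them, strictly better on one
    return all(a[d] <= b[d] for d in dims) and any(a[d] < b[d] for d in dims)

def skyline(points, dims):
    # points not dominated by any other point on the chosen dimensions
    return [p for p in points
            if not any(dominates(q, p, dims) for q in points)]

def skycube(points, n_dims):
    # one skyline (cuboid) per non-empty dimension subset: 2^n - 1 cuboids
    cube = {}
    for k in range(1, n_dims + 1):
        for dims in combinations(range(n_dims), k):
            cube[dims] = skyline(points, dims)
    return cube
```

The exponential number of cuboids is exactly why sharing work between parent and child cuboids matters.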

Similarity Search Combining Query Relaxation and Diversification

We study the similarity search problem, which aims to find similar query results from a given dataset for a query string. To balance the number and quality of results, we combine query result diversification with query relaxation. Relaxation guarantees the number of query results, returning more elements relevant to the query if the results are too few, while diversification reduces the similarity among the returned results. By trading off similarity against diversity, we improve the user experience. To achieve this goal, we define a novel objective function combining similarity and diversity, and propose three algorithms for it. Algorithms genGreedy and genCluster perform relaxation first and then select part of the candidates to diversify. The third algorithm, CB2S, splits the dataset into smaller pieces using k-means clustering and processes queries over several small sets to retrieve more diverse results. The balance of similarity and diversity is controlled by a threshold, which has a default value and can be adjusted according to users' preferences. The performance and efficiency of our system are demonstrated through extensive experiments on various datasets.

Ruoxi Shi, Hongzhi Wang, Tao Wang, Yutai Hou, Yiwen Tang, Jianzhong Li, Hong Gao

An Unsupervised Approach for Low-Quality Answer Detection in Community Question-Answering

Community Question Answering (CQA) sites such as Yahoo! Answers provide rich knowledge for people to access. However, the quality of answers posted to CQA sites varies greatly, from precise and useful to irrelevant and useless. Hence, automatic detection of low-quality answers helps site managers efficiently organize the accumulated knowledge and provide high-quality content to users. In this paper, we propose a novel unsupervised approach to detect low-quality answers at a CQA site. The key ideas in our model are: (1) most answers are normal; (2) low-quality answers can be found by checking their “peer” answers under the same question; (3) different questions have different answer quality criteria. Based on these ideas, we devise an unsupervised learning algorithm that assigns soft labels to answers as quality scores. Experiments show that our model significantly outperforms other state-of-the-art models on answer quality prediction.

Haocheng Wu, Zuohui Tian, Wei Wu, Enhong Chen

Approximate OLAP on Sustained Data Streams

Many organizations require detailed and real-time analysis of their business data for effective decision making. OLAP is one of the commonly used methods for the analysis of static data and has been studied by many researchers. OLAP is also applicable to data streams; however, real-time analysis of fast, evolving data streams is not possible unless the data to be analysed reside in memory. Keeping in view the limited size and volatile nature of memory, we propose a novel architecture, AOLAP, which, in addition to storing raw data streams on secondary storage, maintains data stream summaries in a compact memory-based data structure. This work proposes the use of piece-wise linear approximation (PLA) for storing such data summaries corresponding to each materialized node in the OLAP cube. Since the PLA is a compact data structure, it can store summaries of long data streams in comparatively small space and can give approximate answers to OLAP queries. OLAP analysts query different nodes in the OLAP cube interactively. To support such analysis with the PLA-based data cube without unnecessary amplification of the querying errors inherent in the PLA structure, many nodes should be materialized. However, even though each PLA structure is compact, it is impossible to materialize all the nodes in the OLAP cube. Thus, we need to select the set of materialized nodes that gives query results with the minimum approximation errors within the given memory bound. This problem is NP-hard, so this work also proposes an optimization scheme to support this selection. A detailed experimental evaluation demonstrates the effectiveness of the PLA structure and the optimization scheme.

Salman Ahmed Shaikh, Hiroyuki Kitagawa
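Piece-wise linear approximation replaces a long stretch of a stream by a few line segments while bounding the per-point error. A greedy sketch (the segmentation strategy here is illustrative and not necessarily the construction used by AOLAP):

```python
def pla_compress(values, eps):
    """Greedy piece-wise linear approximation of a series.

    Returns segments (start_idx, end_idx, start_value, end_value); within
    each segment the series is approximated by the line joining its
    endpoints, with maximum absolute error <= eps.
    """
    segments, i, n = [], 0, len(values)
    while i < n - 1:
        j = i + 1
        # grow the segment while the interpolation error stays within eps
        while j + 1 < n:
            k = j + 1
            ok = all(
                abs(values[i] + (values[k] - values[i]) * (t - i) / (k - i)
                    - values[t]) <= eps
                for t in range(i + 1, k))
            if not ok:
                break
            j = k
        segments.append((i, j, values[i], values[j]))
        i = j
    return segments

def pla_query(segments, t):
    # approximate value at time t from the compressed representation
    for i, j, vi, vj in segments:
        if i <= t <= j:
            return vi + (vj - vi) * (t - i) / (j - i)
    raise ValueError("t out of range")
```

The compressed form stores two endpoints per segment, which is what keeps long stream summaries small enough to materialize many cube nodes in memory.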

Search and Information Retrieval


Hierarchical Semantic Representations of Online News Comments for Emotion Tagging Using Multiple Information Sources

With the development of online news services, users can now actively respond to online news by expressing subjective emotions, which can help us understand the predilections and opinions of an individual user and help news publishers provide more relevant services. Neural network methods have achieved promising results, but still face challenges in emotion tagging. First, these methods regard the whole document as a stream or bag of words and cannot encode the intrinsic relations between sentences, so they fail to properly express the semantic meaning of a document whose sentences have logical relations. Second, these methods use only the semantics of the document itself, ignoring the accompanying information sources, which can significantly influence the interpretation of the sentiment contained in documents. Therefore, this paper presents a hierarchical semantic representation model of news comments using multiple information sources, called the Hierarchical Semantic Neural Network (HSNN). In particular, we begin with a novel neural network model that learns document representations in a bottom-up way, capturing not only the semantics within each sentence but also the semantic and logical relations between sentences. On top of this, we tackle the task of predicting emotions for online news comments by exploiting multiple information sources, including the content of comments, the content of news articles, and user-generated emotion votes. A series of experiments on real-world datasets demonstrates the effectiveness of our proposed approach.

Chao Wang, Ying Zhang, Wei Jie, Christian Sauer, Xiaojie Yuan

Towards a Query-Less News Search Framework on Twitter

Twitter enables users to browse and access the latest news-related content. However, given a user's interest in a particular news-related tweet, searching for related content may be a tedious process. Formulating an effective search query is not a trivial task, and due to the small size of smartphone screens, users often prefer click-based operations to typing when retrieving related content. To address these issues, we introduce a new paradigm for news-related Twitter search called Search by Tweet (SbT). In this paradigm, a user submits a particular tweet, which triggers a search task to retrieve further related tweets. In this paper, we formalize the SbT problem and propose an effective and efficient framework implementing such functionality. At the core, we model the public Twitter stream as a dynamic graph-of-words, reflecting the importance of both words and word correlations. Given an input tweet, our framework utilizes the graph model to generate an implicit query. Our techniques demonstrate high efficiency and effectiveness as evaluated using a large-scale Twitter dataset and a user study.

Xiaotian Hao, Ji Cheng, Jan Vosecky, Wilfred Ng
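The graph-of-words idea can be illustrated with a toy version: build word and co-occurrence counts from a tweet stream, then rank the words of an input tweet by their global weight to form an implicit query. The function names below are hypothetical, and the paper's model is dynamic and considerably richer:

```python
from collections import Counter
from itertools import combinations

def build_graph_of_words(tweets):
    # node weight = word frequency; edge weight = within-tweet co-occurrence
    nodes, edges = Counter(), Counter()
    for tweet in tweets:
        words = set(tweet.lower().split())
        nodes.update(words)
        edges.update(combinations(sorted(words), 2))
    return nodes, edges

def implicit_query(tweet, nodes, k=3):
    # keep the k words of the input tweet that carry the most global weight
    words = set(tweet.lower().split())
    ranked = sorted(words, key=lambda w: (-nodes[w], w))
    return ranked[:k]
```

A real system would decay the counters over time so that the graph tracks the current news cycle.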

Semantic Definition Ranking

Question answering has attracted much attention from academia and industry. Search engines already try to provide direct answers for question-like queries, among which "What" queries form one of the biggest segments. Since results excerpted from Wikipedia often have a coverage problem, some models rank definitions extracted from web documents, including Ranking SVM and the Maximum Entropy Context Model. But they adopt only syntactic features and cannot understand definitions semantically. In this paper, we propose a language model incorporating knowledge bases to learn the regularities behind good definitions. It combines a recurrent neural network based language model with a process of mapping words to context-appropriate concepts. Using the knowledge learnt from the neural networks, we define two semantic features to evaluate definitions, one of which is confirmed to be effective by experiments. Results show that our model improves precision considerably. Our approach has been applied in production.

Zehui Hao, Zhongyuan Wang, Xiaofeng Meng, Jun Yan, Qiuyue Wang

An Improved Approach for Long Tail Advertising in Sponsored Search

Search queries follow a long-tail distribution, which makes ad-space management for sponsored search harder. During keyword auctions, advertisers also tend to target head query keywords, creating an imbalance in demand between head and tail keywords and leading to under-utilization of the ad space of tail query keywords. In the literature, an effort has been made to improve sponsored search by extracting the knowledge of coverage patterns among the keywords of transactional query logs. In this paper, we propose an improved approach that allows advertisers to bid on high-level concepts instead of keywords, so that tail query keywords are utilized by allocating a mix of head and tail keywords related to each concept. The proposed approach uses the knowledge of level-wise coverage patterns to allocate incoming search queries to advertisers efficiently while utilizing the long tail. Experimental results on the AOL search query data set show improvements in ad space utilization and in the reach of advertisers.

Amar Budhiraja, P. Krishna Reddy

String and Sequence Processing


Locating Longest Common Subsequences with Limited Penalty

Locating longest common subsequences is a classic and important problem. The original formulation, which stretches an alignment between a query and a database sequence, finds all alignments corresponding to the maximal length of common subsequences. However, it produces many results, some of which are meaningless in practical applications and give rise to considerable time overhead. In this paper, we first define longest common subsequences with limited penalty, which asks for the longest common subsequences whose penalty values are not larger than a threshold $\tau$. This helps us find answers with good locality. Focusing on the efficiency of this problem, we propose a basic approach for finding longest common subsequences with limited penalty. We further analyze the features of such subsequences and, based on them, propose a filter-and-refine approach to reduce the number of candidates. We also adopt a suffix array to efficiently generate common substrings, which aids the computation. Experimental results on three real data sets show the effectiveness and efficiency of our algorithms.

Bin Wang, Xiaochun Yang, Jinxu Li
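The underlying computation is the classic longest-common-subsequence dynamic program, on top of which the penalty-bounded variant and the filter-refine machinery are built. A standard reference implementation:

```python
def lcs_length(s, t):
    # classic O(|s| * |t|) dynamic program for the LCS length
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]
```

The quadratic cost of this table is what motivates pruning candidates before running the exact computation.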

Top-k String Auto-Completion with Synonyms

Auto-completion is one of the most prominent features of modern information systems. Existing solutions provide suggestions based on the beginning of the currently input character sequence (i.e., the prefix). However, in many real applications, an entity often has synonyms or abbreviations; for example, "DBMS" is an abbreviation of "Database Management Systems". In this paper, we study a novel type of auto-completion that uses synonyms and abbreviations. We propose three trie-based algorithms to solve top-k auto-completion with synonyms, each with different space and time complexity trade-offs. Experiments on large-scale datasets show that it is possible to support effective and efficient synonym-based retrieval of completions over a million strings with thousands of synonym rules at about a microsecond per completion, with small space overhead (i.e. 160–200 bytes per string). The implementation of the algorithms is publicly available at

Pengfei Xu, Jiaheng Lu
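A trie-based top-k completion with synonym rules can be sketched as follows. The score handling and the synonym expansion strategy are illustrative only; the paper's three algorithms make different space/time trade-offs:

```python
import heapq

class Trie:
    def __init__(self):
        self.children = {}
        self.entries = []  # (score, string) for every string below this node

    def insert(self, s, score):
        node = self
        for ch in s:
            node = node.children.setdefault(ch, Trie())
            # storing entries at every node trades memory for lookup speed
            node.entries.append((score, s))

    def top_k(self, prefix, k):
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        return heapq.nlargest(k, node.entries)

def top_k_with_synonyms(trie, prefix, synonyms, k):
    # query the raw prefix plus every applicable synonym rewriting
    cands = set(trie.top_k(prefix, k))
    for abbr, full in synonyms.items():
        if prefix.startswith(abbr):
            cands |= set(trie.top_k(full + prefix[len(abbr):], k))
    return [s for _, s in heapq.nlargest(k, cands)]
```

Rewriting the prefix through a rule like "dbms" -> "database management systems" is what lets an abbreviation reach completions the raw prefix would miss.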

Efficient Regular Expression Matching on Compressed Strings

Existing methods for regular expression matching on LZ78-compressed strings do not perform efficiently. Moreover, LZ78 compression has shortcomings, such as a higher compression ratio and slower decompression speed than LZ77 (a predecessor of LZ78). In this paper, we study regular expression matching on LZ77-compressed strings. To address this problem, we propose an efficient algorithm, RELZ, that prunes candidates using the positive factors (a prefix and a suffix) and negative factors (substrings that cannot appear in an answer) of the regular expression. To quickly locate these two kinds of factors on the compressed string without decompression, we design a variant of the suffix trie index, called SSLZ. In addition, we construct bitmaps for the factors of the regular expression to detect potential regions, and propose block filtering to further reduce candidates. Finally, we conduct a comprehensive performance evaluation using five real datasets to validate our ideas and the proposed algorithms. The experimental results show that our RELZ algorithm significantly outperforms the existing algorithms.

Yutong Han, Bin Wang, Xiaochun Yang, Huaijie Zhu

Mining Top-k Distinguishing Temporal Sequential Patterns from Event Sequences

Sequential patterns are useful in many areas, such as biomedical sequence analysis, web browsing log analysis, and historical banking transaction log analysis. Distinguishing sequential patterns can help characterize the differences between two or more sets/classes of sequences, and can be used to understand those sequence sets/classes and to identify informative features for classification. However, previous studies have not considered how to mine distinguishing sequential patterns from event sequences, where each event in a sequence has an associated timestamp. To fill that gap, this paper considers the mining of distinguishing temporal event patterns (DTEPs) from event sequences. After discussing the challenges of DTEP mining, we present DTEP-Miner, a mining method with various pruning techniques, for mining DTEPs with top-k contrast scores. Our empirical study using both real and synthetic data demonstrates that DTEP-Miner is effective and efficient.

Lei Duan, Li Yan, Guozhu Dong, Jyrki Nummenmaa, Hao Yang
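The notion of a distinguishing (contrast) pattern can be illustrated with a plain support-difference score over untimed sequences; DTEPs additionally constrain the time gaps between matched events, which this sketch omits:

```python
def contains(seq, pattern):
    # is `pattern` a subsequence of `seq`? (the `in` checks consume the
    # iterator, so matches must occur in order)
    it = iter(seq)
    return all(ch in it for ch in pattern)

def support(pattern, sequences):
    # fraction of sequences containing the pattern as a subsequence
    return sum(contains(s, pattern) for s in sequences) / len(sequences)

def contrast_score(pattern, pos, neg):
    # high when the pattern is frequent in `pos` and rare in `neg`
    return support(pattern, pos) - support(pattern, neg)
```

A top-k miner would enumerate candidate patterns and keep the k with the highest such scores, pruning candidates whose score can no longer reach the current threshold.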

Stream Data Processing


Soft Quorums: A High Availability Solution for Service Oriented Stream Systems

Large-scale information gathering is becoming more and more common with the increasing popularity of smartphones, GPS, social networks, and sensor networks. Services based on such real-time data are the logical next step. Service Oriented Stream Systems (SOSS) focus on one-time ad hoc queries as opposed to continuous queries. High availability is crucial for these services. However, data replication has inherent costs, which are particularly burdensome for high-rate, often overloaded, SOSS. To provide high availability and to cope with system overload, we propose a mechanism called soft quorums. Soft quorums provide high availability of data, a tradeoff between query result accuracy and performance, and adaptation to dynamic data/query stream rates. Finally, we conduct a comprehensive experimental study using real-world and synthetic datasets.

Chunyao Song, Tingjian Ge, Cindy Chen, Jie Wang

StroMAX: Partitioning-Based Scheduler for Real-Time Stream Processing System

With the increasing availability and scale of data from Web 2.0, the ability to analyze huge amounts of data efficiently and in a timely manner is important for industry success. A number of real-time stream processing platforms have been developed, such as Storm, S4, and Flume. A fundamental problem for these large-scale decentralized stream processing systems is how to deploy the workload to each node so as to fully utilize the available resources and optimize overall system performance. In this paper, we present StroMAX, a graph-partitioning-based approach to workload scheduling for real-time stream processing systems. StroMAX uses two advanced generic schedulers to improve the performance of stream processing systems by reducing the inter-node communication cost while keeping the workload of nodes below a certain computational load threshold. The first scheduler analyzes the workload structure when a job is committed and uses the graph-partitioning result to determine the deployment of tasks. The second scheduler analyzes the statistical information of physical nodes and dynamically reassigns tasks during runtime to improve overall performance. Moreover, StroMAX can easily be deployed on many other state-of-the-art real-time stream processing systems. We implemented StroMAX on Storm, a representative real-time stream processing system. Extensive experiments conducted with real-world workloads and datasets demonstrate the superiority of our approaches over existing solutions.

Jiawei Jiang, Zhipeng Zhang, Bin Cui, Yunhai Tong, Ning Xu

Partition-Based Clustering with Sliding Windows for Data Streams

Data stream clustering with sliding windows generates clusters for every window movement. Because repeated clustering over each changed window is highly inefficient in terms of memory and computation time, a clustering algorithm should consider only the tuples inserted into and deleted from the window. In this paper, we address this problem with a sliding window aggregation technique and a cluster modification strategy. We propose a novel data structure for the construction and maintenance of 2-level synopses, which enables efficient synopsis updates and supports precise sliding window operations. We also suggest a modification strategy that decides whether to append new synopses to pre-existing clusters or to re-cluster all synopses, based on the difference between the probability distributions of the original and updated clusters. Experimental results show that the proposed method outperforms state-of-the-art methods.

Jonghem Youn, Jihun Choi, Junho Shim, Sang-goo Lee

CBP: A New Parallelization Paradigm for Massively Distributed Stream Processing

Resource efficiency is essential for distributed stream processing engines (DSPEs), in which a streaming application is modeled as an operator graph where each operator is parallelized into a number of instances to meet low-latency and high-throughput requirements. The major objectives in optimizing resource efficiency in DSPEs are minimizing the communication cost by collocating tasks that transfer a lot of data between each other, and dynamically configuring the system according to load variations at runtime. In the current literature, most proposals handle these two optimizations separately, and a shallow integration of the techniques, such as performing the two optimizations one after another, results in a suboptimal solution. In this paper, we present component-based parallelization (CBP), a new paradigm for optimizing the resource efficiency of DSPEs that provides a framework for a deeper integration of the two optimizations. In the CBP paradigm, the operators are encapsulated into a set of non-overlapping components, within which operators are parallelized consistently, i.e., using the same partitioning key, so that intra-component communication is eliminated. According to workload changes, each component can be adaptively partitioned into multiple instances, each of which is deployed on a computing node. We build a cost model that captures both the communication cost and the adaptation cost of a CBP plan, and then propose several optimization algorithms. We implement the CBP scheme and the optimization algorithms on top of Apache Storm, and verify their efficiency in an extensive experimental study.

Qingsong Guo, Yongluan Zhou

Social Network Analytics (II)


Measuring and Maximizing Influence via Random Walk in Social Activity Networks

With the popularity of online social networks (OSNs), finding the set of most influential users (or nodes) that triggers the largest influence cascade is of significance. For example, companies may take advantage of the "word-of-mouth" effect to trigger a large cascade of purchases by offering free samples or discounts to the most influential users. This task is usually modeled as an influence maximization problem, and it has been widely studied in the past decade. However, considering that users in OSNs may participate in various kinds of online activities, e.g., giving ratings to products or joining discussion groups, influence diffusion through online activities becomes even more significant. In this paper, we study the impact of online activities by formulating the influence maximization problem for social-activity networks (SANs) containing both users and online activities. To address the computational challenge, we define an influence centrality via random walks to measure influence, and use a Monte Carlo framework to efficiently estimate this centrality in SANs. Furthermore, we develop a greedy algorithm with two novel optimization techniques to find the most influential users. Extensive experiments with real-world datasets show that our approach is more efficient than the state-of-the-art algorithm IMM [17] when it needs to handle a large number of online activities.

Pengpeng Zhao, Yongkun Li, Hong Xie, Zhiyong Wu, Yinlong Xu, John C. S. Lui
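Monte Carlo estimation of a random-walk centrality can be sketched on a plain user graph as below; the paper's influence centrality is defined over the combined user/activity structure and adds further optimizations:

```python
import random

def estimate_centrality(neighbors, nodes, n_walks=2000, walk_len=10, seed=0):
    """Estimate a random-walk visit-frequency centrality for each node.

    neighbors: dict node -> list of neighbor nodes
    Runs short random walks from uniformly chosen start nodes and counts
    visits; frequently visited nodes score higher.
    """
    rng = random.Random(seed)
    visits = {v: 0 for v in nodes}
    for _ in range(n_walks):
        v = rng.choice(nodes)
        for _ in range(walk_len):
            visits[v] += 1
            nbrs = neighbors.get(v)
            if not nbrs:
                break  # dead end: stop this walk
            v = rng.choice(nbrs)
    total = sum(visits.values())
    return {v: c / total for v, c in visits.items()}
```

A greedy influence-maximization loop would repeatedly pick the highest-centrality node and discount the contribution of nodes already covered by the chosen seed set.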

Adaptive Overlapping Community Detection with Bayesian NonNegative Matrix Factorization

Overlapping community detection in a real network is unsupervised, and it is hard to know the exact number of communities or the quantized strength with which each node relates to each community. Using Non-negative Matrix Factorization (NMF) for community detection, we can find two non-negative matrices whose product approximates the network adjacency matrix well. With a Bayesian interpretation of the factorization process, we can not only determine the most appropriate number of communities in a large network with a shrinkage method, but also derive a good threshold for deciding whether a node should be assigned to a community in fuzzy situations. We apply our approach to several real networks and a synthetic benchmark network. Experimental results for overlapping community detection show that our method effectively finds the number of communities and the overlapping degree, and achieves better performance than other existing overlapping community detection methods.

Xiaohua Shi, Hongtao Lu, Guanbo Jia
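The NMF step itself, factorizing the adjacency matrix A into two non-negative matrices whose product approximates A, can be sketched in pure Python with the classic Lee-Seung multiplicative updates (the Bayesian shrinkage used to select the community count is not shown):

```python
import random

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def nmf(A, k, iters=200, eps=1e-9, seed=1):
    """Factorize A (n x n, non-negative) as W @ H with W: n x k, H: k x n.

    W[i][c] can be read as the membership strength of node i in community
    c, which is what makes overlapping communities possible.
    """
    rng = random.Random(seed)
    n = len(A)
    W = [[rng.random() for _ in range(k)] for _ in range(n)]
    H = [[rng.random() for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T A) / (W^T W H)
        Wt = list(map(list, zip(*W)))
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n)]
             for a in range(k)]
        # W <- W * (A H^T) / (W H H^T)
        Ht = list(map(list, zip(*H)))
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return W, H

def frob_err(A, W, H):
    # squared Frobenius reconstruction error ||A - WH||^2
    R = matmul(W, H)
    return sum((A[i][j] - R[i][j]) ** 2
               for i in range(len(A)) for j in range(len(A)))
```

The Lee-Seung updates never increase the reconstruction error, so the factorization steadily improves from a random start.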

A Unified Approach for Learning Expertise and Authority in Digital Libraries

Managing individual expertise is a major concern within any industry-wide organization. While previous works have extensively studied the related expertise and authority profiling issues, they assume semantic independence of these two key concepts. In digital libraries, state-of-the-art models generally summarize a researcher's profile using solely textual information. Consequently, authors with a large number of publications are mechanically favored to the detriment of less prolific ones with possibly higher expertise. To overcome this drawback, we propose to merge the two representations of expertise and authority and balance the results by capturing a mutual reinforcement principle between these two notions. Based on a graph representation of the library, the expert profiling task is formulated as an optimization problem in which latent expertise and authority representations are learned simultaneously, unbiasing the expertise scores of individuals with a large number of publications. The proposal is instantiated on a public scientific bibliographic dataset, where researchers' publications are considered a source of evidence of individual expertise and citation relations a source of authoritative signals. Results from our experiments over the Microsoft Academic Search database demonstrate significant improvement over state-of-the-art models for the expert retrieval task.

B. de La Robertie, L. Ermakova, Y. Pitarch, A. Takasu, O. Teste

Graph and Network Data Processing


Efficient Local Clustering Coefficient Estimation in Massive Graphs

Graphs are a powerful tool for modeling interactions in disparate applications, and assessing the structure of a graph is an essential task across all these domains. As a classic measure characterizing the connectivity of graphs, the clustering coefficient and its variants are of particular interest in graph structural analysis. However, the largest of today's graphs may have nodes and edges at billion scale, which makes even the simple task of computing clustering coefficients complicated and expensive. Thus, approximate solutions have recently attracted much attention from researchers. However, they target only global and binned degree-wise clustering coefficient estimation, and their techniques are not suitable for estimating local clustering coefficients, which are of great importance for individual nodes. In this paper, we propose a new sampling scheme to estimate the local clustering coefficient with bounded error, of which global and binned degree-wise clustering coefficients are special cases. Based on our sampling scheme, we propose a new framework to estimate all three clustering coefficients in a unified way. To make it scalable to massive graphs, we further design an efficient MapReduce algorithm under this framework. Extensive experiments validate the efficiency and effectiveness of our algorithms, which significantly outperform state-of-the-art exact and approximate algorithms on many real graph datasets.

Hao Zhang, Yuanyuan Zhu, Lu Qin, Hong Cheng, Jeffrey Xu Yu
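The local clustering coefficient of a node is the fraction of its neighbor pairs that are themselves connected, and the sampling idea can be illustrated by sampling such pairs (wedges) directly; the paper's scheme adds error bounds and a MapReduce formulation:

```python
import random
from itertools import combinations

def local_cc_exact(adj, v):
    # exact: check every pair of neighbors for an edge between them
    nbrs = list(adj[v])
    d = len(nbrs)
    if d < 2:
        return 0.0
    closed = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return closed / (d * (d - 1) / 2)

def local_cc_sampled(adj, v, samples=1000, seed=0):
    # estimate the same quantity by sampling wedges (pairs of neighbors)
    rng = random.Random(seed)
    nbrs = list(adj[v])
    if len(nbrs) < 2:
        return 0.0
    hits = 0
    for _ in range(samples):
        a, b = rng.sample(nbrs, 2)
        hits += b in adj[a]
    return hits / samples
```

The exact version is quadratic in the degree, which is exactly what hurts on high-degree nodes of massive graphs; the sampled version's cost depends only on the sample budget.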

Efficient Processing of Growing Temporal Graphs

Temporal graphs are useful in modeling real-world networks. For example, in a phone call network, people may communicate with each other in multiple time periods, which can be modeled as multiple temporal edges. However, the size of real-world temporal graphs keeps increasing rapidly (e.g., considering the number of phone calls recorded each day), which makes it difficult to efficiently store and analyze the complete temporal graphs. We propose a new model, called equal-weight damped time window model, to efficiently manage temporal graphs. In this model, each time window is assigned a unified weight. This model is flexible as it allows users to control the tradeoff between the required storage space and the information loss. It also supports efficient maintenance of the windows as new data come in. We then discuss applications that use the model for analyzing temporal graphs. Our experiments demonstrated that we can handle massive temporal graphs efficiently with limited space.

Huanhuan Wu, Yunjian Zhao, James Cheng, Da Yan

Effective k-Vertex Connected Component Detection in Large-Scale Networks

Finding components with high connectivity is an important problem in component detection, with a wide range of applications, e.g., social network analysis, web page research and bioinformatics. In particular, the k-edge connected component (k-ECC) has recently been extensively studied to discover disjoint components. Yet many real applications present needs and challenges for overlapping components. In this paper, we propose a k-vertex connected component (k-VCC) model, which is much more cohesive and therefore allows overlap between components. To find k-VCCs, a top-down framework is first developed to compute exact k-VCCs. To further reduce the high computational cost on large input networks, a bottom-up framework is then proposed: instead of using the structure of the entire network, it locally identifies seed subgraphs and obtains heuristic k-VCCs by expanding and merging these seed subgraphs. Comprehensive experimental results on large real and synthetic networks demonstrate the efficiency and effectiveness of our approaches.

Yuan Li, Yuhai Zhao, Guoren Wang, Feida Zhu, Yubao Wu, Shengle Shi

Spatial Databases


Efficient Landmark-Based Candidate Generation for kNN Queries on Road Networks

The k nearest neighbor (kNN) query on road networks finds the k closest points of interest (POIs) by network distance from a query point. A past study showed that a kNN technique using a simple Euclidean distance heuristic to generate candidate POIs significantly outperforms more complex techniques. While Euclidean distance is an effective lower bound when network distances represent physical distance, its accuracy degrades greatly for metrics such as travel time. Landmarks have been used to compute tighter lower bounds in such cases; however, past attempts to use them in kNN querying failed to retrieve candidates efficiently. We present two techniques to address this problem, one using ordered Object Lists for each landmark and another using a combination of landmarks and Network Voronoi Diagrams (NVDs) to compute lower bounds only to a small subset of objects that may be kNNs. Our extensive experimental study shows these techniques (particularly NVDs) significantly improve on the previous best techniques in terms of both heuristic and query time performance.
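A landmark lower-bounds network distance via the triangle inequality: for any landmark l, d(u, v) >= |d(l, u) - d(l, v)|. A minimal sketch of this bound and of bound-ordered candidate generation (not the Object List or NVD machinery from the paper) might look like:

```python
def landmark_lower_bound(landmark_dists, u, v):
    """Lower bound on the network distance d(u, v) via the triangle
    inequality, maximized over all landmarks.
    `landmark_dists[l]` maps each node to its distance from landmark l."""
    return max(abs(d[u] - d[v]) for d in landmark_dists.values())

def candidates_by_bound(landmark_dists, query, objects):
    """Order candidate POIs by their lower bound from the query point, so
    that exact (expensive) distance computations can terminate early."""
    return sorted(objects,
                  key=lambda o: landmark_lower_bound(landmark_dists, query, o))
```

A kNN search would then refine candidates in this order and stop once the k-th best exact distance is below the next candidate's lower bound.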

Tenindra Abeywickrama, Muhammad Aamir Cheema

MinSum Based Optimal Location Query in Road Networks

Consider a road network G on which a set C of clients and a set S of servers are located. An optimal location query (OLQ) in road networks finds a location such that, when a new server is set up at this location, a certain objective function computed over the clients and the servers serving them is optimized. In this paper, we study the OLQ with the MinSum objective function in road networks, namely the MinSum query. This problem has been studied before, but the state of the art is still not efficient enough. We propose an efficient algorithm based on a two-level pruning technique. We also study an extension of the MinSum query problem, namely the optimal multiple-location MinSum query. Since this extension is shown to be NP-hard, we propose a greedy algorithm and give an approximation guarantee for our solution. Extensive experiments on real road networks demonstrate the efficiency of our proposed algorithms.

Lv Xu, Ganglin Mai, Zitong Chen, Yubao Liu, Genan Dai

Efficiently Mining High Utility Co-location Patterns from Spatial Data Sets with Instance-Specific Utilities

Traditional spatial co-location pattern mining attempts to find subsets of spatial features whose instances are frequently located together in some regions. Most previous studies take the prevalence of co-locations as the interestingness measure. However, in some cases it is more meaningful to take the utility value of each instance into account. In this paper, we present a new interestingness measure for mining high utility co-location patterns from spatial data sets with instance-specific utilities. The new measure takes both intra-utility and inter-utility into consideration to capture the global influence of each feature in co-locations. We present a basic algorithm for mining high utility co-locations and, to reduce its high computational cost, several pruning strategies that improve efficiency. Experiments on synthetic and real-world data sets show that the proposed method is effective and the pruning strategies are efficient.

Lizhen Wang, Wanguo Jiang, Hongmei Chen, Yuan Fang

Real Time Data Processing


Supporting Real-Time Analytic Queries in Big and Fast Data Environments

Recently there has been significant interest in performing real-time analytical queries in systems that can handle both “big data” and “fast data”. In this paper, we propose an approximate answering approach, called ROSE, which can manage big and fast data streams and support complex analytical queries over them. To achieve this goal, we start with an analysis of existing query processing techniques in big data systems to understand the requirements for building a distributed analytic sketch. We then propose a sampling-based sketch that can extract multi-faceted samples from asynchronous data streams, and augment its usability with accuracy-lossless distributed sketch construction operations such as splitting, merging and union. Experimental results on real-world data sets indicate that, compared with the state-of-the-art approximate answering engine BlinkDB, our techniques obtain more accurate estimates and double system throughput. Compared with the distributed in-memory computing system Spark, our system achieves a two-orders-of-magnitude improvement in query response time.

Guangjun Wu, Xiaochun Yun, Chao Li, Shupeng Wang, Yipeng Wang, Xiaoyu Zhang, Siyu Jia, Guangyan Zhang

Boosting Moving Average Reversion Strategy for Online Portfolio Selection: A Meta-learning Approach

In this paper, we study the online portfolio selection problem from the perspective of meta-learning for mean reversion. The online portfolio selection problem aims to maximize the final accumulated wealth by rebalancing the portfolio at each time period based on the prices announced so far. Mean reversion is a typical principle in portfolio theory, and strategies that exploit it have achieved the best empirical performance to date. However, existing mean reversion strategies have important limitations. First, they must fix a window size in advance, where the optimal window size can only be chosen in hindsight. Second, most existing techniques ignore the temporal heterogeneity of historical price relatives from different periods. Moreover, most mean reversion methods suffer from noise and outliers in the data, which greatly affect their performance. To address these limitations, we exploit the mean reversion principle from a meta-learning perspective and propose a boosting method for price relative prediction. More specifically, we generate several experts, each following a specific mean reversion policy, and predict the final price relatives with meta-learning techniques. The pool of experts covers mean reversion strategies with various window sizes, while the meta-learning technique brings temporal heterogeneity and stronger robustness to the prediction. We adopt online passive-aggressive learning for portfolio optimization with the predicted price relatives. Extensive experiments on real-world datasets show that our approach significantly outperforms the state-of-the-art approaches.
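A window-w mean reversion expert predicts the next price relative as the ratio of the w-period moving average to the latest price, and a meta-learner combines the experts. The sketch below is a crude stand-in for the paper's boosting scheme: `prices` as a (time x assets) NumPy array and externally supplied combination weights are both assumptions.

```python
import numpy as np

def ma_reversion_expert(prices, window):
    """Predict the next price relative under mean reversion: each price is
    expected to revert toward its window-period moving average."""
    ma = prices[-window:].mean(axis=0)
    return ma / prices[-1]

def combine_experts(prices, windows, weights):
    """Weighted combination of window-specific experts (an illustrative
    meta-level step, not the paper's exact boosting procedure)."""
    preds = np.array([ma_reversion_expert(prices, w) for w in windows])
    w = np.asarray(weights, dtype=float)
    return w @ preds / w.sum()
```

Using experts with several window sizes sidesteps the need to commit to a single window in advance; the weights can then be updated online based on each expert's realized accuracy.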

Xiao Lin, Min Zhang, Yongfeng Zhang, Zhaoquan Gu, Yiqun Liu, Shaoping Ma

Continuous Summarization over Microblog Threads

With the dramatic growth of social media users, microblogs are created and shared at an unprecedented rate. The high velocity and large volume of short text posts (microblogs) bring redundancy and noise, making it hard for users and analysts to elicit useful information. In this paper, we formalize the problem from a summarization angle as Continuous Summarization over Microblog Threads (CSMT), which considers three facets: the information gain of the microblog dialogue, diversity, and temporal information. This summarization problem differs from the classic ones in two respects: (i) it is considered over large-scale, dynamic data with a high update frequency; (ii) the context between microblogs is taken into account. We first prove that the CSMT problem is NP-hard. Then we propose a greedy algorithm with a ($$1-1/\mathrm{e}$$) performance guarantee. Finally, we extend the greedy algorithm to a sliding window to continuously summarize microblog threads. Our experimental results on large-scale datasets show that our method outperforms two baselines in terms of summary diversity and information gain, with a time cost close to that of the best-performing baseline.
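A (1 - 1/e) guarantee of this kind typically comes from greedily maximizing a monotone submodular objective under a cardinality budget. The sketch below shows the generic greedy template; the `gain` function is a placeholder, since the paper's actual objective (combining information gain, diversity, and temporal information) is not reproduced here.

```python
def greedy_summarize(posts, budget, gain):
    """Generic greedy selection for a monotone submodular objective under a
    cardinality budget; this template achieves the classic (1 - 1/e)
    approximation guarantee. `gain(summary, p)` must return the marginal
    gain of adding post p to the current summary."""
    summary = []
    candidates = set(range(len(posts)))
    while len(summary) < budget and candidates:
        best = max(candidates, key=lambda i: gain(summary, posts[i]))
        if gain(summary, posts[best]) <= 0:
            break  # no post adds value; stop early
        summary.append(posts[best])
        candidates.remove(best)
    return summary
```

With a simple coverage gain (number of new terms a post contributes), the greedy pass picks the most informative posts first and skips redundant ones.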

Liangjun Song, Ping Zhang, Zhifeng Bao, Timos Sellis

Drawing Density Core-Sets from Incomplete Relational Data

Incompleteness is a ubiquitous issue and brings challenges to answering queries with completeness guarantees. A density core-set is a subset of an incomplete dataset whose completeness approximates that of the entire dataset. Density core-sets are an effective mechanism to estimate the completeness of queries on incomplete datasets. This paper studies the problems of drawing density core-sets from incomplete relational data; to the best of our knowledge, there is no such proposal in prior work. (1) We study the problems of drawing density core-sets under different requirements, and prove that they are all NP-complete whether or not functional dependencies are given. (2) We propose an efficient algorithm to draw an approximate density core-set, employing an approximate Knapsack algorithm and weighted sampling techniques to select important candidate tuples. (3) Analysis of the proposed algorithm shows that the relative error between the completeness of the approximate density core-set and that of a density core-set of the same size is within a given bound with high probability. (4) Experiments on both real-world and synthetic datasets demonstrate the effectiveness and efficiency of the algorithm.

Yongnan Liu, Jianzhong Li, Hong Gao

Big Data (Industrial)


Co-training an Improved Recurrent Neural Network with Probability Statistic Models for Named Entity Recognition

Named Entity Recognition (NER) is a subtask of information extraction in the Natural Language Processing (NLP) field and has been widely studied. Recurrent Neural Networks (RNNs) have become a popular way to perform NER, but they need a large amount of training data. The lack of labeled training data is one of the hard problems, and the traditional co-training strategy is one way to alleviate it. In this paper, we consider this situation and focus on NER with co-training using an RNN and two probabilistic models, the Hidden Markov Model (HMM) and the Conditional Random Field (CRF). We propose a modified RNN model that redefines the activation function. Compared to the traditional sigmoid function, our new function avoids saturation to some degree and keeps its output range very close to [0, 1], thus improving recognition accuracy. Our experiments are conducted on the ATIS benchmark. First, supervised learning with these models is compared across different training data sizes. The results show that it is not necessary to use the whole dataset; even a small part of the training data can achieve good performance. Then, we compare our modified RNN with the original RNN and obtain a 0.5% improvement. Last, we compare the co-training results: the HMM and CRF gain higher improvements than the RNN after co-training, and using our modified RNN in co-training improves their performance further.

Yueqing Sun, Lin Li, Zhongwei Xie, Qing Xie, Xin Li, Guandong Xu

EtherQL: A Query Layer for Blockchain System

Blockchain, the innovation behind Bitcoin, enables people to exchange digital money with complete trust and seems to be transforming the way we think about trust. While blockchain is designed for secure, immutable funds transfer in a trustless, decentralized environment, its underlying storage is very simple, with only limited support for data access. Moreover, blockchain data are highly compressed before being flushed to disk, making it hard to gain insight into this valuable data set. In this work, we develop EtherQL, an efficient query layer for Ethereum, the most representative open-source blockchain system. EtherQL provides highly efficient query primitives for analyzing blockchain data, including range queries and top-k queries, and can be flexibly integrated with other applications. Moreover, EtherQL provides different levels of abstraction suitable for data analysts, researchers and application developers.

Yang Li, Kai Zheng, Ying Yan, Qi Liu, Xiaofang Zhou

Optimizing Scalar User-Defined Functions in In-Memory Column-Store Database Systems

User-defined functions such as currency conversion and factory calendar are important ingredients in many business applications. Since these are expensive user-defined functions, optimizing them is essential to high-performance business applications. We optimize scalar user-defined functions by caching function call results. In this paper, we investigate which method of function result caching is best in the context of in-memory column-store database systems. Experiments show that our method, which implements the function result cache as an array, combined with the SAP HANA in-memory column store, provides the high performance required by real-time global business applications.
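Caching a scalar UDF's results means repeated calls with the same arguments are answered from memory instead of re-executing the function. The sketch below is a hypothetical direct-mapped, array-backed cache in Python to illustrate the idea; it is not SAP HANA's implementation, and the slot layout and eviction-on-collision policy are assumptions.

```python
class ArrayResultCache:
    """Direct-mapped function-result cache backed by a fixed-size array.
    Each slot stores an (args, result) pair; a colliding entry simply
    overwrites the old one, keeping lookups O(1) with no linked structures."""

    def __init__(self, fn, size=1024):
        self.fn = fn
        self.size = size
        self.slots = [None] * size

    def __call__(self, *args):
        idx = hash(args) % self.size
        slot = self.slots[idx]
        if slot is not None and slot[0] == args:
            return slot[1]                    # cache hit
        result = self.fn(*args)               # cache miss: evaluate the UDF
        self.slots[idx] = (args, result)      # store (evicting on collision)
        return result
```

For UDFs like currency conversion, where the same (currency pair, date) arguments recur across millions of rows, even this simple scheme eliminates most redundant evaluations.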

Cheol Ryu, Sunho Lee, Kihong Kim, Kunsoo Park, Yongsik Kwon, Sang Kyun Cha, Changbin Song, Emanuel Ziegler, Stephan Muench

GPS-Simulated Trajectory Detection

Due to the prevalence of GPS-enabled devices and wireless communication technology, spatial trajectories have become the basis of many location-based applications, e.g., Didi. However, trajectory data suffers from quality problems caused by sensor errors and artificial forgeries. Sensor errors are inevitable, while forgeries are constructed with malicious intent; for example, some Didi drivers use GPS simulators to generate forged trajectories and make fake transactions. In this work, we aim to determine whether a given trajectory is GPS-simulated. By formulating this task as a problem of traffic speed extraction and irregularity measurement, we propose a simulated trajectory detection framework. In the traffic speed extraction phase, we divide time into slots and extract the regular speed of each road during each slot. In the irregularity measurement phase, we propose three methods to measure the distance between the speed of the given trajectory and the real traffic speeds. In our empirical study, we apply the solution to a real trajectory dataset and find that the framework can detect most forged trajectories.
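One simple way to score a trajectory against the extracted traffic speeds is the mean relative deviation over the (road, time-slot) pairs it traverses. This is only an illustrative distance, a plausible stand-in for the three measures the paper proposes; the key and input formats are assumptions.

```python
def irregularity_score(traj_speeds, regular_speeds):
    """Mean relative deviation of a trajectory's observed speeds from the
    regular traffic speed of each (road, time_slot) it traverses.
    Both arguments map (road, time_slot) keys to speeds."""
    deviations = [
        abs(speed - regular_speeds[key]) / regular_speeds[key]
        for key, speed in traj_speeds.items()
    ]
    return sum(deviations) / len(deviations)
```

A high score indicates speeds inconsistent with real traffic conditions, flagging the trajectory as possibly simulated.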

Han Su, Wei Chen, Rong Liu, Min Nie, Bolong Zheng, Zehao Huang, Defu Lian

Social Networks and Graphs (Industrial)


Predicting Academic Performance via Semi-supervised Learning with Constructed Campus Social Network

Academic performance prediction has recently received wide attention due to its potential for early warning and timely intervention. However, few studies consider the effect of social influence on academic performance; the major challenge is the difficulty of collecting a precise friend list for students. To this end, we first construct students’ social relationships based on their campus behavior, and then predict academic performance on the constructed social network via semi-supervised learning. We evaluate the proposed algorithm on over 5,000 students with more than 14M behavior records. The results show the potential value of the campus social network for predicting academic performance and the effectiveness of the proposed algorithm.

Huaxiu Yao, Min Nie, Han Su, Hu Xia, Defu Lian

Social User Profiling: A Social-Aware Topic Modeling Perspective

Social user profiling is an analytical process that delivers an in-depth blueprint of users’ personal characteristics in social networks, which can enable a wide range of applications, such as personalized recommendation and targeted marketing. While social user profiling has attracted a lot of attention in the past few years, it is still very challenging to collaboratively model both user-centric information and social network structure. To this end, in this paper we develop an analytic framework for solving the social user profiling problem. Specifically, we first propose a novel social-aware semi-supervised topic model, i.e., User Profiling based Topic Model (UPTM), which can reconcile the observed user characteristics and social network structure for discovering the latent reasons behind social connections and further extracting users’ potential profiles. In addition, to improve the profiling performance, we further develop a label propagation strategy for refining the profiling results of UPTM. Finally, we conduct extensive evaluations with a variety of real-world data, where experimental results demonstrate the effectiveness of our proposed modeling method.

Chao Ma, Chen Zhu, Yanjie Fu, Hengshu Zhu, Guiquan Liu, Enhong Chen

Cost-Effective Data Partition for Distributed Stream Processing System

Data skew and dynamics greatly affect the throughput of a stream processing system, which calls for a highly efficient partitioning method that evenly distributes the workload in a distributed, parallel setting. Previous research mainly focuses on load balancing adjustment at key granularity or tuple granularity, each with its own limitations, such as clumsy balancing activities or expensive network cost. In this paper, we present a comprehensive cost model for partitioning methods, which jointly estimates memory, CPU and network resource utilization. Based on this cost model, we propose a novel load balancing adjustment algorithm that adopts the idea of “split keys on demand, merge keys whenever possible” and adapts to differently skewed workloads. Our evaluation demonstrates that our method outperforms state-of-the-art partitioning schemes while maintaining high throughput and resource utilization.

Xiaotong Wang, Junhua Fang, Yuming Li, Rong Zhang, Aoying Zhou

A Graph-Based Push Service Platform

Learning users’ preferences and making recommendations is critical in an information-exploded environment. There are two typical modes of recommendation, pull and push, which respectively account for recommendation inside and outside the item market. While most previous recommender systems adopt only the pull mode, the push mode has become popular in today’s mobile environment. This paper presents a push recommendation platform successfully deployed for the Huawei App Store, which had reached 0.3 billion registered users and 1.2 million apps by 2016. Among the various modules of this push platform, we identified the task of target user group discovery as the most essential in terms of click-through rate (CTR). We explored various algorithmic choices for mining the target user group, and highlight one based on a recent advance in graph mining, the Partially Absorbing Random Walk [13], which leads to substantial improvement for our push recommendation compared to the state of the art, including the popular PageRank. We also cover our experience deploying the push platform on both a single server and a distributed cluster.

Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, Xiuqiang He

Edge Influence Computation in Dynamic Graphs

Reachability queries are of great importance in many research and application areas, including general graph mining and social network analysis. Many approaches have been proposed to compute whether there exists a path from one node to another in a graph. Most of these approaches focus on static graphs; in practice, however, dynamic graphs are more common. In this paper, we focus on handling graph reachability queries in dynamic graphs. Specifically, we investigate the influence of a given edge in the graph, aiming to study the overall reachability changes brought by the possible failure or deletion of that edge. To this end, we first develop an efficient update algorithm for handling edge deletions. We then define the edge influence concept and put forward a novel algorithm to accelerate the computation of edge influence. We evaluate our approach using several real-world datasets. The experimental results show that our approach significantly outperforms traditional approaches.

Yongrui Qin, Quan Z. Sheng, Simon Parkinson, Nickolas J. G. Falkner



DKGBuilder: An Architecture for Building a Domain Knowledge Graph from Scratch

In recent years, we have witnessed technical advances in general knowledge graph construction. However, for a specific domain, harvesting precise and fine-grained knowledge is still difficult due to the long-tail property of entities and relations, together with the lack of high-quality, wide-coverage data sources. In this paper, we present DKGBuilder, a domain knowledge graph construction system. It uses a template-based approach to extract seed knowledge from semi-structured data. A word-embedding-based projection model is proposed to extract relations from text under the framework of distant supervision. We further employ an is-a relation classifier to learn a domain taxonomy in a bottom-up manner. For demonstration, we construct a Chinese entertainment knowledge graph from Wikipedia to support several knowledge service functionalities, containing over 0.7M facts with 93.1% accuracy.

Yan Fan, Chengyu Wang, Guomin Zhou, Xiaofeng He

CLTR: Collectively Linking Temporal Records Across Heterogeneous Sources

A huge volume of data on the Web is continually made available, providing users with rich information to learn more about entities. In addition to the attribute values of entities, there is often relational information, such as friendship on social networks or coauthorship of papers. However, understanding how facts across heterogeneous data sources are related is challenging for users because entities evolve over time. In this paper, we propose a novel system to help users find how records are temporally related and understand how entity profiles evolve over time. Our system is able to Collectively Link Temporal Records (CLTR) by exploiting evidence from both attribute and relational information on multiple sources. We demonstrate how CLTR allows users to explore the time-varying history of targeted entities and visualizes multi-type relations among entities.

Yanyan Zou, Kasun S. Perera

PhenomenaAssociater: Linking Multi-domain Spatio-Temporal Datasets

This paper demonstrates an analytics dashboard application for discovering spatio-temporal associations between anomalies across multiple spatio-temporal datasets, potentially from disparate domains, to reveal hidden relationships. The system analyzes spatio-temporal data across multiple phenomena from disparate domains (for example, traffic and weather) and identifies phenomena relationships by linking anomalies from each of the domain datasets. This web-based dashboard application, developed in R Shiny [1], provides interactive visualizations to quantify the multi-domain associations. The application uses a novel framework of algorithms and quantification metrics to associate anomalies across domains via spatial and temporal proximity and influence metrics.

Prathamesh Walkikar, Vandana P. Janeja

VisDM–A Data Stream Visualization Platform

Visual Data stream Monitor (VisDM) is a new approach that integrates a visual programming language with a data stream management system (DSMS) to support the construction, configuration, and visualization of data stream applications through a set of building blocks, Visual Data Flow Components (VDFCs). This functionality is provided by extending the LabVIEW visual programming platform to support easy declarative specification of continuous visualizations of continuous query results. With actor-based data flows, visualization of data stream output becomes more accessible.

Lars Melander, Kjell Orsborn, Tore Risch, Daniel Wedlund

