
2014 | Book

Database Systems for Advanced Applications

19th International Conference, DASFAA 2014, Bali, Indonesia, April 21-24, 2014. Proceedings, Part I

Edited by: Sourav S. Bhowmick, Curtis E. Dyreson, Christian S. Jensen, Mong Li Lee, Agus Muliantara, Bernhard Thalheim

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

This two-volume set, LNCS 8421 and LNCS 8422, constitutes the refereed proceedings of the 19th International Conference on Database Systems for Advanced Applications, DASFAA 2014, held in Bali, Indonesia, in April 2014. The 62 revised full papers presented together with 1 extended abstract paper, 4 industrial papers, 6 demo presentations, 3 tutorials and 1 panel paper were carefully reviewed and selected from a total of 257 submissions. The papers cover the following topics: big data management; indexing and query processing; graph data management; spatio-temporal data management; databases for emerging hardware; data mining; probabilistic and uncertain data management; web and social data management; security, privacy and trust; keyword search; data stream management; and data quality.

Table of Contents

Frontmatter

Keynote Talk

Challenges for Dataset Search

Ranked search of datasets has emerged as a need as shared scientific archives grow in size and variety. Our own work has shown that IR-style, feature-based relevance scoring can be an effective tool for data discovery in scientific archives. However, maintaining interactive response times as archives scale will be a challenge. We report here on our exploration of performance techniques for Data Near Here, a dataset search service. We present a sample of results evaluating filter-restart techniques in our system, including two variations, adaptive relaxation and contraction. We then outline further directions for research in this domain.

David Maier, V. M. Megler, Kristin Tufte

Invited Paper from Recipients of the Ten-Year Best Paper Award

Secure Computation on Outsourced Data: A 10-year Retrospective

This paper outlines the "behind-the-scenes" story of how we wrote the DASFAA 2004 paper on secure computation on outsourced data to support SQL aggregation queries. We then describe some of the research we have done following that paper.

Hakan Hacıgümüş, Bala Iyer, Sharad Mehrotra

Big Data Management

Online Indexing and Distributed Querying Model-View Sensor Data in the Cloud

As sensors of various kinds penetrate our daily life (e.g., sensor networks for environmental monitoring), the efficient management of massive amounts of sensor data is becoming increasingly important. Traditional sensor data management systems based on relational databases lack the scalability to accommodate large-scale sensor data efficiently. Consequently, distributed key-value stores in the cloud are becoming the prime tool for managing sensor data. Meanwhile, model-view sensor data management stores the sensor data in the form of modelled segments. However, there are currently no index or query optimization techniques for modelled segments in the cloud, which results in full table scans in the worst case. In this paper, we propose an innovative model index for sensor data segments in key-value stores (KVM-index). KVM-index consists of two interval indices on the time and sensor-value dimensions respectively, each of which has an in-memory search tree and a secondary list materialized in the key-value store. This composite structure enables new incoming sensor data segments to be updated with constant network I/O. Second, for time (or value) range and point queries, a MapReduce-based approach is designed to process only the discrete predicate-related ranges of KVM-index, thereby eliminating the computation and communication overheads incurred by accessing irrelevant parts of the index table in conventional MapReduce programs. Finally, we propose a cost-based adaptive strategy for the KVM-index-MapReduce framework to process composite queries. As extensive experiments show, our approach outperforms both MapReduce-based processing of the raw sensor data and multiple alternative approaches to model-view sensor data in terms of query response time.

Tian Guo, Thanasis G. Papaioannou, Hao Zhuang, Karl Aberer
Discovery of Areas with Locally Maximal Confidence from Location Data

A novel algorithm is presented for discovering areas having locally maximized confidence of an association rule on a collection of location data. Although location data obtained from GPS-equipped devices have promising applications, those GPS points are usually not uniformly distributed in two-dimensional space. As a result, substantial insights might be missed by using data mining algorithms that discover admissible or rectangular areas under the assumption that the GPS data points are distributed uniformly. The proposed algorithm composes transitively connected groups of irregular meshes that have locally maximized confidence. There is thus no need to assume uniformity, which enables the discovery of areas not limited to a certain class of shapes. Iterative removal of the meshes in accordance with the local maximum property enables the algorithm to perform 50 times faster than state-of-the-art ones.

Hiroya Inakoshi, Hiroaki Morikawa, Tatsuya Asai, Nobuhiro Yugami, Seishi Okamoto
Multi-way Theta-Join Based on CMD Storage Method

In the era of Big Data, analyzing such vast quantities of data is a challenging problem, and conducting a multi-way theta-join query is one of the most time-consuming operations. MapReduce has received much attention in the massive data processing area, and several join algorithms based on it have been proposed in recent years. However, the MapReduce paradigm itself may not be suitable for some scenarios, and multi-way theta-join appears to be one of them. Many multi-way theta-join algorithms for traditional parallel databases have been proposed over the years, but none has addressed the CMD (coordinate modulo distribution) storage method, although some algorithms for equi-join on CMD have been proposed. In this paper, we propose a multi-way theta-join method that takes advantage of the CMD storage method. Experiments suggest that it is a valid and efficient method that achieves significant improvement compared to approaches based on MapReduce.

Lei Li, Hong Gao, Mingrui Zhu, Zhaonian Zou
MIGSOM: A SOM Algorithm for Large Scale Hyperlinked Documents Inspired by Neuronal Migration

The SOM (Self-Organizing Map), one of the most popular unsupervised machine learning algorithms, maps high-dimensional vectors into low-dimensional data (usually a 2-dimensional map). The SOM is widely known as a "scalable" algorithm because of its capability to handle large numbers of records. However, it is effective only when the vectors are small and dense. Although a number of studies on making the SOM scalable have been conducted, technical issues of scalability and performance for sparse high-dimensional data such as hyperlinked documents still remain. In this paper, we introduce MIGSOM, an SOM algorithm inspired by a recent discovery on neuronal migration. The two major advantages of MIGSOM are its scalability for sparse high-dimensional data and its clustering visualization functionality. In this paper, we describe the algorithm and implementation in detail, and show the practicality of the algorithm in several experiments. We applied MIGSOM not only to experimental data sets but also to a large-scale real data set: Wikipedia's hyperlink data.

Kotaro Nakayama, Yutaka Matsuo

Indexing and Query Processing

Scalable Numerical Queries by Algebraic Inequality Transformations

To enable historical analyses of logged data streams by SQL queries, the Stream Log Analysis System (SLAS) bulk-loads data streams derived from sensor readings into a relational database system. SQL queries over such log data often involve numerical conditions containing inequalities, e.g., to find suspected deviations from normal behavior based on some function over measured sensor values. However, such queries are often slow to execute because the query optimizer is unable to utilize ordered indexes on attributes inside numerical conditions. To speed up these queries, they need to be reformulated to utilize available indexes. In SLAS, the query transformation algorithm AQIT (Algebraic Query Inequality Transformation) automatically transforms SQL queries involving a class of algebraic inequalities into more scalable SQL queries that utilize ordered indexes. The experimental results show that queries preprocessed by AQIT execute substantially faster on a commercial DBMS.

Thanh Truong, Tore Risch
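The core idea behind AQIT-style rewriting can be illustrated with a small sketch: a predicate that applies a strictly increasing, invertible function to an indexed column is rewritten into a range condition on the column itself, which an ordered index can then serve. The following Python sketch is a hypothetical illustration of that general principle, not the authors' AQIT implementation; the function table and the tuple encoding of predicates are invented for the example.

```python
import math

# Strictly increasing, invertible functions paired with their inverses.
# For such an f, the predicate f(x) > c is equivalent to x > f_inverse(c),
# which an ordered (B-tree) index on column x can serve directly.
INVERTIBLE = {
    "sqrt": (math.sqrt, lambda c: c * c),
    "log":  (math.log,  math.exp),
}

def rewrite(pred):
    """Rewrite ('sqrt', 'x', '>', 4.0) into ('x', '>', 16.0)."""
    fname, column, op, const = pred
    if fname not in INVERTIBLE or op not in ("<", "<=", ">", ">="):
        return pred  # leave predicates we cannot rewrite untouched
    _, f_inv = INVERTIBLE[fname]
    # Increasing functions preserve the direction of the comparison.
    return (column, op, f_inv(const))

print(rewrite(("sqrt", "x", ">", 4.0)))  # ('x', '>', 16.0)
```

A decreasing function (e.g., negation) would additionally flip the comparison operator; handling a whole class of composed algebraic expressions, as AQIT does, requires recursing over the expression tree.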
SAQR: An Efficient Scheme for Similarity-Aware Query Refinement

Query refinement techniques enable database systems to automatically adjust a submitted query so that its result satisfies some specified constraints. While current techniques are fairly successful in generating refined queries based on cardinality constraints, they are rather oblivious to the (dis)similarity between the input query and its corresponding refined version. Meanwhile, enforcing similarity-aware query refinement is a rather challenging task, as it would require an exhaustive examination of the large space of possible query refinements. To address this challenge, we propose a novel scheme for efficient Similarity-aware Query Refinement (SAQR). SAQR aims to balance the tradeoff between satisfying the cardinality and similarity constraints imposed on the refined query so as to maximize its overall benefit to the user. To achieve that goal, SAQR implements efficient strategies to minimize the costs incurred in exploring the available search space. In particular, SAQR utilizes both similarity-based and cardinality-based pruning techniques to bound the search space and quickly find a refined query that meets the user's expectations. Our experimental evaluation shows the scalability exhibited by SAQR under various workload settings, and the significant benefits it provides.

Abdullah Albarrak, Mohamed A. Sharaf, Xiaofang Zhou
Concurrent Execution of Mixed Enterprise Workloads on In-Memory Databases

In the world of enterprise computing, single applications are often classified either as transactional or analytical. From a data management perspective, both application classes issue a database workload with commonly agreed characteristics. However, traditional database management systems (DBMS) are typically optimized for one or the other. Today, we see two trends in enterprise applications that require bridging these two workload categories: (1) enterprise applications of both classes access a single database instance, and (2) transactional applications issue longer-running, analytical-style queries. As a reaction to this change, in-memory DBMSs on multi-core CPUs have been proposed to handle the mix of transactional and analytical queries in a single database instance. However, running heterogeneous queries potentially causes situations where longer-running queries block shorter-running queries from execution. A task-based query execution model with priority-based scheduling allows for an effective prioritization of query classes. This paper discusses the impact of task granularity on responsiveness and throughput of an in-memory DBMS. We show that a larger task size for long-running operators negatively affects the response time of short-running queries. Based on this observation, we propose a solution to limit the maximum task size with the objective of controlling the mutual performance impact of query classes.

Johannes Wust, Martin Grund, Kai Hoewelmeyer, David Schwalb, Hasso Plattner
On Data Partitioning in Tree Structure Metric-Space Indexes

Tree structure metric-space indexing methods recursively partition data according to their distances to a set of selected reference points (also called pivots). There are two basic forms of data partitioning: ball partition and General Hyper-plane (GH) partition. Most existing work only shows their superiority experimentally, and little theoretical proof is found. We propose an approach to unify existing data partitioning methods and analyze their performance theoretically. First, in theory, we unify the two basic forms of partitioning by proving that they are rotations of each other. Second, we show several theoretical and experimental results which indicate that ball partition outperforms GH partition. Our work takes a step forward in the theoretical study of metric-space indexing and offers a guideline for future index design.

Rui Mao, Sheng Liu, Honglong Xu, Dian Zhang, Daniel P. Miranker
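The two basic partitioning forms compared above can be sketched in a few lines: ball partition splits by distance to a single pivot and a radius, while GH partition assigns each point to the closer of two pivots. This is a minimal Python illustration over an arbitrary metric; the one-dimensional example metric is chosen only for readability and is not from the paper.

```python
def ball_partition(points, pivot, radius, dist):
    """Ball partition: split by distance to a single pivot."""
    inside = [p for p in points if dist(p, pivot) <= radius]
    outside = [p for p in points if dist(p, pivot) > radius]
    return inside, outside

def gh_partition(points, pivot_a, pivot_b, dist):
    """General Hyper-plane partition: assign each point to its closer pivot."""
    near_a = [p for p in points if dist(p, pivot_a) <= dist(p, pivot_b)]
    near_b = [p for p in points if dist(p, pivot_a) > dist(p, pivot_b)]
    return near_a, near_b

d1 = lambda x, y: abs(x - y)           # a metric on the real line
pts = [0, 1, 2, 5, 9]
print(ball_partition(pts, 0, 2, d1))   # ([0, 1, 2], [5, 9])
print(gh_partition(pts, 0, 9, d1))     # ([0, 1, 2], [5, 9])
```

In a tree index, each subset is then partitioned recursively; the triangle inequality lets queries prune whole subtrees using only the pivot distances.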

Graph Data Management

Improving Performance of Graph Similarity Joins Using Selected Substructures

Similarity join of complex structures is an important operation in managing graph data. In this paper, we investigate the problem of graph similarity join with edit distance constraints. Existing algorithms extract substructures – either rooted trees or simple paths – as features, and transform the edit distance constraint into a weaker count filtering condition. However, the performance suffers from the heavy overlapping or low selectivity of substructures. To resolve the issue, we first present a general framework for substructure-based similarity join and a tighter count filtering condition. It is observed under the framework that using either too few or too many substructures can result in poor filtering performance. Thus, we devise an algorithm to select substructures for filtering. The proposed techniques are integrated into the framework, constituting a new algorithm, whose superiority is witnessed by experimental results.

Xiang Zhao, Chuan Xiao, Wenjie Zhang, Xuemin Lin, Jiuyang Tang
Linear Path Skyline Computation in Bicriteria Networks

A bicriteria network is an interlinked data set where edges are labeled with two cost attributes. An example is a road network where edges represent road segments labeled with traversal time and energy consumption. To measure the proximity of two nodes in network data, the common method is to compute a cost-optimal path between the nodes. In a bicriteria network, there often is no unique path that is optimal w.r.t. both types of cost. Instead, a path skyline describes the set of non-dominated paths that are optimal under varying preference functions. In this paper, we examine the subset of the path skyline which is optimal under the most common type of preference function, the weighted sum. We examine characteristics of this stricter domination relation. Furthermore, we introduce techniques to efficiently maintain the set of linearly non-dominated paths. Finally, we introduce a new algorithm to compute all linearly non-dominated paths, denoted as the linear path skyline. In our experimental evaluation, we compare our new approach to other methods for computing the linear skyline and to efficient approaches for computing path skylines.

Michael Shekelyan, Gregor Jossé, Matthias Schubert, Hans-Peter Kriegel
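The geometric fact underlying the linear path skyline is that a weighted sum with positive weights is minimized at a vertex of the lower convex hull of the cost vectors. The following Python sketch extracts the linearly non-dominated subset from an already computed two-cost path skyline by a hull scan; it illustrates only this geometry and is not the authors' algorithm, which computes the linear skyline directly.

```python
def linear_skyline(skyline):
    """skyline: list of (cost1, cost2) vectors of mutually non-dominated
    paths. Returns the ones optimal for some weighted sum w1*c1 + w2*c2
    with w1, w2 > 0: the vertices of the lower convex hull."""
    pts = sorted(skyline)          # increasing cost1, hence decreasing cost2
    hull = []
    for p in pts:
        # Pop the previous point while it lies on or above the segment
        # joining its predecessor and p: such a point can never be the
        # unique minimizer of a positively weighted sum.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) <= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull
```

For instance, with cost vectors (0, 3), (2, 2.5), (3, 0), the middle path costs 4.5 under the weights w1 = w2 = 1 while both others cost 3, and indeed no positive weighting makes it optimal, so the hull scan drops it.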
Label and Distance-Constraint Reachability Queries in Uncertain Graphs

A fundamental research problem concerning edge-labeled uncertain graphs is the label and distance-constraint reachability query (LDCR): given two vertices u and v, query the probability that u can reach v through paths whose labels and lengths are constrained by a label set and a distance threshold, respectively. Since LDCR is #P-complete and hence intractable, we aim to propose effective and efficient approximate solutions for it. We first introduce a subpath-based filtering strategy which combines a divide-and-conquer algorithm and branch path pruning to compress the original graph and reduce the scale of the DC-tree. Then, to approximate LDCR, several estimators are presented based on different sampling mechanisms, and a path/cut bound is proposed to prune large-deviation values. An extensive experimental evaluation on both real and synthetic datasets demonstrates that our approaches exhibit prominent performance in terms of query time and accuracy.

Minghan Chen, Yu Gu, Yubin Bao, Ge Yu
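A baseline sampling estimator for a query of this kind can be sketched as follows: sample possible worlds of the uncertain graph, keep only edges whose labels are permitted, and measure in what fraction of worlds v is reachable from u within the distance threshold. This hedged Python sketch shows only the basic Monte Carlo mechanism; the paper's subpath-based filtering, DC-tree and variance-reducing estimators are not reproduced, and the edge encoding is an assumption.

```python
import heapq
import random

def ldcr_estimate(edges, u, v, labels, dmax, trials=2000, seed=7):
    """Monte Carlo estimate of the probability that u reaches v via paths
    using only edge labels in `labels` with total length <= dmax.
    edges: list of (src, dst, label, length, prob) for an uncertain graph."""
    rng = random.Random(seed)
    usable = [e for e in edges if e[2] in labels]   # label constraint
    hits = 0
    for _ in range(trials):
        # Sample one possible world: each usable edge exists independently.
        adj = {}
        for s, t, _, w, p in usable:
            if rng.random() < p:
                adj.setdefault(s, []).append((t, w))
        # Dijkstra from u, pruned at the distance threshold.
        dist = {u: 0}
        heap = [(0, u)]
        while heap:
            d, x = heapq.heappop(heap)
            if d > dist.get(x, float("inf")):
                continue
            for t, w in adj.get(x, []):
                nd = d + w
                if nd <= dmax and nd < dist.get(t, float("inf")):
                    dist[t] = nd
                    heapq.heappush(heap, (nd, t))
        hits += v in dist
    return hits / trials

edges = [("u", "v", "a", 1, 0.5)]
print(ldcr_estimate(edges, "u", "v", {"a"}, dmax=2))  # close to 0.5
```

The estimator is unbiased but needs many samples for small probabilities, which is exactly what the paper's bounds and pruned estimators address.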
Privacy-Preserving Reachability Query Services

Due to the massive volume of graph data from a wide range of recent applications and the resources required to process numerous queries at large scale, it is becoming economically appealing to outsource graph data to a third-party service provider ($\mathcal{SP}$) that provides query services. However, $\mathcal{SP}$ cannot always be trusted. Hence, data owners and query clients may prefer not to expose their data graphs and queries. This paper studies privacy-preserving query services for a fundamental graph query, namely the reachability query, where both clients' queries and the structural information of the owner's data are protected. We propose privacy-preserving 2-hop labeling (pp-2-hop), where the queries are computed in an encrypted domain and the input and output sizes of any queries are indistinguishable. We analyze the security of pp-2-hop with respect to ciphertext-only and size-based attacks. We verify the performance of pp-2-hop with an experimental study on both synthetic and real-world datasets.

Shuxiang Yin, Zhe Fan, Peipei Yi, Byron Choi, Jianliang Xu, Shuigeng Zhou

Spatio-temporal Data Management

SKY R-tree: An Index Structure for Distance-Based Top-k Query

Searches for objects associated with location information and non-spatial attributes have increased significantly over the years. To address this need, a top-k query may be issued that takes into account both location information and non-spatial attributes. This paper focuses on a distance-based top-k query that retrieves the best objects based on the distance from candidate objects to a query point as well as other non-spatial attributes. In this paper, we propose a new index structure and query processing algorithms for distance-based top-k queries. This new index, called SKY R-tree, draws on the strengths of the R-tree and the skyline algorithm to efficiently prune the search space by exploring both spatial proximity and non-spatial attributes. Moreover, we propose a variant of SKY R-tree, called S2KY R-tree, which incorporates a similarity measure of non-spatial attributes. We demonstrate, through extensive experimentation, that our proposals perform very well in terms of I/O costs and CPU time.

Yuya Sasaki, Wang-Chien Lee, Takahiro Hara, Shojiro Nishio
Causal Structure Discovery for Spatio-temporal Data

Numerous causal structure discovery methods have been proposed recently, but none of them has taken possible time-varying structure into consideration. In this paper, we introduce the notion of a causal time-varying dynamic Bayesian network (CTV-DBN) and define a causal boundary to govern cross-time information sharing. Spatio-temporal data have been investigated by multiple disciplines; by reducing structure discovery to a set of optimization problems, CTV-DBN is a scalable solution targeting large datasets. CTV-DBN is constructed using asymmetric kernels to address sample scarcity and to adhere to causal principles, while maintaining a good variance and bias trade-off. We explore trajectory data collected from mobile devices, which are known to exhibit heterogeneous patterns, data sparseness and distribution skewness. Contrary to a naïve method that divides space by grids, we capture the moving objects' view of space by using density-based clustering to overcome these problems. In our experiments, CTV-DBN is used to reveal the evolution of the time-varying region macro structure of a ring road system based on trajectories, and to obtain a local time-varying road junction dependency structure based on static traffic flow sensor data.

Victor W. Chu, Raymond K. Wong, Wei Liu, Fang Chen
Efficient Processing of Which-Edge Questions on Shortest Path Queries

In this paper, we formulate a novel problem called the Which-Edge question on shortest path queries. Specifically, this problem aims to find k edges that minimize the total distance for a given set of shortest path queries on a graph. This problem has important applications in logistics, urban planning, and network planning. We show the NP-hardness of the problem and present efficient algorithms that compute highly accurate results in practice. Experimental evaluations are carried out on real datasets, and the results show that our algorithms are scalable and return high-quality solutions.

Petrie Wong, Duncan Yung, Ming Hay Luk, Eric Lo, Man Lung Yiu, Kenny Q. Zhu
Reconciling Multiple Categorical Preferences with Double Pareto-Based Aggregation

Given a set of objects and a set of user preferences, both defined over a set of categorical attributes, the Multiple Categorical Preferences (MCP) problem is to determine the objects that are considered preferable by all users. In a naïve interpretation of MCP, matching degrees between objects and users are aggregated into a single score which ranks objects. Such an approach, though, obscures and blurs individual preferences, and can be unfair, favoring users with precise preferences and objects with detailed descriptions. Instead, we propose an objective and fair interpretation of the MCP problem, based on two Pareto-based aggregations. We introduce an efficient approach that is based on a transformation of the categorical attribute values and an index structure. Moreover, we propose an extension for controlling the number of returned objects. An experimental study on real and synthetic data finds that our index-based technique is an order of magnitude faster than a baseline approach, scaling up to millions of objects.

Nikos Bikakis, Karim Benouaret, Dimitris Sacharidis

Database for Emerging Hardware

CARIC-DA: Core Affinity with a Range Index for Cache-Conscious Data Access in a Multicore Environment

In recent years, the number of cores on a chip has been growing exponentially, enabling an ever-increasing number of processes to execute in parallel. Having been developed originally for single-core processors, database (DB) management systems (DBMSs) running on multicore processors suffer from cache conflicts as the number of concurrently executing DB processes (DBPs) increases. In this paper, we propose CARIC-DA, middleware for achieving higher performance in DBMSs on multicore processors by reducing cache misses with a new cache-conscious dispatcher for concurrent queries. CARIC-DA logically range-partitions the data set into multiple subsets. This enables different processor cores to access different subsets, by ensuring that different DBPs are pinned to different cores and by dispatching queries to DBPs according to the data-partitioning information. In this way, CARIC-DA is expected to achieve better performance via a higher cache hit rate for each core's private cache. It can also balance the loads between cores by changing the range of each subset. Note that CARIC-DA is pure middleware, which avoids any modification to existing operating systems (OSs) and DBMSs, thereby making it more practical. We implemented a prototype that uses unmodified existing Linux and PostgreSQL environments. The performance evaluation against benchmarks revealed that CARIC-DA achieved improved cache hit rates and higher performance.

Fang Xi, Takeshi Mishima, Haruo Yokota
Approximating an Energy-Proportional DBMS by a Dynamic Cluster of Nodes

The most energy-efficient configuration of a single-server DBMS is the highest-performing one, if we exclusively focus on specific applications where the DBMS can steadily run in the peak-performance range. However, typical DBMS activity levels, i.e., their average system utilization, are much lower, and their energy use is far from being energy proportional. Built of commodity hardware, WattDB, a distributed DBMS, runs on a cluster of computing nodes where energy proportionality is approached by dynamically adapting the cluster size. In this work, we combine our previous findings on energy-proportional storage layers and query processing into a single, transactional DBMS. We verify our vision by a series of benchmarks running OLTP and OLAP queries with varying degrees of parallelism. These experiments illustrate that WattDB dynamically adjusts to the workload present and reconfigures itself to satisfy performance demands while keeping its energy consumption at a minimum.

Daniel Schall, Theo Härder
APSkyline: Improved Skyline Computation for Multicore Architectures

The trend towards in-memory analytics and CPUs with an increasing number of cores calls for new algorithms that can efficiently utilize the available resources. This need is particularly evident in the case of CPU-intensive query operators. One example of such a query with applicability in data analytics is the skyline query. In this paper, we present APSkyline, a new approach for multicore skyline query processing, which adheres to the partition-execute-merge framework. Contrary to existing research, we focus on the partitioning phase to achieve significant performance gains, an issue largely overlooked in previous work in multicore processing. In particular, APSkyline employs an angle-based partitioning approach, which increases the degree of pruning that can be achieved in the execute phase, thus significantly reducing the number of candidate points that need to be checked in the final merging phase. APSkyline is extremely efficient for hard cases of skyline processing, as in the cases of datasets with large skyline result sets, where it is meaningful to exploit multicore processing.

Stian Liknes, Akrivi Vlachou, Christos Doulkeridis, Kjetil Nørvåg
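The partition-execute-merge framework with angle-based partitioning can be sketched for two dimensions: points are grouped by their polar angle, a local skyline is computed per slice (in parallel in the real system), and the local results are merged with a final skyline pass. This is a minimal sequential Python illustration for a minimization skyline, not the APSkyline implementation; the slice count and the simple sort-based skyline routine are assumptions.

```python
import math

def skyline(points):
    """Minimization skyline in 2D: keep points not dominated in both dims."""
    result, best_y = [], float("inf")
    for x, y in sorted(points):
        if y < best_y:          # strictly better in y than all smaller-x points
            result.append((x, y))
            best_y = y
    return result

def apskyline(points, partitions=4):
    """Partition points by polar angle, compute local skylines per slice
    (the parallelizable 'execute' phase), then merge."""
    slices = [[] for _ in range(partitions)]
    for x, y in points:
        angle = math.atan2(y, x)                # in [0, pi/2] for positive data
        idx = min(int(angle / (math.pi / 2) * partitions), partitions - 1)
        slices[idx].append((x, y))
    local = [skyline(s) for s in slices if s]   # execute phase
    merged = [p for part in local for p in part]
    return skyline(merged)                      # merge phase
```

Angle-based slicing tends to put mutually dominating points into the same slice, so local skylines are small and the merge phase has few candidates, which is the pruning effect the abstract describes.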

Data Mining

Greedy Filtering: A Scalable Algorithm for K-Nearest Neighbor Graph Construction

Finding the k-nearest neighbors of every node is one of the most important data mining tasks, serving as a primitive operation in the fields of Information Retrieval and Recommender Systems. However, existing approaches to this problem do not perform well when the number of nodes or dimensions is scaled up. In this paper, we present greedy filtering, an efficient and scalable algorithm for finding an approximate k-nearest neighbor graph by filtering out node pairs whose large-value dimensions do not match at all. In order to avoid skewness in the results and guarantee a time complexity of O(n), our algorithm essentially chooses a fixed number of node pairs as candidates for every node. We also present a faster version of greedy filtering based on the use of inverted indices for the node prefixes. We conduct extensive experiments in which we (i) compare our approaches to state-of-the-art algorithms on seven different types of datasets, and (ii) adapt algorithms from related fields (similarity join, top-k similarity join and similarity search) to solve this problem and evaluate them. The experimental results show that greedy filtering guarantees a high level of accuracy while also being much faster than other algorithms for large amounts of high-dimensional data.

Youngki Park, Sungchan Park, Sang-goo Lee, Woosung Jung
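The filtering idea can be sketched as follows: index every sparse vector under its few largest-value dimensions, and compare only pairs that share at least one such dimension, so pairs whose large-value dimensions do not match at all are never considered. This Python sketch is an illustration of that prefix-filtering principle; the fixed prefix size and the exhaustive verification within each posting list are simplifications of the paper's method.

```python
import heapq
import math
from collections import defaultdict

def knn_graph(vectors, k=2, prefix=2):
    """Approximate k-NN graph over sparse vectors {id: {dim: value}}.
    Two vectors become a candidate pair only if their `prefix`
    largest-value dimensions overlap."""
    # Inverted index over each vector's largest-value dimensions.
    index = defaultdict(list)
    for vid, vec in vectors.items():
        for dim in heapq.nlargest(prefix, vec, key=vec.get):
            index[dim].append(vid)

    def cosine(a, b):
        dot = sum(a[d] * b.get(d, 0.0) for d in a)
        na = math.sqrt(sum(x * x for x in a.values()))
        nb = math.sqrt(sum(x * x for x in b.values()))
        return dot / (na * nb)

    # Verify only candidate pairs sharing an indexed dimension.
    neighbors = {vid: {} for vid in vectors}
    for ids in index.values():
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                sim = cosine(vectors[a], vectors[b])
                neighbors[a][b] = sim
                neighbors[b][a] = sim
    return {vid: heapq.nlargest(k, sims, key=sims.get)
            for vid, sims in neighbors.items()}
```

Vectors whose large dimensions are disjoint, like a document about an entirely different topic, are never paired, which is what keeps the candidate count roughly linear in n.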
On Mining Proportional Fault-Tolerant Frequent Itemsets

Mining robust frequent itemsets has attracted much attention due to its wide applications in noisy data. In this paper, we study the problem of mining proportional fault-tolerant frequent itemsets in a large transactional database. A fault-tolerant frequent itemset allows a small amount of errors in each item and each supporting transaction. This problem is challenging since the anti-monotone property does not hold for candidate generation and the problem of fault-tolerant support counting is known to be NP-hard. We propose techniques that substantially speed up the state-of-the-art algorithm for the problem. We also develop an efficient heuristic method to solve an approximation version of the problem. Our experimental results show that the proposed speedup techniques are effective. In addition, our heuristic algorithm is much faster than the exact algorithms while the error is acceptable.

Shengxin Liu, Chung Keung Poon
An Efficient K-means Clustering Algorithm on MapReduce

As an important approach to analyzing massive data sets, an efficient k-means implementation on MapReduce is crucial in many applications. In this paper we propose a series of strategies to improve the efficiency of k-means for massive high-dimensional data points on MapReduce. First, we use locality-sensitive hashing (LSH) to map data points into buckets, based on which the original data points are converted into weighted representative points as well as outlier points. Then an effective center initialization algorithm is proposed, which achieves a higher quality of initial centers. Finally, a pruning strategy is proposed to speed up the iteration process by pruning unnecessary distance computations between centers and data points. An extensive empirical study shows that the proposed techniques can greatly improve both the efficiency and accuracy of k-means on MapReduce.

Qiuhong Li, Peng Wang, Wei Wang, Hao Hu, Zhongsheng Li, Junxian Li
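The first step described above can be sketched in plain Python: random-projection LSH maps nearby points to the same bucket, and each bucket is collapsed to a weighted representative, shrinking the input that the subsequent (weighted) k-means iterations must touch. The hash family, the number of hyperplanes, and the centroid-as-representative rule are illustrative assumptions, not the paper's exact construction.

```python
import random

def lsh_weighted_points(points, n_hashes=4, seed=42):
    """Bucket points with random-projection LSH and collapse each bucket to
    a weighted representative (centroid, count): the reduced input for a
    subsequent weighted k-means job."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Random hyperplanes through the origin; the signature is the pattern
    # of which side of each hyperplane a point falls on.
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_hashes)]

    def signature(p):
        return tuple(sum(a * b for a, b in zip(h, p)) >= 0 for h in planes)

    buckets = {}
    for p in points:
        buckets.setdefault(signature(p), []).append(p)

    reps = []
    for bucket in buckets.values():
        centroid = tuple(sum(c) / len(bucket) for c in zip(*bucket))
        reps.append((centroid, len(bucket)))   # (representative, weight)
    return reps
```

In the MapReduce setting, the signature computation is map-side and embarrassingly parallel; the weight attached to each representative lets later k-means steps treat it as `count` co-located points.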
Efficient Mining of Density-Aware Distinguishing Sequential Patterns with Gap Constraints

Distinguishing sequential patterns are useful in characterizing a given sequence class and contrasting that class against other sequence classes. This paper introduces the density concept into distinguishing sequential pattern mining, extending previous studies which considered gap and support constraints. Density is concerned with the number of times given patterns occur in individual sequences; it is an important factor in many applications including biology, healthcare and financial analysis. We present gd-DSPMiner, a mining method with various pruning techniques, for mining density-aware distinguishing sequential patterns that satisfy density and gap, as well as support, constraints. With respect to computational speed, when the procedures related to density are masked, gd-DSPMiner is substantially faster than previous distinguishing sequential pattern mining methods. Experiments on real data sets confirmed the effectiveness and efficiency of gd-DSPMiner in the general setting and its ability to discover density-aware distinguishing sequential patterns.

Xianming Wang, Lei Duan, Guozhu Dong, Zhonghua Yu, Changjie Tang

Probabilistic and Uncertain Data Management

Identifying Top k Dominating Objects over Uncertain Data

Uncertainty is inherent in many important applications, such as data integration, environmental surveillance, location-based services (LBS), sensor monitoring and radio-frequency identification (RFID). In recent years, we have witnessed significant research efforts devoted to producing probabilistic database management systems, and many important queries have been re-investigated in the context of uncertain data models. In this paper, we study the problem of the top-k dominating query on multi-dimensional uncertain objects, which is an essential method in multi-criteria decision analysis when an explicit scoring function is not available. In particular, we formally introduce the top-k dominating model based on the state-of-the-art top-k semantics over uncertain data. We also propose effective and efficient algorithms to identify the top-k dominating objects. Novel pruning techniques are proposed that utilize spatial indexing and statistical information, significantly improving the performance of the algorithms in terms of CPU and I/O costs. Comprehensive experiments on real and synthetic datasets demonstrate the effectiveness and efficiency of our techniques.

Liming Zhan, Ying Zhang, Wenjie Zhang, Xuemin Lin
Probabilistic Reverse Top-k Queries

Ranking-aware queries are among the most fundamental queries in the database management field. The ranking query that returns the top-k elements with maximal ranking scores according to a ranking function has been widely studied for decades. Recently, some researchers have also focused on finding all customers who treat a given query object as one of their top-k favorite elements, namely the reverse top-k query. In such applications, each customer is described as a vector. However, none of the existing work has considered the uncertain data case for the reverse top-k query, which is our focus. In this paper, we propose two methods to handle the probabilistic reverse top-k query, namely BLS and ALS. As a basic solution, the BLS approach checks each pair of user and product to find the query result. As an advanced solution, the ALS approach uses two pruning rules and historical information to significantly improve efficiency. Both detailed analysis and experiments on real and synthetic data sets illustrate the efficiency of our proposed methods.

Cheqing Jin, Rong Zhang, Qiangqiang Kang, Zhao Zhang, Aoying Zhou
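The pair-by-pair baseline that the abstract calls BLS can be sketched for the certain-data case. This is only an illustration of the query semantics, not the paper's probabilistic algorithm; it assumes the common convention from the reverse top-k literature that each user is a weight vector, scores are linear, and lower scores are preferred:

```python
def score(w, p):
    # linear scoring function: weighted sum of the product's attributes
    return sum(wi * pi for wi, pi in zip(w, p))

def reverse_top_k(users, products, query, k):
    """Return the users for whom `query` ranks among their top-k
    products (lower score = better), by checking every user/product pair."""
    result = []
    for w in users:
        q_score = score(w, query)
        better = sum(1 for p in products if score(w, p) < q_score)
        if better < k:  # fewer than k products beat the query object
            result.append(w)
    return result
```

In the probabilistic setting each product is uncertain, so "ranks among the top-k" becomes a probability, which is where the paper's pruning rules come in.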
Monitoring Probabilistic Threshold SUM Query Processing in Uncertain Streams

Many sources of data streams, e.g., geo-spatial streams derived from GPS-tracking systems or sensor streams provided by sensor networks, are inherently uncertain due to the impreciseness of sensing devices, outdated information, and human errors. In order to support data analysis on such data, aggregation queries are an important class of queries. This paper introduces a scalable approach for continuous probabilistic SUM query processing in uncertain stream environments. Here we consider an uncertain stream as a stream of uncertain values, each given by a probability distribution over the domain of the sensor values. Continuous probabilistic sum queries maintain the probability distribution of the sum of possible sensor values actually derived from the streaming environment. Our approach is able to efficiently compute the probabilistic SUM according to the possible world semantics, i.e., without any loss of information. Furthermore, we show that the query’s answer can be efficiently updated in dynamic environments where attribute values change frequently. Our experimental results show that our approach computes probabilistic sum queries efficiently, and that processing queries incrementally instead of performing computation from scratch further boosts the performance of our algorithm significantly.

Nina Hubig, Andreas Züfle, Tobias Emrich, Matthias Renz, Mario A. Nascimento, Hans-Peter Kriegel
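The possible-worlds SUM distribution the abstract describes can be computed exactly, for independent discrete uncertain values, by repeated convolution. The sketch below shows only this core idea, not the paper's incremental maintenance scheme; each uncertain value is assumed to be a `{value: probability}` dictionary:

```python
from collections import defaultdict

def convolve(dist_a, dist_b):
    """Exact distribution of the sum of two independent discrete
    random variables, each given as {value: probability}."""
    out = defaultdict(float)
    for va, pa in dist_a.items():
        for vb, pb in dist_b.items():
            out[va + vb] += pa * pb
    return dict(out)

def sum_distribution(uncertain_values):
    """Possible-worlds distribution of SUM over all uncertain values."""
    total = {0: 1.0}  # the empty sum is 0 with probability 1
    for dist in uncertain_values:
        total = convolve(total, dist)
    return total
```

A threshold query then just sums the probability mass above the threshold; the interesting part of the paper is keeping this distribution up to date as stream values arrive and expire, instead of recomputing the convolution from scratch.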
Efficient Processing of Probabilistic Group Nearest Neighbor Query on Uncertain Data

Uncertain data are inherent in various applications, and the group nearest neighbor (GNN) query is widely used in many fields. Existing work for answering probabilistic GNN (PGNN) queries on uncertain data is inefficient for irregularly shaped uncertain regions. In this paper, we propose two pruning algorithms for efficiently processing PGNN queries that are not sensitive to the shapes of uncertain regions. The spatial pruning algorithm utilizes the centroid point to efficiently filter out objects based on their spatial locations; the probabilistic pruning algorithm derives tighter bounds by partitioning uncertain objects. Furthermore, we propose a space partitioning structure to facilitate the partitioning process. Extensive experiments using both real and synthetic data show that our algorithms are not sensitive to the shapes of uncertain regions, and outperform existing work by about 2-3 times under various settings.

Jiajia Li, Botao Wang, Guoren Wang, Xin Bi
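For readers unfamiliar with the underlying query, here is the certain-data GNN baseline that the PGNN query generalizes: given a group of query points, return the data point minimizing the aggregate (here, sum of) distance to the whole group. This sketch is only the deterministic definition, not the paper's pruning algorithms:

```python
import math

def group_nearest_neighbor(points, group):
    """Return the data point minimizing the sum of Euclidean distances
    to all query points in `group` (the classic sum-aggregate GNN)."""
    def agg_dist(p):
        return sum(math.dist(p, q) for q in group)
    return min(points, key=agg_dist)
```

In the probabilistic variant, each data object is an uncertain region rather than a point, so the answer is a set of objects whose probability of being the GNN exceeds a threshold, and the centroid-based spatial pruning in the paper serves to discard objects without examining their full regions.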

Web and Social Data Management

Popularity Tendency Analysis of Ranking-Oriented Collaborative Filtering from the Perspective of Loss Function

Collaborative filtering (CF) has been the most popular approach for recommender systems in recent years. In order to analyze the properties of a ranking-oriented CF algorithm directly and to improve its performance, this paper investigates ranking-oriented CF from the perspective of loss functions. To gain insight into the popularity bias problem, we also study the tendency of a CF algorithm to recommend the most popular items, and show that this popularity tendency can be adjusted by setting different parameters in our models. After analyzing two state-of-the-art algorithms, we propose two models using the generalized logistic loss function and the hinge loss function, respectively. The experimental results show that the proposed methods outperform the state-of-the-art algorithms on two real data sets.

Xudong Mao, Qing Li, Haoran Xie, Yanghui Rao
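The two loss families named in the abstract act on the pairwise score margin between a preferred item and a less-preferred one. The sketch below shows common textbook forms of both losses; the exact parameterization used in the paper's models is not given in the abstract, so the `gamma` steepness parameter and the hinge `threshold` are assumptions:

```python
import math

def generalized_logistic_loss(margin, gamma=1.0):
    """Smooth pairwise ranking loss; gamma controls the steepness.
    `margin` = score(preferred item) - score(other item)."""
    return math.log(1.0 + math.exp(-gamma * margin)) / gamma

def hinge_loss(margin, threshold=1.0):
    """Pairwise hinge loss: zero once the margin exceeds `threshold`."""
    return max(0.0, threshold - margin)
```

The logistic loss penalizes every pair a little (and so keeps pushing popular, frequently-compared items up), while the hinge loss stops caring once a pair is separated by the threshold — one intuition for why the choice of loss shifts popularity tendency.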
Rating Propagation in Web Services Reputation Systems: A Fast Shapley Value Approach

A new challenge in Web services reputation systems is to update the reputation of component services. As an emerging solution, rating propagation has received much attention recently. Current rating propagation algorithms either fail to fairly distribute the overall rating to component services or can realize a fair rating distribution at the cost of exponential time complexity. In this paper, we propose a fast Shapley value approach to propagate the overall rating of a composite service to its component services. Our approach ensures the fairness of rating propagation by using the advantage of the Shapley value, but significantly decreases its computational complexity from exponential to quadratic. Its fairness and efficiency are validated by experiments.

An Liu, Qing Li, Xiaofang Zhou, Lu Li, Guanfeng Liu, Yunjun Gao
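To see why the abstract emphasizes reducing complexity from exponential to quadratic, consider the textbook Shapley value itself: a fair division of a coalition's total value that requires enumerating every sub-coalition. The sketch below is this exact (exponential) computation, not the paper's fast approach; `v` is assumed to be a characteristic function mapping a coalition (a frozenset of players) to its value:

```python
from itertools import combinations
from math import factorial

def shapley_values(players, v):
    """Exact Shapley values: a fair split of v(all players) among the
    players. Enumerates all coalitions, hence exponential in n."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(n):
            for coalition in combinations(others, r):
                # weight of a coalition of size r in the Shapley formula
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                s = frozenset(coalition)
                phi[p] += weight * (v(s | {p}) - v(s))
    return phi
```

With a composite service of even a few dozen components this enumeration is infeasible, which motivates the paper's quadratic-time propagation.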
CLUSM: An Unsupervised Model for Microblog Sentiment Analysis Incorporating Link Information

Microblogs have become a popular platform for people to share their ideas, information and opinions. In addition to textual content, social relations and user behaviors in microblogs provide additional link information, which can be used to improve the performance of sentiment analysis. However, traditional sentiment analysis approaches either focus on the plain text, or make simple use of links without distinguishing the different effects of different types of links. As a result, the performance of sentiment analysis on microblogs cannot be substantially improved. In this paper, we are the first to divide the links between microblogs into three classes. We further propose an unsupervised model called the Content and Link Unsupervised Sentiment Model (CLUSM), which performs microblog sentiment analysis by incorporating these three types of links. Comprehensive experiments were conducted to investigate the performance of our method. Experimental results showed that our proposed model outperformed the state of the art.

Gaoyan Ou, Wei Chen, Binyang Li, Tengjiao Wang, Dongqing Yang, Kam-Fai Wong
Location Oriented Phrase Detection in Microblogs

As a successful micro-blogging service, Twitter has demonstrated unprecedented popularity and international reach. Location extraction from micro-blogs (tweets) on this platform is an important challenge and can harness noisy but rich content. Extracting location information can enable a variety of applications such as query-by-location, local advertising, crisis awareness, and systems designed to provide information about events, points of interest (POIs) and landmarks. Considering the high throughput rate of the Twitter space, we propose an approach to detect location-oriented phrases relying solely on tweet contents. The system finds associated phrases dedicated to each specific scalable geographical area. We evaluated our approach on a real-world Twitter dataset from Australia and conducted a comprehensive comparison between strong local terms (uni-words) and phrases (multi-words). Our experiments verify the system’s capabilities against multiple trending baselines and demonstrate that our phrase-based approach captures locality better than single words.

Saeid Hosseini, Sayan Unankard, Xiaofang Zhou, Shazia Sadiq
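One simple way to make the notion of a "location-oriented phrase" concrete — purely illustrative, and not the scoring model proposed in the paper — is to compare how frequent an n-gram is within a region's tweets against its frequency in the whole corpus:

```python
from collections import Counter

def locality_scores(region_tweets, all_tweets, n=2):
    """Score n-gram phrases by how concentrated they are in a region:
    P(phrase | region) / P(phrase | whole corpus). Illustrative only."""
    def ngrams(tweets):
        counts = Counter()
        for text in tweets:
            words = text.lower().split()
            counts.update(tuple(words[i:i + n])
                          for i in range(len(words) - n + 1))
        return counts
    local, total = ngrams(region_tweets), ngrams(all_tweets)
    local_n, total_n = sum(local.values()), sum(total.values())
    return {ph: (cnt / local_n) / (total[ph] / total_n)
            for ph, cnt in local.items()}
```

Scores well above 1 indicate phrases used disproportionately in that region; the paper's comparison of uni-words against multi-word phrases is about which unit of text yields stronger such signals.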
Backmatter
Metadata
Title
Database Systems for Advanced Applications
Edited by
Sourav S. Bhowmick
Curtis E. Dyreson
Christian S. Jensen
Mong Li Lee
Agus Muliantara
Bernhard Thalheim
Copyright year
2014
Publisher
Springer International Publishing
Electronic ISBN
978-3-319-05810-8
Print ISBN
978-3-319-05809-2
DOI
https://doi.org/10.1007/978-3-319-05810-8
