
2014 | Book

Database Systems for Advanced Applications

19th International Conference, DASFAA 2014, Bali, Indonesia, April 21-24, 2014. Proceedings, Part II

Edited by: Sourav S. Bhowmick, Curtis E. Dyreson, Christian S. Jensen, Mong Li Lee, Agus Muliantara, Bernhard Thalheim

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

This two-volume set, LNCS 8421 and LNCS 8422, constitutes the refereed proceedings of the 19th International Conference on Database Systems for Advanced Applications, DASFAA 2014, held in Bali, Indonesia, in April 2014. The 62 revised full papers presented together with 1 extended abstract paper, 4 industrial papers, 6 demo presentations, 3 tutorials and 1 panel paper were carefully reviewed and selected from a total of 257 submissions. The papers cover the following topics: big data management, indexing and query processing, graph data management, spatio-temporal data management, databases for emerging hardware, data mining, probabilistic and uncertain data management, web and social data management, security, privacy and trust, keyword search, data stream management and data quality.

Table of Contents

Frontmatter

Data Mining

Ensemble Pruning: A Submodular Function Maximization Perspective

Ensemble pruning looks for a subset of classifiers from a group of trained classifiers that yields better prediction performance on the test set. Recently, ensemble pruning techniques have attracted significant attention in the machine learning and data mining communities. Unlike previous heuristic approaches, in this paper we formalize the ensemble pruning problem as a function maximization problem to strike an optimal balance between the quality of the classifiers and the diversity within the subset. First, a framework combining quality and pairwise diversity is proposed, and the objective function is proved to be submodular. Furthermore, we propose a submodular and monotonic function which is the composition of quality and entropy diversity. Although this maximization problem is still NP-hard, a greedy search algorithm with an approximation guarantee of factor 1 - 1/e is employed to get a near-optimal solution. Through extensive experiments on 36 real datasets, our empirical studies demonstrate that the proposed approaches achieve superior performance and better efficiency.
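
The greedy procedure behind the 1 - 1/e guarantee can be sketched as follows. This is a minimal illustration of greedy maximization of a monotone submodular set function under a cardinality constraint, with a toy coverage objective standing in for the paper's quality-diversity objective; the classifier names and covered-example sets are invented for illustration.

```python
def greedy_max(objective, candidates, k):
    """Greedy selection for a monotone submodular objective function.

    Repeatedly adds the candidate with the largest marginal gain; for a
    monotone submodular objective this is the classic (1 - 1/e)
    approximation of the optimal size-k subset.
    """
    selected, remaining = [], sorted(candidates)
    for _ in range(k):
        base = objective(selected)
        best, best_gain = None, float("-inf")
        for c in remaining:
            gain = objective(selected + [c]) - base
            if gain > best_gain:
                best, best_gain = c, gain
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy stand-in objective: the set of test examples each classifier
# predicts correctly; coverage of this kind is submodular.
classifiers = {"c1": {1, 2, 3}, "c2": {3, 4}, "c3": {5}, "c4": {1, 2}}

def coverage(subset):
    covered = set().union(*(classifiers[c] for c in subset)) if subset else set()
    return len(covered)
```

A usage example: `greedy_max(coverage, classifiers, 2)` first picks `"c1"` (marginal gain 3), then `"c2"` (gain 1), never reconsidering earlier choices.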

Chaofeng Sha, Keqiang Wang, Xiaoling Wang, Aoying Zhou
Identify and Trace Criminal Suspects in the Crowd Aided by Fast Trajectories Retrieval

Aided by the wide deployment of surveillance cameras in cities nowadays, capturing video of criminal suspects is much easier than before. However, it is usually hard to identify suspects from the content of surveillance video alone, due to low resolution, insufficient brightness, or occlusion. To address this problem, we consider the information of when and where a suspect is captured by the surveillance cameras and obtain a spatio-temporal sequence ζ_i. Then we search the records of the mobile network to locate the mobile phones whose trajectories are compatible with ζ_i. In this way, as long as the suspect is carrying a mobile phone when he is captured by surveillance cameras, we can identify his phone and trace him by locating the phone. To perform fast retrieval of trajectories, we propose a threaded tree structure to index the trajectories, and adopt a heuristics-based query optimization algorithm to prune unnecessary data accesses. Extensive experiments based on real mobile phone trajectory data show that a suspect's phone can be uniquely identified with high probability when he is captured by more than four cameras distributed in different cells of the mobile network. Furthermore, the experiments also indicate that our proposed algorithms can perform the search within 1 second in a trajectory dataset containing 104 million records.

Jianming Lv, Haibiao Lin, Can Yang, Zhiwen Yu, Yinghong Chen, Miaoyi Deng
Multi-Output Regression with Tag Correlation Analysis for Effective Image Tagging

Automatic image tagging is one of the most important research topics in multimedia. How to achieve accurate image tagging to bridge the semantic gap between images' content and users' semantic understanding has been widely studied in the last decade. One common approach is to convert image tagging into a multi-task learning problem. However, most existing methods ignore tag correlations in the learning process. In this paper, we show the importance of tag correlations in multi-task learning. We formulate image tagging as a multi-output regression problem accounting for tag correlations, which are captured by the covariance matrices of the regression coefficients and of the noise across all tags, respectively. The combination of multi-output regression with tag correlation analysis takes advantage of the latent dependencies among tags to overcome the limitations of existing work. Extensive experiments have been conducted on two benchmark datasets, and the results confirm that our approach outperforms the state-of-the-art methods.
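
As a baseline for the multi-output formulation, a closed-form ridge regression that fits all tag outputs jointly can be sketched as follows; the tag-correlation terms described in the paper would be layered on top of this. The shapes and regularizer are assumptions for illustration.

```python
import numpy as np

def multi_output_ridge(X, Y, lam=1.0):
    """Closed-form ridge regression with one output column per tag.

    X: (n_samples, n_features) image features; Y: (n_samples, n_tags)
    tag scores. Returns W of shape (n_features, n_tags). Modelling all
    tags jointly is the starting point for adding a learned covariance
    over the per-tag coefficient vectors.
    """
    d = X.shape[1]
    # Solve (X^T X + lam I) W = X^T Y for all tag columns at once.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
```

With a tiny regularizer and noiseless targets, the estimator recovers the true coefficient matrix, which makes it easy to sanity-check.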

Hongyun Cai, Zi Huang, Xiaofeng Zhu, Qing Zhang, Xuefei Li
The Ranking Based Constrained Document Clustering Method and Its Application to Social Event Detection

With the growing size and variety of social media files on the web, it is becoming critical to efficiently organize them into clusters for further processing. This paper presents a novel scalable constrained document clustering method that harnesses the power of search engines capable of dealing with large text data. Instead of calculating the distance between the documents and all of the clusters' centroids, a neighborhood of best cluster candidates is chosen using a document ranking scheme. To make the method faster and less memory-dependent, in-memory and in-database processing are combined in a semi-incremental manner. The method has been extensively tested on the social event detection application. Empirical analysis shows that the proposed method is efficient in both computation and memory usage while achieving notable accuracy.

Taufik Sutanto, Richi Nayak

Spatio-temporal Data Management

A Skylining Approach to Optimize Influence and Cost in Location Selection

The location-selection problem underlies many spatial decision-making applications. In this paper, we study an interesting location-selection problem with many applications, such as selecting banking outlet and hotel locations. In particular, given a number of spatial objects and a set of location candidates, we select locations that maximize influence but minimize cost. The influence of a location is defined by the number of spatial objects within a given distance, and the cost of a location is the minimum payment for that location, measured by quality vectors. We show that a straightforward extension of a skyline approach is inefficient, as it needs to compute the influence and cost of all location candidates, relying on many expensive range queries. To overcome this weakness, we extend the Branch and Bound Skyline (BBS) method with a novel spatial join algorithm. We derive influence and cost bounds to prune irrelevant R-tree entries and to confirm part of the final answers early. Theoretical analysis and extensive experiments demonstrate the efficiency and scalability of our proposed algorithms.
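
The Pareto-dominance test at the core of any skyline approach, specialized to (influence, cost) pairs, can be sketched as follows. This naive quadratic scan only illustrates the semantics; the paper's contribution is avoiding exactly this exhaustive computation via BBS and spatial joins.

```python
def skyline(locations):
    """Pareto-optimal locations: maximize influence, minimize cost.

    locations: list of (name, influence, cost) tuples. A candidate is
    dropped if some other location has influence >= and cost <= its own,
    with at least one of the two strict.
    """
    def dominates(a, b):
        return (a[1] >= b[1] and a[2] <= b[2]) and (a[1] > b[1] or a[2] < b[2])

    return [p for p in locations
            if not any(dominates(q, p) for q in locations if q is not p)]
```

For example, a candidate with influence 8 and cost 7 is dominated by one with influence 10 and cost 5, and is excluded from the skyline.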

Juwei Shi, Hua Lu, Jiaheng Lu, Chengxuan Liao
Geo-Social Skyline Queries

By leveraging the capabilities of modern GPS-equipped mobile devices providing social-networking services, the interest in developing advanced services that combine location-based services with social networking is growing drastically. Based on geo-social networks that couple personal location information with personal social context, such services are facilitated by geo-social queries that extract useful information combining the social relationships and current locations of the users. In this paper, we tackle the problem of geo-social skyline queries, a problem that has not been addressed so far. Given a set of persons D connected in a social network SN with information about their current locations, a geo-social skyline query reports, for a given user U ∈ D and a given location P (not necessarily the location of the user), the Pareto-optimal set of persons who are close to P and closely connected to U in SN. We measure the social connectivity between users using the widely adopted, but very expensive, Random Walk with Restart (RWR) method to obtain the social distance between users in the social network. We propose an efficient solution by showing how the RWR distance can be bounded efficiently and effectively in order to identify true hits and true drops early. Our experimental evaluation shows that our pruning techniques, using the proposed bounds alone, vastly reduce the number of objects for which a more exact social distance has to be computed.
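
A minimal power-iteration sketch of Random Walk with Restart, the social-distance measure mentioned above, is shown below. The dense-matrix formulation and parameter defaults are assumptions for a small example; the paper's point is bounding these scores without computing them in full.

```python
import numpy as np

def rwr(adj, source, restart=0.15, iters=100):
    """Random Walk with Restart scores from `source` by power iteration.

    adj: dense symmetric adjacency matrix (numpy array). At each step the
    walker follows a random outgoing edge with probability (1 - restart)
    or jumps back to `source` with probability `restart`. The returned
    vector of visiting probabilities serves as social proximity.
    """
    n = adj.shape[0]
    # Column-normalise so each column sums to 1 (a stochastic transition).
    P = adj / adj.sum(axis=0, keepdims=True)
    e = np.zeros(n)
    e[source] = 1.0
    r = e.copy()
    for _ in range(iters):
        r = (1 - restart) * (P @ r) + restart * e
    return r
```

On a symmetric graph the source node accumulates the highest score, and structurally equivalent nodes receive identical scores, which is a convenient correctness check.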

Tobias Emrich, Maximilian Franzke, Nikos Mamoulis, Matthias Renz, Andreas Züfle
Reverse-Nearest Neighbor Queries on Uncertain Moving Object Trajectories

Reverse nearest neighbor (RNN) queries in spatial and spatio-temporal databases have received significant attention in the database research community over the last decade. An RNN query finds the objects that have a given query object as their nearest neighbor. RNN queries find applications in data mining, marketing analysis, and decision making. Most previous research on RNN queries over trajectory databases assumes that the data are certain. In realistic scenarios, however, trajectories are inherently uncertain due to measurement errors or time-discretized sampling. In this paper, we study RNN queries in databases of uncertain trajectories. We propose two types of RNN queries based on a well-established stochastic-process model for uncertain spatio-temporal data, namely the Markov model. To the best of our knowledge, our work is the first to consider RNN queries on uncertain trajectory databases in accordance with possible worlds semantics. We include an extensive experimental evaluation on both real and synthetic data sets to verify our theoretical results.

Tobias Emrich, Hans-Peter Kriegel, Nikos Mamoulis, Johannes Niedermayer, Matthias Renz, Andreas Züfle
Selectivity Estimation of Reverse k-Nearest Neighbor Queries

This paper explores different heuristics to estimate the selectivity of a reverse k-nearest neighbor (RkNN) query. The proposed methods approximate the number of results for a given RkNN query, providing a key ingredient of common (relational) query optimizers. A range of experiments evaluates the quality of these estimates against the true number of results on both real and synthetic data, analyzing the potential of the proposed approximations to make accurate predictions that can be used to generate efficient query execution plans.

Michael Steinke, Johannes Niedermayer, Peer Kröger

Graph Data Management

Efficient Sampling Methods for Shortest Path Query over Uncertain Graphs

Graphs have become a widely used structure to model data. Unfortunately, data are inherently uncertain because of noise and incompleteness in data collection, which is why uncertain graphs have attracted much attention from researchers. However, the uncertain graph models in existing work assume that all edges in a graph are independent of each other, which does not hold in real applications. Thus, we propose a new model for uncertain graphs that considers the correlation among edges sharing the same vertex. Moreover, in this paper we mainly solve the shortest path query, a fundamental and important query on graphs, using our new model. As the problem of calculating the shortest path probability over correlated uncertain graphs is #P-hard, we propose several sampling methods to efficiently compute an approximate answer. The error of our algorithm is very small, which is proved and further verified in our experiments.
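
A basic Monte Carlo estimator for shortest-path queries on an uncertain graph can be sketched as follows. Note this sketch samples edges independently, while the paper's model additionally handles correlation among edges sharing a vertex; the edge format and trial count are illustrative assumptions.

```python
import heapq
import random

def sample_shortest(edges, s, t, n_nodes, trials=2000, seed=7):
    """Monte Carlo estimate of s-t shortest-path behaviour.

    edges: list of (u, v, weight, prob). Each trial keeps edge (u, v)
    with probability `prob` and runs Dijkstra on the sampled undirected
    graph. Returns (estimated probability that t is reachable, average
    shortest distance over the reachable samples).
    """
    rng = random.Random(seed)
    reach, dist_sum = 0, 0.0
    for _ in range(trials):
        g = {u: [] for u in range(n_nodes)}
        for u, v, w, p in edges:
            if rng.random() < p:            # sample one "possible world"
                g[u].append((v, w))
                g[v].append((u, w))
        # Dijkstra from s on the sampled graph.
        dist = {s: 0.0}
        pq = [(0.0, s)]
        while pq:
            d, u = heapq.heappop(pq)
            if d > dist.get(u, float("inf")):
                continue
            for v, w in g[u]:
                nd = d + w
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    heapq.heappush(pq, (nd, v))
        if t in dist:
            reach += 1
            dist_sum += dist[t]
    return reach / trials, (dist_sum / reach if reach else float("inf"))
```

On a two-edge path where the second edge exists with probability 0.5, the reachability estimate converges to about one half, and the conditional distance is exactly the path length.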

Yurong Cheng, Ye Yuan, Guoren Wang, Baiyou Qiao, Zhiqiong Wang
Exploiting Transitive Similarity and Temporal Dynamics for Similarity Search in Heterogeneous Information Networks

Heterogeneous information networks have attracted much attention in recent years, and a key challenge is to compute the similarity between two objects. In this paper, we study the problem of similarity search in heterogeneous information networks, and extend the meta path-based similarity measure PathSim by incorporating richer information, such as transitive similarity and temporal dynamics. Experiments on a large DBLP network show that our improved similarity measure is more effective at identifying similar authors in terms of their future collaborations.
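
For reference, the original PathSim measure that the paper extends can be computed from a meta path's commuting matrix as sketched below (the author-paper incidence example is illustrative).

```python
import numpy as np

def pathsim(M, i, j):
    """PathSim similarity under a symmetric meta path.

    M is the commuting matrix of the meta path, e.g. M = A @ A.T for
    author-paper-author where A is the author-paper incidence matrix:
    PathSim(i, j) = 2 * M[i, j] / (M[i, i] + M[j, j]), i.e. the number
    of meta paths between i and j, normalised by their self-path counts.
    """
    return 2.0 * M[i, j] / (M[i, i] + M[j, j])
```

Two authors sharing all of their papers score 1.0; authors with no co-authored papers score 0.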

Jiazhen He, James Bailey, Rui Zhang
Top-k Similarity Matching in Large Graphs with Attributes

Graphs have been widely used in social networks to find interesting relationships between individuals. To mine the wealth of information in an attributed graph, effective and efficient graph matching methods are critical. However, due to the noisy and incomplete nature of real graph data, approximate graph matching is essential. On the other hand, most users are only interested in the top-k similar matches, which raises the problem of top-k similarity search in large attributed graphs. In this paper, we propose a novel technique to find the top-k similar subgraphs. To prune unpromising data nodes effectively, our indexing structure is built on node degrees and their neighborhood connections. Then, a novel method combining graph structure and node attributes is used to calculate the similarity of matches and find the top-k results. We integrate an adapted threshold algorithm (TA) into the procedure to further enhance the similar-graph search. Extensive experiments are performed on a social graph to evaluate the effectiveness and efficiency of our methods.

Xiaofeng Ding, Jianhong Jia, Jiuyong Li, Jixue Liu, Hai Jin
On Perspective-Aware Top-k Similarity Search in Multi-relational Networks

It is fundamental to compute the most similar k nodes w.r.t. a given query node in networks; this serves as a primitive operator for tasks such as social recommendation, link prediction, and web search. Existing approaches to this problem do not consider the types of relationships (edges) between two nodes. However, in real networks there exist different kinds of relationships. Such networks are called multi-relational networks, in which different relationships can be modeled by different graphs. These graphs reflect the relationships of the objects from different perspectives. Since a link-based similarity measure is determined by the structure of the corresponding graph, similarity scores among nodes of the same network differ w.r.t. different perspectives. In this paper, we propose a new type of query, the perspective-aware top-k similarity query, to provide more insightful results for users. We efficiently obtain all top-k similar nodes to a given node simultaneously from all perspectives of the network. To accelerate query processing, several optimization strategies are proposed. Our solutions are validated by extensive experiments.

Yinglong Zhang, Cuiping Li, Hong Chen, Likun Sheng

Security, Privacy and Trust

ρ-uncertainty Anonymization by Partial Suppression

We present a novel framework for set-valued data anonymization by partial suppression; it works regardless of the amount of background knowledge the attacker possesses, and can be adapted to both space-time and quality-time trade-offs in a "pay-as-you-go" approach. While minimizing the number of item deletions, the framework attempts to either preserve the original data distribution or retain mineable useful association rules, targeting statistical analysis and association mining, the two major data mining applications on set-valued data.

Xiao Jia, Chao Pan, Xinhui Xu, Kenny Q. Zhu, Eric Lo
Access Control for Data Integration in Presence of Data Dependencies

Defining access control policies in a data integration scenario is a challenging task. In such a scenario, each source typically specifies its local access control policy and cannot anticipate the data inferences that can arise when data is integrated at the mediator level. Inferences, e.g., using functional dependencies, can allow malicious users to obtain prohibited information at the mediator level by linking multiple queries, thus violating the local policies. In this paper, we propose a framework, i.e., a methodology and a set of algorithms, to prevent such violations. First, we use a graph-based approach to identify sets of queries, called violating transactions, and then we propose an approach to forbid the execution of those transactions by identifying additional access control rules that should be added to the mediator. We also state the complexity of the algorithms and discuss a set of experiments we conducted using both real and synthetic datasets. The tests also confirm the complexity and worst-case upper bounds of the proposed algorithms.

Mehdi Haddad, Jovan Stevovic, Annamaria Chiasera, Yannis Velegrakis, Mohand-Saïd Hacid
Thwarting Passive Privacy Attacks in Collaborative Filtering

While recommender systems based on collaborative filtering have become an essential tool to help users access items of interest, it has been shown that collaborative filtering enables an adversary to perform passive privacy attacks, among the most damaging and easiest-to-perform privacy attacks. In a passive privacy attack, the dynamic nature of a recommender system allows an adversary with a moderate amount of background knowledge to infer a user's transaction through temporal changes in the public related-item lists (RILs). Unlike traditional solutions that manipulate the underlying user-item rating matrix, in this paper we respond to passive privacy attacks by directly anonymizing the RILs, which are the real outputs rendered to an adversary. This fundamental switch allows us to provide a novel, rigorous, inference-proof privacy guarantee, known as δ-bound, with desirable data utility and scalability. We propose anonymization algorithms based on suppression and a novel mechanism, permutation, tailored to our problem. Experiments on real-life data demonstrate that our solutions are both effective and efficient.

Rui Chen, Min Xie, Laks V. S. Lakshmanan
Privacy-Preserving Schema Reuse

As the number of schema repositories grows rapidly and several web-based platforms exist to support publishing schemas, schema reuse becomes a new trend. Schema reuse is a methodology that allows users to create new schemas by copying and adapting existing ones. This methodology reduces not only the effort of designing new schemas but also the heterogeneity between them. One of the biggest barriers to schema reuse is privacy concerns, which discourage schema owners from contributing their schemas. Addressing this problem, we develop a framework that enables privacy-preserving schema reuse. Our framework allows contributors to define their own protection policies in the form of privacy constraints. Instead of showing original schemas, the framework returns an anonymized schema with maximal utility while satisfying these privacy constraints. To validate our approach, we empirically show the efficiency of different heuristics, the correctness of the proposed utility function, the computation time, as well as the trade-off between utility and privacy.

Nguyen Quoc Viet Hung, Do Son Thanh, Nguyen Thanh Tam, Karl Aberer

Web and Social Data Management

Any Suggestions? Active Schema Support for Structuring Web Information

Backed by major Web players, schema.org is the latest broad initiative for structuring Web information. Unfortunately, a representative analysis of a corpus of 733 million Web documents shows that, a year after its introduction, only 1.56% of documents featured any schema.org annotations. A probable reason is that providing annotations is quite tiresome, hindering widespread adoption. Even state-of-the-art tools like Google's Structured Data Markup Helper offer only limited support. In this paper we propose SASS, a system for automatically finding high-quality schema suggestions for page content, to ease the annotation process. SASS intelligently blends supervised machine learning techniques with simple user feedback. Moreover, additional support features for binding attributes to values further reduce the necessary effort. We show that SASS is superior to current tools for schema.org annotations.

Silviu Homoceanu, Felix Geilert, Christian Pek, Wolf-Tilo Balke
ADI: Towards a Framework of App Developer Inspection

With the popularity of smart mobile devices, the number of mobile applications (or simply, apps) has been increasing dramatically in recent years. However, due to the low threshold for entering the app industry, app developers vary significantly in their expertise and reputation in producing apps. Currently, there is no well-recognized, objective, and effective means to profile app developers. As the mobile market grows, this already gives rise to the problem of finding appropriate apps from the user's point of view. In this paper, we propose a framework called App Developer Inspector (ADI), which aims to effectively profile app developers in terms of their expertise and reputation in developing apps. ADI is essentially founded on two underlying models: the App Developer Expertise (ADE) model and the App Developer Reputation (ADR) model. In a nutshell, ADE is a generative model that derives the latent expertise of each developer, and ADR is a model that exploits multiple features to evaluate app developers' reputation. Using the app developer profiles generated by ADI, we study two new applications which respectively facilitate app search and app development outsourcing. We conduct extensive experiments on a large real-world dataset to evaluate the performance of ADI. The results demonstrate the effectiveness of ADI in profiling app developers as well as its boosting impact on the two new applications.

Kai Xing, Di Jiang, Wilfred Ng, Xiaotian Hao
Novel Community Recommendation Based on a User-Community Total Relation

With the exponential increase of Web communities, community recommendation has become increasingly important in sifting valuable and interesting communities for users. In this paper, we study the problem of novel community recommendation, and propose a method based on a user-community total relation (i.e., user-user, community-community, and user-community interactions). Our novel recommendation method suggests communities that the target user has not seen but is potentially interested in, in order to broaden the user's horizon. Specifically, a Weighted Latent Dirichlet Allocation (WLDA) algorithm improves recommendation accuracy by utilizing social relations. A definition of community novelty together with an algorithm for novelty computation are further proposed based on the total relation. Finally, a multi-objective optimization strategy improves the overall recommendation quality by combining accuracy and novelty scores. Experimental results on a real dataset show that our proposed method outperforms state-of-the-art recommendation methods on both accuracy and novelty.

Qian Yu, Zhiyong Peng, Liang Hong, Bin Liu, Haiping Peng
User Interaction Based Community Detection in Online Social Networks

Discovering meaningful communities based on the interactions of different people in online social networks (OSNs) has been an active research topic in recent years. However, existing interaction-based community detection techniques either rely on content analysis or only consider the underlying structure of the social network graph while identifying communities in OSNs. As a result, these approaches fail to identify active communities, i.e., communities based on actual interactions rather than mere friendship. To alleviate the limitations of existing approaches, we propose a novel solution for community detection in OSNs. The key idea of our approach comes from the following observations: (i) the degree of interaction between each pair of users can vary widely, which we term the strength of ties, and (ii) for each pair of users, the interactions with mutual friends, which we term the group behavior, play an important role in determining their membership in the same community. Based on these two observations, we propose an efficient solution to detect communities in OSNs. A detailed experimental study shows that our proposed algorithm significantly outperforms state-of-the-art techniques on both real and synthetic datasets.
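
A minimal sketch of the tie-strength idea: build a graph from interaction counts, drop weak ties, and take connected components as active communities. This ignores the group-behavior signal and the paper's actual algorithm; the thresholding and component extraction are stand-ins for illustration.

```python
from collections import defaultdict

def interaction_communities(interactions, min_strength):
    """Communities from interaction counts rather than mere friendship.

    interactions: dict mapping (u, v) -> number of interactions (the tie
    strength). Edges weaker than min_strength are dropped, and the
    connected components of the remaining graph are returned as the
    active communities.
    """
    adj = defaultdict(set)
    for (u, v), w in interactions.items():
        if w >= min_strength:
            adj[u].add(v)
            adj[v].add(u)
    seen, comms = set(), []
    for node in sorted(adj):          # deterministic component order
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:                  # iterative DFS
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(adj[x] - comp)
        seen |= comp
        comms.append(frozenset(comp))
    return comms
```

Raising `min_strength` splits loosely connected groups apart, mirroring how interaction-based detection differs from friendship-only detection.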

Himel Dev, Mohammed Eunus Ali, Tanzima Hashem

Keyword Search

Object Semantics for XML Keyword Search

It is well known that some XML elements correspond to objects (in the sense of object-orientation) and others do not. The question we consider in this paper is what benefits we can derive from paying attention to such object semantics, particularly for the problem of keyword queries. Keyword queries against XML data have been studied extensively in recent years, with several lowest-common-ancestor based schemes proposed for this purpose, including SLCA, MLCA, VLCA, and ELCA. Identifying objects can help these techniques return more meaningful answers than just the LCA node (or subtree) by returning objects instead of nodes. More interestingly, object semantics can also be used to benefit the search itself. For this purpose, we introduce a novel Nearest Common Object Node (NCON) semantics, which includes not just common object ancestors but also common object descendants. We have developed XRich, a system for our NCON-based approach, and used it in our extensive experimental evaluation. The experimental results show that our proposed approach outperforms the state-of-the-art approaches in terms of both effectiveness and efficiency.

Thuy Ngoc Le, Tok Wang Ling, H. V. Jagadish, Jiaheng Lu
Large-Scale Similarity Join with Edit-Distance Constraints

In the age of big data, the data quality problem is more severe than ever. As an essential step in data cleaning, similarity join has attracted much attention from the database community. In this work, to address the similarity join problem with edit-distance constraints, we first improve the partition-based join algorithm for small-scale data. Then we extend the algorithm to the MapReduce framework for large-scale data. Extensive experiments on both real and simulated datasets demonstrate the efficiency of our algorithms.
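
The pigeonhole principle behind partition-based edit-distance joins can be sketched as follows: if two strings are within edit distance τ, then after splitting one of them into τ + 1 segments, at least one segment must appear verbatim in the other. The segmentation scheme and the brute-force pair loop are simplifications for illustration.

```python
def edit_distance(s, t):
    """Standard dynamic-programming (Levenshtein) edit distance."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                            # deletion
                         cur[j - 1] + 1,                         # insertion
                         prev[j - 1] + (s[i - 1] != t[j - 1]))   # substitution
        prev = cur
    return prev[n]

def partition_filter_join(strings, tau):
    """Self-join with edit-distance threshold tau using a pigeonhole filter.

    Pairs are first screened with the cheap segment-containment test;
    only survivors are verified with the exact DP computation.
    """
    def segments(t):
        k = tau + 1
        base, extra = divmod(len(t), k)
        segs, pos = [], 0
        for i in range(k):
            ln = base + (1 if i < extra else 0)
            segs.append(t[pos:pos + ln])
            pos += ln
        return segs

    result = []
    for a in range(len(strings)):
        for b in range(a + 1, len(strings)):
            s, t = strings[a], strings[b]
            if any(seg and seg in s for seg in segments(t)):  # filter
                if edit_distance(s, t) <= tau:                # verify
                    result.append((s, t))
    return result
```

The filter lets most dissimilar pairs be discarded without ever running the quadratic DP, which is the source of the algorithm's speedup.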

Chen Lin, Haiyang Yu, Wei Weng, Xianmang He
Topical Presentation of Search Results on Database

Clustering and faceting are two ways of presenting search results in databases. Clustering shows a summary of the answer space by grouping similar results. However, clusters are not self-explanatory, so users cannot clearly identify what can be found inside each cluster. On the other hand, faceting groups results by labelling, but there may be too many facets, overwhelming users.

In this paper, we propose a novel approach, topical presentation, to better present search results. We reckon that an effective presentation technique should be able to cluster results into a reasonable number of groups with intelligible meaning, and provide as much information as possible on the first screen. We define and study the presentation properties first, and then propose efficient algorithms to provide real-time presentation. Extensive experiments on real datasets show the effectiveness and efficiency of the proposed method.

Hao Hu, Mingxi Zhang, Zhenying He, Peng Wang, Wei Wang, Chengfei Liu
An Efficient Approach for Mining Top-k High Utility Specialized Query Expansions on Social Tagging Systems

A specialized query expansion consists of a set of keywords, and is used to reduce the size of search results so that users can find the required data conveniently. The utility of a specialized query expansion represents the qualities of the top-N high-quality objects matching the expansion. Given the search results of a keyword query on social tagging systems, we want to find k specialized query expansions with the highest utilities and without redundancy. Moreover, the discovered expansions are guaranteed to match at least N objects. We construct a tree structure, called a UT-tree, to maintain the tag sets appearing in the search results for generating the specialized query expansions. We first propose a depth-first approach to find the top-k high utility specialized query expansions from the UT-tree. To further speed up this basic approach, we exploit lower and upper bound estimations of the utility of a specialized query expansion to reduce the size of the constructed UT-tree. Only the tag sets of objects that could possibly determine the top-k high utility specialized query expansions need to be accessed and maintained. By applying this strategy, we propose another, faster algorithm. The experimental results demonstrate that the proposed algorithms perform well in terms of both effectiveness and efficiency.

Jia-Ling Koh, I-Chih Chiu

Data Stream Management

Novel Techniques to Reduce Search Space in Periodic-Frequent Pattern Mining

Periodic-frequent patterns are an important class of regularities that exist in a transactional database. Informally, a frequent pattern is said to be periodic-frequent if it appears at a regular interval specified by the user (i.e., periodically) in a database. A pattern-growth algorithm, called PFP-growth, has been proposed in the literature to discover these patterns. This algorithm constructs a tid-list for each pattern and performs a complete search on the tid-list to determine whether the corresponding pattern is periodic-frequent or not. In very large databases, the tid-list of a pattern can be very long, so performing a complete search over a pattern's tid-list can make the mining computationally expensive. In this paper, we make an effort to reduce the computational cost of mining the patterns. In particular, we apply greedy search on a pattern's tid-list to determine the periodic interestingness of the pattern. Greedy search enables us to prune non-periodic-frequent patterns with a sub-optimal solution, while finding the periodic-frequent patterns with the globally optimal solution, thus reducing the computational cost of mining without missing any knowledge pertaining to the periodic-frequent patterns. We introduce two novel pruning techniques and extend them to improve the performance of PFP-growth. We call the resulting algorithm PFP-growth++. Experimental results show that PFP-growth++ is runtime-efficient and highly scalable.

R. Uday Kiran, Masaru Kitsuregawa
Inferring Road Type in Crowdsourced Map Services

In crowdsourced map services, digital maps are created and updated manually by volunteer users. Existing service providers usually offer users a feature-rich map editor to add, drop, and modify roads. To make the map data more useful for widely used applications such as navigation systems and travel planning services, it is important to provide not only the topology of the road network and the shapes of the roads, but also the type of each road segment (e.g., highway, regular road, secondary way, etc.). To reduce the cost of manual map editing, it is desirable to generate proper recommendations for users to choose from or to modify further. Several recent works aim at generating road shapes from large numbers of historical trajectories; however, to the best of our knowledge, none of the existing works has addressed the problem of inferring road types from historical trajectories. In this paper, we propose a model-based approach to infer road types from taxi trajectories. We use a combined inference method based on stacked generalization, taking into account both the topology of the road network and the historical trajectories. The experimental results show that our approach generates quality recommendations of road types for users to choose from.

Ye Ding, Jiangchuan Zheng, Haoyu Tan, Wuman Luo, Lionel M. Ni
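The stacked-generalization idea in the abstract above can be illustrated with a minimal sketch: two toy base models (one keyed on trajectory speed, one on road-network topology) are combined by a meta-level that weights each base model by its held-out accuracy. All names (`speed_model`, `degree_model`, the feature keys) and the thresholds are hypothetical assumptions; the paper's actual features and learners are not specified here.

```python
from collections import Counter

def learn_weights(base_models, held_out):
    """Weight each base model by its held-out accuracy (a simple stacking proxy)."""
    weights = []
    for model in base_models:
        correct = sum(model(seg) == label for seg, label in held_out)
        weights.append(correct / len(held_out))
    return weights

def stacked_predict(base_models, meta_weights, segment):
    """Combine base-model votes using the weights learned on held-out data."""
    scores = Counter()
    for model, w in zip(base_models, meta_weights):
        scores[model(segment)] += w
    return scores.most_common(1)[0][0]

# Toy base models: one keys on average speed (trajectory evidence),
# one on junction degree (road-network topology). Thresholds are made up.
speed_model = lambda seg: "highway" if seg["avg_speed"] > 60 else "regular"
degree_model = lambda seg: "highway" if seg["junction_degree"] <= 2 else "regular"

held_out = [
    ({"avg_speed": 80, "junction_degree": 2}, "highway"),
    ({"avg_speed": 30, "junction_degree": 4}, "regular"),
    ({"avg_speed": 70, "junction_degree": 4}, "highway"),
]
weights = learn_weights([speed_model, degree_model], held_out)
label = stacked_predict([speed_model, degree_model], weights,
                        {"avg_speed": 75, "junction_degree": 3})
print(label)
```

The meta-level here is deliberately simple (accuracy-weighted voting); real stacked generalization would train a proper meta-classifier on the base models' held-out predictions.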
Rights Protection for Trajectory Streams

More and more trajectory data are available as streams due to the unprecedented prevalence of mobile positioning devices. Meanwhile, an increasing number of applications are designed to depend on real-time trajectory streams. Therefore, the protection of ownership rights over such data becomes a necessity. In this paper, we propose an online watermarking scheme that can be used for the rights protection of trajectory streams. The scheme works in a finite-window, single-pass streaming model. It embeds the watermark by modifying feature distances extracted from the streams. The fact that these feature distances can be recovered ensures a consistent overlap between the recovered watermark and the embedded one. Experimental results verify the robustness of the scheme against domain-specific attacks, including geometric transformations, noise addition, trajectory segmentation, and compression.

Mingliang Yue, Zhiyong Peng, Kai Zheng, Yuwei Peng
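The abstract does not give the embedding details; one common way to embed a bit by modifying a distance value is quantization index modulation (QIM), sketched below purely as an illustration. The step size and the functions `embed_bit` / `extract_bit` are assumptions, not the paper's actual scheme.

```python
def embed_bit(distance, bit, step=1.0):
    """Snap a feature distance to the nearest lattice point whose parity
    encodes the bit (QIM-style; illustrative, not the paper's exact method)."""
    q = round(distance / step)
    if q % 2 != bit:
        q += 1
    return q * step

def extract_bit(distance, step=1.0):
    """Recover the embedded bit from the quantized distance's parity."""
    return round(distance / step) % 2

# Round-trip check on one distance value.
d = 7.3
for bit in (0, 1):
    wd = embed_bit(d, bit)
    assert extract_bit(wd) == bit
print("round-trip ok")
```

The appeal of distance-based embedding for streams is that the carrier quantity (a distance between features) can be recomputed at detection time without the original data, which matches the abstract's recoverability argument.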
Efficient Detection of Emergency Event from Moving Object Data Streams

The advance of positioning technology enables us to collect moving object data streams online for many applications. One of the most significant applications is to detect emergency events through the observed abnormal behavior of objects for disaster prediction. However, continuously generated moving object data streams often accumulate into a massive dataset within a few seconds and thus challenge existing data analysis techniques. In this paper, we model the forming of an emergency event as the rolling of a snowball; that is, we compare a group of moving objects whose size changes rapidly (e.g., increases or decreases) to a snowball. Thus, the problem of emergency event detection can be resolved by snowball discovery. We provide two algorithms to find snowballs: a clustering-and-scanning algorithm with a time complexity of O(n²) and an efficient adjacency-list-based algorithm with a time complexity of O(n log n). The second method adopts adjacency lists to optimize efficiency. Experiments on both a real-world dataset and large synthetic datasets demonstrate the effectiveness, precision, and efficiency of our algorithms.

Limin Guo, Guangyan Huang, Zhiming Ding
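As a rough illustration of snowball discovery, the sketch below clusters each snapshot of object positions and flags a rapid change in the size of the largest group. The toy clustering is the quadratic union-find variant (the paper's adjacency-list algorithm avoids the O(n²) scan); the thresholds `eps` and `growth` are hypothetical.

```python
def clusters(points, eps=1.5):
    """Group points whose Euclidean distance <= eps using union-find
    (toy O(n^2) pairwise scan; the adjacency-list method avoids this)."""
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if sum((a - b) ** 2 for a, b in zip(points[i], points[j])) <= eps ** 2:
                parent[find(i)] = find(j)
    sizes = {}
    for i in range(len(points)):
        r = find(i)
        sizes[r] = sizes.get(r, 0) + 1
    return sorted(sizes.values(), reverse=True)

def is_snowball(prev_pts, curr_pts, growth=2.0):
    """Flag a rapid change in the largest group's size between two snapshots."""
    prev, curr = clusters(prev_pts)[0], clusters(curr_pts)[0]
    return curr >= growth * prev or prev >= growth * curr

# Objects scattered in one snapshot, then gathered in the next.
sparse = [(0, 0), (5, 5), (10, 0), (0, 10)]
gathered = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(is_snowball(sparse, gathered))
```

A real stream implementation would maintain the clustering incrementally across the window rather than reclustering every snapshot from scratch.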

Data Quality

Cost Reduction for Web-Based Data Imputation

Web-based data imputation enables the completion of incomplete data sets by retrieving absent field values from the Web. In particular, complete fields can be used as keywords in imputation queries for absent fields. However, due to the ambiguity of these keywords and the complexity of data on the Web, different queries may retrieve different answers for the same absent field value. To decide the most probable correct answer for each absent field value, existing methods issue many of the available imputation queries for each absent value and then vote to decide the most probable correct answer. As a result, a large number of imputation queries must be issued to fill all absent values in an incomplete data set, which incurs a large overhead. In this paper, we work on reducing the cost of Web-based data imputation in two respects: First, we propose a query execution scheme which can secure the most probable correct answer to an absent field value by issuing as few imputation queries as possible. Second, we recognize and prune queries that will probably fail to return any answers a priori. Our extensive experimental evaluation shows that our proposed techniques substantially reduce the cost of Web-based imputation without hurting its high imputation accuracy.

Zhixu Li, Shuo Shang, Qing Xie, Xiangliang Zhang
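The early-stopping idea behind such a query execution scheme can be sketched as follows: issue imputation queries one at a time and stop as soon as the leading answer can no longer be overtaken by the queries still outstanding. The function names and the simple plurality rule are illustrative assumptions, not the paper's exact scheme.

```python
from collections import Counter

def impute(queries, issue):
    """Issue imputation queries one by one; stop once the leading answer
    cannot be overtaken by the remaining queries (illustrative sketch)."""
    votes = Counter()
    total = len(queries)
    issued = 0
    for q in queries:
        ans = issue(q)  # `issue` returns an answer string or None
        issued += 1
        if ans is not None:
            votes[ans] += 1
        left = total - issued
        if votes:
            top = votes.most_common(2)
            lead = top[0][1]
            runner = top[1][1] if len(top) > 1 else 0
            if lead > runner + left:  # remaining queries cannot change the winner
                break
    best = votes.most_common(1)[0][0] if votes else None
    return best, issued

# Hypothetical query results for one absent field value.
answers = {"q1": "Paris", "q2": "Paris", "q3": "Paris", "q4": "Lyon", "q5": "Paris"}
best, n = impute(["q1", "q2", "q3", "q4", "q5"], answers.get)
print(best, n)
```

Here the answer is secured after three of the five queries, since the two outstanding queries could no longer overturn the lead; this is the cost saving relative to always issuing every available query.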
Incremental Quality Inference in Crowdsourcing

Crowdsourcing has attracted significant attention from the database community in recent years, and several crowdsourced databases have been proposed to incorporate human power into traditional database systems. One big issue in crowdsourcing is achieving high quality, because workers may return incorrect answers. A typical solution is to assign each question to multiple workers and combine the workers' answers to generate the final result. One big challenge arising in this strategy is to infer each worker's quality. Existing methods usually assume each worker has a fixed quality and compute it using qualification tests or historical performance. However, these methods cannot accurately estimate a worker's quality. To address this problem, we propose a worker model and devise an incremental inference strategy to accurately compute workers' quality. We also propose a question model and develop two efficient strategies that combine it with the worker model to compute each question's result. We implement our method and compare it with existing inference approaches on real crowdsourcing platforms using real-world datasets; the experiments indicate that our method achieves high accuracy and outperforms existing approaches.

Jianhong Feng, Guoliang Li, Henan Wang, Jianhua Feng
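A minimal sketch of incremental quality inference: each worker's quality is kept as a running (correct, total) count, answers to a question are combined by a quality-weighted vote, and the counts are then updated against the inferred result. This is an illustrative proxy for the paper's worker and question models, not their actual formulation.

```python
def infer(question_answers, quality, smoothing=1.0):
    """Quality-weighted vote for one question, followed by an incremental
    update of each worker's (correct, total) quality counts."""
    weights = {}
    for worker, ans in question_answers.items():
        c, t = quality.get(worker, (smoothing, 2 * smoothing))  # prior: 0.5
        weights[ans] = weights.get(ans, 0.0) + c / t
    result = max(weights, key=weights.get)
    for worker, ans in question_answers.items():
        c, t = quality.get(worker, (smoothing, 2 * smoothing))
        quality[worker] = (c + (ans == result), t + 1)
    return result

quality = {}  # worker id -> (correct, total), updated across questions
r1 = infer({"w1": "A", "w2": "A", "w3": "B"}, quality)
r2 = infer({"w1": "A", "w3": "A"}, quality)
print(r1, r2, quality["w3"])
```

Because the counts carry over between questions, a worker who disagreed with the inferred truth (w3 on the first question) casts a lighter-weighted vote on subsequent questions, which is the incremental flavor the abstract describes.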
Repair Diversification for Functional Dependency Violations

In practice, data are often found to violate functional dependencies, and are hence inconsistent. To resolve such violations, data are to be restored to a consistent state, known as a “repair”, while the number of possible repairs may be exponential. Previous works either consider optimal repair computation, to find one single repair that is (nearly) optimal w.r.t. some cost model, or discuss repair sampling, to randomly generate a repair from the space of all possible repairs.

This paper makes a first effort to investigate the repair diversification problem, which aims at generating a set of repairs by minimizing their costs and maximizing their diversity. There are several motivating scenarios where diversifying repairs is desirable. For example, in the recently proposed interactive repairing approach, repair diversification techniques can be employed to generate representative repairs that are likely to occur (small cost) and, at the same time, are dissimilar to each other (high diversity).

Repair diversification significantly differs from optimal repair computation and repair sampling in its framework and techniques. (1) Based on two natural diversification objectives, we formulate two versions of the repair diversification problem, both modeled as bi-criteria optimization problems, and prove the complexity of their related decision problems. (2) We develop algorithms for the diversification problems. These algorithms embed repair computation into the framework of diversification, and hence find desirable repairs without searching the whole repair space. (3) We conduct extensive performance studies to verify the effectiveness and efficiency of our algorithms.

Chu He, Zijing Tan, Qing Chen, Chaofeng Sha, Zhihui Wang, Wei Wang
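The bi-criteria objective (low cost, high diversity) admits a simple greedy sketch: repeatedly add the candidate repair that best trades its cost against its distance to the repairs already chosen. The cost and distance functions below are toy cell-difference counts, and the trade-off weight `lam` is hypothetical; the paper's cost model and algorithms are more involved.

```python
def pick_repairs(candidates, cost, dist, k, lam=1.0):
    """Greedy bi-criteria selection: each step adds the repair maximizing
    lam * (distance to chosen set) - cost (illustrative sketch)."""
    chosen = []
    pool = list(candidates)
    while pool and len(chosen) < k:
        def gain(r):
            diversity = min((dist(r, c) for c in chosen), default=0.0)
            return lam * diversity - cost(r)
        best = max(pool, key=gain)
        chosen.append(best)
        pool.remove(best)
    return chosen

# Toy repairs as value assignments; cost = cells changed from the dirty tuple,
# distance = cells on which two repairs disagree.
base = {"city": "NYC", "zip": "10001"}
repairs = [
    {"city": "NYC", "zip": "10002"},
    {"city": "Newark", "zip": "07101"},
    {"city": "NYC", "zip": "10003"},
]
cost = lambda r: sum(r[k] != base[k] for k in base)
dist = lambda a, b: sum(a[k] != b[k] for k in base)
chosen_two = pick_repairs(repairs, cost, dist, k=2)
print(chosen_two)
```

The greedy step mirrors the paper's framework at a high level: repairs are scored and selected inside the diversification loop, so the full (exponential) repair space never needs to be enumerated.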

Industrial Papers

BigOP: Generating Comprehensive Big Data Workloads as a Benchmarking Framework

Big Data is considered a proprietary asset of companies, organizations, and even nations. Turning big data into real treasure requires the support of big data systems. A variety of commercial and open-source products have been released for big data storage and processing. While big data users face the choice of which system best suits their needs, big data system developers face the question of how to evaluate their systems with regard to general big data processing needs. System benchmarking is the classic way of meeting both demands. However, existing big data benchmarks either fail to represent the variety of big data processing requirements, or target only one specific platform, e.g., Hadoop.

In this paper, with our industrial partners, we present BigOP, an end-to-end system benchmarking framework featuring the abstraction of representative Operation sets, workload Patterns, and prescribed tests. BigOP is part of an open-source big data benchmarking project, BigDataBench. BigOP's abstraction model not only guides the development of BigDataBench, but also enables automatic generation of tests with comprehensive workloads.

We illustrate the feasibility of BigOP by implementing an automatic test generation tool and benchmarking against three widely used big data processing systems, i.e., Hadoop, Spark, and MySQL Cluster. Three tests targeting three different application scenarios are prescribed. The tests involve relational data, text data, and graph data, as well as all operations and workload patterns. We report results following the test specifications.

Yuqing Zhu, Jianfeng Zhan, Chuliang Weng, Raghunath Nambiar, Jinchao Zhang, Xingzhen Chen, Lei Wang
A*DAX: A Platform for Cross-Domain Data Linking, Sharing and Analytics

We introduce the A*STAR Data Analytics and Exchange Platform (“A*DAX”), which is the backbone data platform for different programs and projects under the Urban Systems Initiative launched by the Agency for Science, Technology and Research in Singapore. The A*DAX aims to provide a centralized system for public and private sectors to manage and share data; meanwhile, it also provides basic data analytics and visualization functions for authorized parties to consume data. A*DAX is also a channel for developers to develop innovative applications based on real data to improve urban services.

In this paper, we focus on presenting the platform components that address challenges in data integration and processing. In particular, the A*DAX platform needs to dynamically fuse heterogeneous data from unpredictable sources, which makes traditional data integration mechanisms hard to apply. Also, the platform needs to process data with different dynamics (i.e., databases vs. data streams) at large scale. In our design, we use a semantic approach to handle data fusion and integration problems, and propose a hybrid architecture to process static and dynamic data to answer queries at the same time. Other issues concerning the A*DAX platform, e.g., security and privacy, are not covered in this paper.

Narayanan Amudha, Gim Guan Chua, Eric Siew Khuan Foo, Shen Tat Goh, Shuqiao Guo, Paul Min Chim Lim, Mun-Thye Mak, Muhammad Cassim Mahmud Munshi, See-Kiong Ng, Wee Siong Ng, Huayu Wu
Optimizing Database Load and Extract for Big Data Era

With growing and pervasive interest in Big Data, SQL relational databases need to compete with data management by Hadoop, NoSQL, and NoDB. Database research has mainly focused on result generation by query processing, but SQL databases require data to be in place before queries may be processed. The process of DB loading has been a bottleneck, leading to external ETL/ELT techniques for loading large data sets. This paper focuses on DB-engine-level techniques for optimizing both data loads and extracts in dbX, an MPP, shared-nothing SQL database available on in-house commodity hardware and cloud systems. The agile data loading of dbX exploits parallelism at multiple levels to achieve TBs of data loaded per hour, making it suitable for cloud and continuous actionable-knowledge applications. Implementation techniques at the DB engine level, extensions to load/extract syntax, and performance results are presented. Load optimization techniques also help speed up data extracts to flat files and CTAS-type SQL queries. We show linear scale-up with cluster scale-out for load/extract on public cloud and commodity hardware systems, without recourse to database tuning or the use of expensive database appliances.

K. T. Sridhar, M. A. Sakkeer
Web Page Centered Communication System Based on a Physical Property

We have developed a novel communication system that enables ALL users to share information and emotion effectively through ALL Web pages based on a physical property: a user can perceive others browsing the same pages through avatars on the Web browser, like face-to-face communication in the real world. The system optimizes the displayed and communicable users by the relationships among them, based on their information searching and sharing activities. Furthermore, the system also enables users to search not only Web pages but also appropriate users: we construct a new ranking model by combining real-space information (the user activities) with Web-space information (the hyperlinks). We also verify the effectiveness of the system through user studies in conditions where we expect the system to work especially well, i.e., information retrieval, collaboration tasks, and sharing of feelings.

Yuhki Shiraishi, Yukiko Kawai, Jianwei Zhang, Toyokazu Akiyama

Demo Papers

TaxiHailer: A Situation-Specific Taxi Pick-Up Points Recommendation System

This demonstration presents TaxiHailer, a situation-specific recommendation system for passengers who are eager to find a taxi. Given a query with a departure point, destination, and time, it recommends pick-up points within a specified distance, ranked by potential waiting time. Unlike existing works, we consider three sets of features to build regression models, as well as Poisson process models for road-segment clusters. We evaluate and choose the most appropriate models for each cluster under different situations. Also, TaxiHailer gives destination-aware recommendations for pick-up points, with driving directions. We evaluate our recommendation results based on real GPS datasets.

Leyi Song, Chengyu Wang, Xiaoyi Duan, Bing Xiao, Xiao Liu, Rong Zhang, Xiaofeng He, Xueqing Gong
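The Poisson-process component mentioned in the abstract can be illustrated with a small estimate: if taxi arrivals at a pick-up point follow a homogeneous Poisson process with rate λ, the expected wait for the next taxi is 1/λ, and λ can be estimated from historical pick-up counts. The window sizes and counts below are made up for illustration.

```python
def expected_wait_minutes(pickup_counts, window_minutes):
    """Expected wait at a pick-up point under a homogeneous Poisson
    arrival model: E[wait] = 1 / lambda, with lambda estimated from
    historical counts per observation window."""
    rate = sum(pickup_counts) / (len(pickup_counts) * window_minutes)
    return float("inf") if rate == 0 else 1.0 / rate

# Toy history: taxis observed at one corner over four 30-minute windows.
wait = expected_wait_minutes([6, 4, 5, 5], window_minutes=30)
print(round(wait, 1))
```

The actual system would fit separate models per road-segment cluster and per situation (time of day, weather, etc.), but the per-cluster estimate reduces to a rate calculation of this shape.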
Camel: A Journey Group T-Pattern Mining System Based on Instagram Trajectory Data

A Journey Group T-Pattern (JG T-Pattern) is a special kind of T-Pattern (trajectory pattern) in which a large number of users walk through a common trajectory; users are also allowed to depart from the trajectory several times. A travel route is an instance of a Journey Group, and hot travel routes can then be mined with the help of Camel. Instagram is a popular photo-sharing smartphone application based on a social network; it is widely used among tourists to record their journeys. In this paper, we focus on data generated by Instagram to discover the JG T-patterns of travel routes. Previous research on T-pattern mining focuses on GPS-based data, which differs from UGC-based (user-generated-content-based) data: the former is dense because it is often generated automatically at a certain pace, while the latter is sparse because it is generated only when users upload content. Therefore, a novel approach, called the Journey Group T-pattern Mining strategy, is proposed to deal with trajectory mining on sparse location data. The demo shows that Camel is an efficient and effective system for discovering Journey Groups.

Yaxin Yu, Xudong Huang, Xinhua Zhu, Guoren Wang
Harbinger: An Analyzing and Predicting System for Online Social Network Users’ Behavior

Online Social Networks (OSNs) are among the hottest innovations of the past years. For OSNs, user behavior is one of the important factors to study. This demonstration proposal presents Harbinger, an analyzing and predicting system for OSN users' behavior. In Harbinger, we focus on tweets' timestamps (when users post or share messages), visualize users' posting behavior as well as message retweet numbers, and build adjustable models to predict users' behavior. Predictions of users' behavior can be performed with the established behavior models, and the results can be applied to many applications such as tweet crawlers and advertisements.

Rui Guo, Hongzhi Wang, Lucheng Zhong, Jianzhong Li, Hong Gao
Cloud-Scale Transaction Processing with ParaDB System: A Demonstration

Scalability, flexibility, fault tolerance, and self-manageability are desirable features for data management in the cloud. This paper demonstrates ParaDB, a cloud-scale parallel relational database system optimized for intensive transaction processing. ParaDB satisfies the aforementioned four features without sacrificing ACID transactional requirements. ParaDB is designed to break the petabyte or exabyte barrier and scale out to many thousands of servers while providing transactional support with strong consistency.

Xiaoyan Guo, Yu Cao, Baoyao Zhou, Dong Xiang, Liyuan Zhao
BSMA-Gen: A Parallel Synthetic Data Generator for Social Media Timeline Structures

A synthetic social media data generator, namely BSMA-Gen, is introduced in this demonstration. It can generate timeline structures of social media in parallel. The data generator is part of BSMA, a benchmark for analytical queries over social media data. Both its internal process and its generated data are shown in the demonstration.

Chengcheng Yu, Fan Xia, Qunyan Zhang, Haixin Ma, Weining Qian, Minqi Zhou, Cheqing Jin, Aoying Zhou
A Mobile Log Data Analysis System Based on Multidimensional Data Visualization

The log data collected from mobile telecom services contains plenty of valuable information. The critical technical challenges to mobile log analysis include how to extract information from unformatted raw data, as well as how to visually represent and interact with the analysis results. In this paper, we demonstrate MobiLogViz, which seamlessly combines multidimensional data visualization and web usage mining techniques. By automatically processing the log data and providing coordinated views with various interactions, MobiLogViz effectively aids users in analyzing and exploring mobile log data.

Ting Liang, Yu Cao, Min Zhu, Baoyao Zhou, Mingzhao Li, Qihong Gan
Backmatter
Metadata
Title
Database Systems for Advanced Applications
Edited by
Sourav S. Bhowmick
Curtis E. Dyreson
Christian S. Jensen
Mong Li Lee
Agus Muliantara
Bernhard Thalheim
Copyright year
2014
Publisher
Springer International Publishing
Electronic ISBN
978-3-319-05813-9
Print ISBN
978-3-319-05812-2
DOI
https://doi.org/10.1007/978-3-319-05813-9