
2013 | Book

Database Systems for Advanced Applications

18th International Conference, DASFAA 2013, Wuhan, China, April 22-25, 2013. Proceedings, Part II

Edited by: Weiyi Meng, Ling Feng, Stéphane Bressan, Werner Winiwarter, Wei Song

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

This two-volume set, LNCS 7825 and LNCS 7826, constitutes the refereed proceedings of the 18th International Conference on Database Systems for Advanced Applications, DASFAA 2013, held in Wuhan, China, in April 2013. The 51 revised full papers and 10 short papers, presented together with 2 invited keynote talks, 1 invited paper, 3 industrial papers, 9 demo presentations, 4 tutorials, and 1 panel paper, were carefully reviewed and selected from a total of 227 submissions. The topics covered in Part 1 are social networks; query processing; nearest neighbor search; index; query analysis; XML data management; privacy protection; and uncertain data management. Part 2 covers graph data management; physical design; knowledge management; temporal data management; social networks; query processing; data mining; applications; and database applications.

Table of Contents

Frontmatter

Graph Data Management I

Shortest Path Computation over Disk-Resident Large Graphs Based on Extended Bulk Synchronous Parallel Methods

The Single Source Shortest Path (SSSP) computation over large graphs raises significant challenges to memory capacity and processing efficiency. Utilizing disk-based parallel iterative computing is an economical solution. However, the costs of disk I/O and communication affect performance heavily. This paper proposes a state-transition model for SSSP and then designs two optimization strategies based on it. First, we introduce a tunable hash index to reduce the scale of wasteful data loaded from the disk. Second, we propose a new iterative mechanism and design an Across-step Message Pruning (ASMP) policy to deal with the communication bottleneck. The experimental results illustrate that our SSSP computation is 2 times faster than a basic Giraph (a memory-resident parallel framework) implementation. Compared with Hadoop and Hama (disk-resident parallel frameworks), the speedup is 21 to 43.

Zhigang Wang, Yu Gu, Roger Zimmermann, Ge Yu
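
The following is a minimal, in-memory sketch of the bulk synchronous parallel (BSP) message-passing pattern that frameworks like Giraph and Hama implement for SSSP; it is illustrative only and omits the paper's disk-resident storage, hash index, and ASMP pruning. The graph representation and function names are our own.

```python
from collections import defaultdict

def bsp_sssp(graph, source):
    """Bulk-synchronous SSSP: vertices exchange distance messages
    between supersteps until no tentative distance improves."""
    # graph: {u: [(v, edge_weight), ...]}
    dist = defaultdict(lambda: float("inf"))
    dist[source] = 0.0
    messages = {source: 0.0}           # messages delivered at superstep start
    while messages:                    # one loop iteration == one superstep
        next_messages = {}
        for u, d in messages.items():
            if d > dist[u]:
                continue               # stale message: a shorter path is known
            for v, w in graph.get(u, []):
                cand = d + w
                if cand < dist[v] and cand < next_messages.get(v, float("inf")):
                    dist[v] = cand     # a real BSP engine applies this at the
                    next_messages[v] = cand  # start of the next superstep
        messages = next_messages
    return dict(dist)

print(bsp_sssp({"a": [("b", 1), ("c", 4)], "b": [("c", 1)]}, "a"))
```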
Fast SimRank Computation over Disk-Resident Graphs

There are many real-world applications based on similarity between objects, such as clustering, similarity query processing, information retrieval, and recommendation systems. SimRank is a promising measure of similarity based on the random surfer model. However, the computational complexity of SimRank is high, and several optimization techniques have been proposed. In this paper, the optimization of SimRank computation on disk-resident graphs is our primary focus. First, we suggest a new approach to compute SimRank. Then we propose optimization techniques that improve the time cost of the new approach from $O(kN^2D^2)$ to $O(kNL)$, where $k$ is the number of iterations, $N$ is the number of nodes, $L$ is the number of edges, and $D$ is the average degree of nodes. Meanwhile, a threshold sieving method is presented to reduce storage and computational cost. On this basis, an external algorithm computing SimRank on disk-resident graphs is introduced. In our experiments, the algorithm outperforms a competitor whose computational complexity is also $O(kNL)$.

Yinglong Zhang, Cuiping Li, Hong Chen, Likun Sheng
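
For reference, here is a naive iterative SimRank in Python, whose per-iteration cost over all node pairs and their in-neighbor pairs is the $O(kN^2D^2)$ baseline the abstract improves upon; this is the textbook definition, not the paper's optimized algorithm.

```python
def simrank(in_nbrs, nodes, c=0.8, k=10):
    """Naive SimRank: sim(a, b) is c times the average similarity of
    in-neighbor pairs of a and b; identical nodes score 1."""
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(k):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                    continue
                ia, ib = in_nbrs.get(a, []), in_nbrs.get(b, [])
                if not ia or not ib:
                    new[(a, b)] = 0.0      # no in-neighbors: similarity 0
                    continue
                total = sum(sim[(x, y)] for x in ia for y in ib)
                new[(a, b)] = c * total / (len(ia) * len(ib))
        sim = new
    return sim

# in_nbrs maps each node to its in-neighbors
print(simrank({"b": ["a"], "c": ["a"]}, ["a", "b", "c"])[("b", "c")])
```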
S-store: An Engine for Large RDF Graph Integrating Spatial Information

The semantic web data and the SPARQL query language allow users to write precise queries. However, the lack of spatial information limits the use of semantic web data in position-oriented queries. In this paper, we introduce spatial SPARQL, a variant of the SPARQL language, for querying RDF data integrated with spatial information. In addition, we design a novel index, the SS-tree, for evaluating spatial queries, and based on the index we propose a search algorithm. The experimental results show the effectiveness and the efficiency of our approach.

Dong Wang, Lei Zou, Yansong Feng, Xuchuan Shen, Jilei Tian, Dongyan Zhao

Physical Design

Physical Column Organization in In-Memory Column Stores

Cost models are an essential part of database systems, as they are the basis of query performance optimization. Disk based systems are well understood and sophisticated models exist to compare various data structures and to estimate query costs based on disk IO operations. Cost models for in-memory databases shift the focus from disk IOs to main memory accesses and CPU costs. However, modeling memory accesses is fundamentally different and common models do not apply anymore.

In this work, we examine the plan operations scan with equality selection, scan with range selection, positional lookup and insert in in-memory column stores regarding different physical column organizations. We consider uncompressed columns, bit compressed and dictionary encoded columns with sorted and unsorted dictionaries. Furthermore, we discuss tree indices on columns and dictionaries and present a detailed parameter evaluation, considering the number of distinct values, value skewness and value disorder. Finally, we present and evaluate a cost model based on cache misses for estimating the runtime of the discussed plan operations.

David Schwalb, Martin Faust, Jens Krueger, Hasso Plattner
Semantic Data Warehouse Design: From ETL to Deployment à la Carte

In recent decades, semantic databases ($\mathcal{S}\mathcal{D}\mathcal{B}$) have emerged and become operational databases, since the major vendors provide semantic support in their products. This is mainly due to the spectacular development of ontologies in several domains like e-commerce, engineering, medicine, etc. Contrary to a traditional database, where tuples are stored in a relational (table) layout, a $\mathcal{S}\mathcal{D}\mathcal{B}$ stores the ontology and its instances independently in one of three main storage layouts (horizontal, vertical, binary). Based on this situation, $\mathcal{S}\mathcal{D}\mathcal{B}$ become serious candidates for business intelligence projects built around Data Warehouse ($\mathcal{D}\mathcal{W}$) technology. The important steps of the $\mathcal{D}\mathcal{W}$ development life-cycle (user requirement analysis, conceptual design, logical design, ETL, physical design) are usually dealt with in isolation, mainly due to the complexity of each phase. The $\mathcal{D}\mathcal{W}$ technology is quite mature for traditional data sources; as a consequence, leveraging its steps to deal with semantic $\mathcal{D}\mathcal{W}$ becomes a necessity. In this paper, we propose a methodology covering the most important steps of the semantic $\mathcal{D}\mathcal{W}$ life-cycle. First, a mathematical formalization of ontologies, $\mathcal{S}\mathcal{D}\mathcal{B}$, and semantic $\mathcal{D}\mathcal{W}$ is given. User requirements are expressed at the ontological level by means of the goal-oriented paradigm. Second, the ETL process is expressed at the ontological level, independently of any implementation constraint. Third, different deployment solutions according to the storage layouts are proposed and implemented using the data access object design pattern. Finally, a prototype validating our proposal using the Lehigh University Benchmark ontology is given.

Ladjel Bellatreche, Selma Khouri, Nabila Berkani
A Specific Encryption Solution for Data Warehouses

Protecting Data Warehouses (DWs) is critical, because they store the secrets of the business. Although published work states that encryption is the best way to ensure the confidentiality of sensitive data while maintaining high performance, it adds overheads that jeopardize its feasibility in DWs. In this paper, we propose a Specific Encryption Solution tailored for DWs (SES-DW), using a numerical cipher with variable mixes of eXclusive Or (XOR) and modulo operators. Storage overhead is avoided by preserving each encrypted column's datatype, while transparent SQL rewriting avoids I/O and network bandwidth bottlenecks by discarding data roundtrips for encryption and decryption purposes. The experimental evaluation, using the TPC-H benchmark and a real-world sales DW with Oracle 11g and Microsoft SQL Server 2008, shows that SES-DW achieves better response time in both inserting and querying than standard and state-of-the-art encryption algorithms such as AES, 3DES, OPES, and Salsa20, while providing considerable security strength.

Ricardo Jorge Santos, Deolinda Rasteiro, Jorge Bernardino, Marco Vieira
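
To make the "variable mixes of XOR and modulo operators" idea concrete, here is a toy round-based cipher on integers. The round structure and subkeys are hypothetical: this is neither the paper's actual SES-DW construction nor cryptographically strong, merely a sketch of how XOR and modular arithmetic can be mixed reversibly while keeping a numeric value numeric.

```python
def toy_encrypt(value, keys, modulus=2**32):
    """Each round XORs a subkey, then adds another modulo `modulus`.
    The result stays in [0, modulus), preserving a numeric column type."""
    v = value % modulus
    for k_xor, k_add in keys:        # per-round subkeys (hypothetical)
        v = (v ^ k_xor) % modulus
        v = (v + k_add) % modulus
    return v

def toy_decrypt(cipher, keys, modulus=2**32):
    """Undo the rounds in reverse order: subtract, then XOR."""
    v = cipher % modulus
    for k_xor, k_add in reversed(keys):
        v = (v - k_add) % modulus
        v = (v ^ k_xor) % modulus
    return v

keys = [(0x5A5A5A5A, 12345), (0x3C3C3C3C, 67890)]
assert toy_decrypt(toy_encrypt(1000, keys), keys) == 1000
```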
NameNode and DataNode Coupling for a Power-Proportional Hadoop Distributed File System

Current works on power-proportional distributed file systems have not considered the cost of updating data sets that were modified (updated or appended) in a low-power mode, where a subset of nodes were powered off. Effectively reflecting the updated data is vital in making a distributed file system, such as the Hadoop Distributed File System (HDFS), power proportional. This paper presents a novel architecture, a NameNode and DataNode Coupling Hadoop Distributed File System (NDCouplingHDFS), which effectively reflects the updated blocks when the system goes into a high-power mode. This is achieved by coupling the metadata management and data management at each node to efficiently localize the range of blocks maintained by the metadata. Experiments using actual machines show that NDCouplingHDFS is able to significantly reduce the execution time required to move updated blocks by 46% relative to the normal HDFS. Moreover, NDCouplingHDFS is capable of increasing the throughput of the system that is supporting MapReduce by applying an index in metadata management.

Hieu Hanh Le, Satoshi Hikida, Haruo Yokota

Knowledge Management

Mapping Entity-Attribute Web Tables to Web-Scale Knowledge Bases

There are many entity-attribute tables on the Web that can be utilized for enriching the entities of knowledge bases (KBs). This requires schema mapping (matching) between the Web tables and the huge KBs. Existing schema mapping solutions are inadequate for mapping a Web table to a KB for many reasons, such as: (1) there are many duplicates of entities and their types in a KB; (2) the schema of a KB is often implicit, informal, and evolving over time; (3) a KB is typically very large in volume. In this paper, we propose a pure instance-based schema mapping solution to statistically find the effective mapping between a Web table and a KB via matched data examples. In addition, we propose efficient solutions for finding the matched data examples as well as the overall mapping of a table and a KB. Experiments over real data sets show that our solution is much more accurate than two baselines drawn from existing solutions. Results also show that our solution is feasible for mapping Web tables to large-scale KBs.

Xiaolu Zhang, Yueguo Chen, Jinchuan Chen, Xiaoyong Du, Lei Zou
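
A simplified illustration of the instance-based idea described above: score each KB attribute against a Web-table column by the overlap of their value sets. The Jaccard scoring and the data shapes are our assumptions, not the paper's statistical method.

```python
def best_attribute_match(column_values, kb_attributes):
    """Instance-based matching: rank KB attributes by value-set overlap
    with a web-table column. kb_attributes: {attr_name: set_of_values}."""
    col = {v.strip().lower() for v in column_values}
    scores = {
        attr: len(col & vals) / len(col | vals)   # Jaccard similarity
        for attr, vals in kb_attributes.items()
    }
    return max(scores, key=scores.get), scores

attr, scores = best_attribute_match(
    ["Paris", "Berlin", "Rome"],
    {"capital": {"paris", "berlin", "tokyo"}, "river": {"seine", "spree"}},
)
print(attr, scores)   # 'capital' wins on overlapping instances
```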
ServiceBase: A Programming Knowledge-Base for Service Oriented Development

In recent times we have witnessed several advances in modern web technology that have transformed the Internet into a global deployment and development platform. Such advances include Web 2.0 for large-scale collaboration; social computing for increased awareness; as well as cloud computing, which has helped virtualize resources over the Internet. As a result, this new computing environment presents developers with ubiquitous access to countless web services, along with computing resources, data resources, and tools. However, while these web services enable tremendous automation and re-use opportunities, new productivity challenges have also emerged: the same repetitive, error-prone, and time-consuming integration work needs to get done each time a developer integrates a new API. To address these challenges we have developed ServiceBase, a "programming" knowledge-base, where common service-related low-level logic can be abstracted, organized, incrementally curated, and thereby re-used by other application developers. A framework is also proposed for decomposing and mapping raw service messages into more common data constructs, thus making interpreting, manipulating, and chaining services simpler despite their underlying heterogeneity. Moreover, empowered by this knowledge, we expose a set of APIs to simplify the way web services can be used in application development.

Moshe Chai Barukh, Boualem Benatallah
On Leveraging Crowdsourcing Techniques for Schema Matching Networks

As the number of publicly available datasets grows, the demand for establishing links between these datasets is also getting higher and higher. Creating such links requires matching their schemas. Moreover, to use these datasets in meaningful ways, one often needs to match not only two, but several schemas. This matching process establishes a (potentially large) set of attribute correspondences between multiple schemas that constitute a schema matching network. Various commercial and academic schema matching tools have been developed to support this task. However, as matching is inherently uncertain, the heuristic techniques adopted by these tools give rise to results that are not completely correct. Thus, in practice, post-matching human expert effort is needed to obtain a correct set of attribute correspondences.

Addressing this problem, our paper demonstrates how to leverage crowdsourcing techniques to validate the generated correspondences. We design validation questions with contextual information that can effectively guide the crowd workers. We analyze how to reduce overall human effort needed for this validation task. Through theoretical and empirical results, we show that by harnessing natural constraints defined on top of the schema matching network, one can significantly reduce the necessary human work.

Nguyen Quoc Viet Hung, Nguyen Thanh Tam, Zoltán Miklós, Karl Aberer
MFSV: A Truthfulness Determination Approach for Fact Statements

How to determine the truthfulness of a piece of information is an increasingly urgent need for users. In this paper, we propose a method called MFSV to determine the truthfulness of fact statements. We first calculate the similarity between a piece of related information and the target fact statement, and capture the credibility ranking of the related information by combining importance ranking and popularity ranking. Based on these, the contribution of a piece of related information to the truthfulness determination is derived. Then we propose two methods to determine the truthfulness of the target fact statement. Finally, we run comprehensive experiments to show MFSV's practicality and high accuracy.

Teng Wang, Qing Zhu, Shan Wang

Temporal Data Management

A Mechanism for Stream Program Performance Recovery in Resource Limited Compute Clusters

Replication, the widely adopted technique for crash fault tolerance, introduces additional infrastructure costs for resource-limited clusters. In this paper we take a different approach to maintaining stream program performance during crash failures, based on the concept of automatic code generation. Albatross, the middleware we introduce for this task, maintains the same performance level during crash failures based on predetermined priority values assigned to each stream program. Albatross constructs different versions of the input stream programs (sample programs) with different levels of performance characteristics, and assigns the best-performing programs for normal operations. During node failure or node recovery, the potential use of a different version of a sample program is evaluated in order to bring the performance of each job back to its original level. We evaluated the effectiveness of this approach with three different real-world stream computing applications on the System S distributed stream processing platform. We show that our approach is capable of maintaining stream program performance even if half of the nodes of the cluster have crashed, using both the Apnoea and Regex applications.

Miyuru Dayarathna, Toyotaro Suzumura
Event Relationship Analysis for Temporal Event Search

Many news articles about events are reported on the Web daily, and people are getting more and more used to reading news articles online to know and understand what events happened. For an event (which may consist of several component events, i.e., episodes), people are often interested in the whole picture of its evolution and development along a timeline. This calls for modeling the dependency relationships between component events. Further, people may also be interested in component events that play important roles in the event's evolution or development. To satisfy users' needs in finding and understanding the whole picture of an event effectively and efficiently, we formalize in this paper the problem of temporal event search and propose a framework of event relationship analysis for searching events based on user queries. We define three kinds of event relationships, namely temporal relationship, content dependence relationship, and event reference relationship, for identifying to what extent a component event depends on another component event in the evolution of a target event (i.e., a query event). Experiments conducted on a real data set show that our method outperforms a number of baseline methods.

Yi Cai, Qing Li, Haoran Xie, Tao Wang, Huaqing Min

Social Networks II

Dynamic Label Propagation in Social Networks

Label propagation has been studied for many years, starting from a set of nodes with labels and then propagating to those without labels. In social networks, building complete user profiles, including interests and affiliations, contributes to systems such as link prediction and personalized feeds. Since most users' labels are not filled in, people are often employed to label these users, and the cost of human labeling is high if the data set is large. To reduce the expense, we need to select the optimal data set for labeling, i.e., the one that produces the best propagation result.

In this paper, we propose two algorithms, greedy and greedyMax, for selecting the optimal data set for labeling according to different user input. We select the data set according to two scenarios: 1) finding top-K nodes for labeling and then propagating to as many nodes as possible, and 2) finding a minimal set of nodes for labeling and then propagating so that the whole network receives at least one label. Furthermore, we analyze how the network structure affects the selection and propagation results. Our algorithms are suitable for most propagation algorithms. In the experiments, we evaluate our algorithms on 500 networks extracted from the film-actor table in Freebase according to the two scenarios. The performance, including input percentage, time cost, precision, and F1-score, is presented in the results. The results show that greedyMax achieves higher performance, with a better balance of precision and time cost, than the greedy algorithm. In addition, our algorithm adapts to user input with quick response.

Juan Du, Feida Zhu, Ee-Peng Lim
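
A small sketch of scenario (1): greedily pick top-K seed nodes by marginal neighborhood coverage, then propagate labels by neighbor majority. Both functions are simple stand-ins for the greedy/greedyMax algorithms named above, whose exact objectives differ.

```python
def greedy_seeds(graph, k):
    """Pick k seeds, each maximizing the number of still-uncovered neighbors."""
    covered, seeds = set(), []
    for _ in range(k):
        best = max((u for u in graph if u not in seeds),
                   key=lambda u: len(set(graph[u]) - covered))
        seeds.append(best)
        covered |= set(graph[best]) | {best}
    return seeds

def propagate(graph, labels):
    """Repeatedly give each unlabeled node the majority label among its
    already-labeled neighbors, until no node changes."""
    labels = dict(labels)
    changed = True
    while changed:
        changed = False
        for u in graph:
            if u in labels:
                continue
            hits = [labels[v] for v in graph[u] if v in labels]
            if hits:
                labels[u] = max(set(hits), key=hits.count)
                changed = True
    return labels

g = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
seeds = greedy_seeds(g, 1)
print(propagate(g, {s: "L0" for s in seeds}))
```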
MAKM: A MAFIA-Based k-Means Algorithm for Short Text in Social Networks

Short text clustering is an essential pre-process in social network analysis, where k-means is one of the most famous clustering algorithms for its simplicity and efficiency. However, k-means is unstable and sensitive to the initial cluster centers, and it can be trapped in local optima. Moreover, its parameter, the cluster number k, is hard to determine accurately. In this paper, we propose an improved k-means algorithm, MAKM (MAFIA-based k-means), equipped with a new feature extraction method, TT (Term Transition), to overcome these shortcomings. In MAKM, the initial centers and the cluster number k are determined by an improved algorithm for mining maximal frequent item sets. In TT, we claim that the co-occurrence of two words in a short text represents greater correlation, and each word has a certain probability of spreading to others. Experiments on real datasets show that our approach achieves better results.

Pengfei Ma, Yong Zhang
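
A rough sketch of the initialization idea, with frequent co-occurring word pairs standing in for MAFIA's maximal frequent item sets: the number of frequent patterns fixes k, and the mean vector of the texts containing each pattern seeds the corresponding center. The thresholds and feature choices are our assumptions, not the paper's.

```python
import numpy as np
from itertools import combinations
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def makm_like(texts, min_support=2):
    """Seed k-means from frequent word pairs: each frequent pair yields one
    initial center and the pair count fixes k (illustrative stand-in)."""
    pairs = Counter()
    for t in texts:
        pairs.update(combinations(sorted(set(t.lower().split())), 2))
    frequent = [p for p, c in pairs.items() if c >= min_support]
    assert frequent, "no frequent pairs: lower min_support"
    X = TfidfVectorizer().fit_transform(texts)
    centers = []
    for a, b in frequent:
        rows = [i for i, t in enumerate(texts)
                if {a, b} <= set(t.lower().split())]
        # center = mean TF-IDF vector of the texts containing the pair
        centers.append(np.asarray(X[rows].mean(axis=0)).ravel())
    km = KMeans(n_clusters=len(frequent), init=np.array(centers), n_init=1)
    return km.fit_predict(X)
```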
Detecting User Preference on Microblog

Microblogging attracts a tremendously large number of users and consequently affects their daily life deeply. Detecting user preference for profile construction on microblogs is significant and imperative, since it facilitates not only the enhancement of users' utilities but also the promotion of business value (e.g., online advertising, commercial recommendation). Users may be instinctively reluctant to expose their preferences in their own published messages for privacy reasons. However, their preferences can never be concealed in the information they read (or subscribe to), since users do need to get something useful from their reading, especially in the microblog setting. Based on this observation, in this work we detect user preference by filtering out followees' noisy postings under a dedicated commercial taxonomy, then clustering associated topics among followees, and finally selecting appropriate topics as preferences. Our extensive empirical evaluation confirms the effectiveness of the proposed method.

Chen Xu, Minqi Zhou, Feng Chen, Aoying Zhou

Query Processing II

MustBlend: Blending Visual Multi-Source Twig Query Formulation and Query Processing in RDBMS

Recently, in [3,9] a novel XML query processing paradigm was proposed where, instead of processing a visual XML query after its construction, query formulation and processing are interleaved by exploiting the latency offered by the GUI to filter irrelevant matches and prefetch partial query results. A key benefit of this paradigm is a significant improvement in the user waiting time (UWT), which refers to the duration between the time a user presses the "Run" icon and the time the user gets the query results. However, the current state-of-the-art approach that realizes this paradigm suffers from key limitations, such as the inability to correctly evaluate certain visual query conditions together when necessary, a large intermediate result space, and the inability to handle visual query modifications, limiting its usage in practical environments. In this paper, we present an RDBMS-based single- as well as multi-source XML twig query evaluation algorithm, called MustBlend (MUlti-Source Twig BLENDer), that addresses these limitations. A key practical feature of MustBlend is its portability, as it does not employ any special-purpose storage, indexing, or query cost estimation schemes. Experiments on real-world datasets demonstrate its effectiveness and superiority over existing methods based on the traditional paradigm.

Ba Quan Truong, Sourav S Bhowmick
Efficient SPARQL Query Evaluation via Automatic Data Partitioning

The volume of RDF data has increased very quickly within the last five years; e.g., the Linked Open Data cloud has grown from 2 billion to 50 billion RDF triples. With its excellent scalability, a cloud computing platform like Hadoop is a good choice for processing queries over large data sets. Previous work on evaluating SPARQL queries with Hadoop mainly focuses on reducing the number of joins through careful splitting of HDFS files and algorithms for generating Map/Reduce jobs. However, the way RDF data is partitioned also affects performance. Specifically, a good partitioning will greatly reduce, or even totally avoid, cross-node joins and significantly reduce the cost of query evaluation. Based on HadoopDB, this work processes SPARQL queries in a hybrid architecture where Map/Reduce takes charge of the computing tasks and an RDF query engine, RDF-3X, stores the data and evaluates join operations over local data. Based on an analysis of query workloads, we propose a novel algorithm for automatically partitioning RDF data. We also present an approximate solution for physically placing the partitions in order to reduce data redundancy. All the proposed approaches are evaluated by extensive experiments over large RDF data sets.

Tao Yang, Jinchuan Chen, Xiaoyan Wang, Yueguo Chen, Xiaoyong Du
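
As context for why partitioning matters, here is a baseline subject-hash partitioner: triples sharing a subject land on the same worker, so star-shaped joins on the subject run locally without cross-node traffic. The paper's workload-aware algorithm goes well beyond this common baseline.

```python
def partition_triples(triples, n_workers):
    """Baseline partitioning of (subject, predicate, object) triples by
    hashing the subject, so subject-star joins are node-local."""
    parts = [[] for _ in range(n_workers)]
    for s, p, o in triples:
        parts[hash(s) % n_workers].append((s, p, o))
    return parts

triples = [("ex:alice", "ex:knows", "ex:bob"),
           ("ex:alice", "ex:age", "30"),
           ("ex:bob", "ex:age", "31")]
print(partition_triples(triples, 2))  # both ex:alice triples co-located
```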
Content Based Retrieval for Lunar Exploration Image Databases

As a novel research direction following the recent new round of lunar exploration, content-based lunar image retrieval provides a convenient and efficient way of accessing relevant lunar remote sensing images by their visual content. In this paper, we introduce a novel method for mining relevant images in lunar exploration databases. We propose a novel feature descriptor derived from the relationships of salient craters in lunar images, together with a compound feature model organizing different features. Based on these features, similarity measurement rules and a retrieval algorithm are proposed and described in detail. Both theoretical analysis and experimental results are provided, verifying that our features and model are effective and that the method achieves good retrieval results in lunar image databases.

Hui-zhong Chen, Ning Jing, Jun Wang, Yong-guang Chen, Luo Chen
Searching Desktop Files Based on Access Logs

People often have trouble searching for a desktop file when they cannot remember the exact words of its filename. In this paper, we first propose an algorithm to generate access logs by monitoring desktop operations and implement a prototype; by running it on the computers of several selected participants, we collected a data set of access logs. We then propose a graph model to represent personal desktop files and their relationships, and highlight two file relationships (content relationship and time relationship) that help users search desktop files. Based on the graph model, we propose a desktop search method, and the experimental results show the feasibility and effectiveness of our methods.

Yukun Li, Xiyan Zhao, Yingyuan Xiao, Xiaoye Wang
An In-Memory/GPGPU Approach to Query Processing for Aspect-Oriented Data Management

Under the paradigm of aspect-oriented data management (AODM), cross-cutting concerns in the data model – like multi-language support or functional versioning – are to be encapsulated and separated from the core aspect data. At runtime, a re-weaving of data influenced by different aspects has to be done. Previous research demonstrated that running queries directly against the referential model of AODM for relational databases via SQL is slow and inefficient. This paper presents an approach to accelerate queries by using a native storage model for aspect specific data and a specialized in-memory as well as a GPGPU query method.

Bernhard Pietsch

Graph Data Management II

On Efficient Graph Substructure Selection

Graphs have a wide range of applications in many domains. The graph substructure selection problem is to find all subgraph isomorphic mappings of a query in multi-attributed graphs, such that each pair of matching vertices satisfies a set of selection conditions, each against an equality, range, or set-containment operator on a vertex attribute. Existing techniques for single-labeled graphs are developed under the assumption of identical label matching and thus cannot handle the general case in substructure selections. To this end, this paper proposes a two-tier index to support general selections via judiciously materializing certain mappings. Moreover, we propose efficient dynamic query processing and index construction algorithms. Comprehensive experiments demonstrate the effectiveness and efficiency of our approach.

Xiang Zhao, Haichuan Shang, Wenjie Zhang, Xuemin Lin, Weidong Xiao
Parallel Triangle Counting over Large Graphs

Counting the number of triangles in a graph is significant for complex network analysis. However, with the rapid growth of graph sizes, classical centralized algorithms cannot process triangle counting efficiently. Although some studies have proposed parallel triangle counting implementations on Hadoop, enhancing performance remains a challenging task. To efficiently solve the parallel triangle counting problem, we put forward a hybrid parallel triangle counting algorithm with efficient pruning methods. In addition, we propose a parallel sampling algorithm that avoids repeated edge sampling and produces high-precision results. We implement our patterns on a bulk synchronous parallel framework. Compared with the Hadoop-based implementation, gains of 2 to 13 times can be obtained in terms of execution time.

Wenan Wang, Yu Gu, Zhigang Wang, Ge Yu
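
Two building blocks behind such systems, shown here in sequential form: an exact node-iterator count and a simple edge-sampling estimator. The sampling scheme below is one generic variant, not necessarily the paper's repeated-sampling-free algorithm.

```python
import random
from itertools import combinations

def count_triangles(adj):
    """Exact node-iterator count; adj maps node -> set of neighbors.
    Each triangle is found once: we only pair neighbors with larger ids."""
    total = 0
    for u in adj:
        for v, w in combinations(sorted(x for x in adj[u] if x > u), 2):
            if w in adj[v]:
                total += 1
    return total

def estimate_triangles(edges, adj, samples=20000):
    """Sampling estimate: a random (edge, node) pair closes a triangle
    with probability 3T / (|E| * |V|), so hits scale back to T."""
    nodes, hits = list(adj), 0
    for _ in range(samples):
        (u, v), w = random.choice(edges), random.choice(nodes)
        if w in adj[u] and w in adj[v]:
            hits += 1
    return hits * len(edges) * len(nodes) / (3.0 * samples)

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(count_triangles(adj))                      # 1
print(estimate_triangles([(1, 2), (1, 3), (2, 3), (3, 4)], adj))
```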

Data Mining

Document Summarization via Self-Present Sentence Relevance Model

Automatic document summarization is always attractive to computer science researchers. A novel approach is proposed to address this topic, focusing mainly on the summarization of plain documents. Conventional summarization methods do not fully use the inter-sentence relevance, which is not preserved during processing. In contrast, to tackle this problem and incorporate the latent relations among sentences, our approach constructs sentence-level relevance structures for plain documents, and each sentence is scored with a significance value. Accordingly, important sentences "present" themselves automatically, and the summary paragraph is then generated by selecting the top-k scored sentences. Convergence of the algorithm is proved, and experiments conducted on two data sets (DUC 2006 and DUC 2007) show that the proposed model gives convincing results.

Xiaodong Li, Shanfeng Zhu, Haoran Xie, Qing Li
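
One common way to realize sentence-level relevance scoring with a convergence guarantee is PageRank-style power iteration over a sentence-similarity matrix; the sketch below is such a stand-in, not necessarily the paper's self-present model. It assumes a nonnegative similarity matrix with nonzero column sums.

```python
import numpy as np

def top_k_sentences(sim, k=3, d=0.85, iters=50):
    """Score sentences by damped power iteration over a similarity
    matrix, then return the indices of the top-k sentences."""
    n = sim.shape[0]
    M = sim / sim.sum(axis=0, keepdims=True)  # column-stochastic transition
    r = np.full(n, 1.0 / n)                   # uniform initial significance
    for _ in range(iters):
        r = (1 - d) / n + d * M @ r           # converges for d < 1
    return np.argsort(-r)[:k]

sim = np.array([[1.0, 0.5, 0.1],
                [0.5, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
print(top_k_sentences(sim, k=2))
```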
Active Semi-supervised Community Detection Algorithm with Label Propagation

Community detection is a fundamental problem in the analysis and understanding of complex networks, and it has attracted a lot of attention in the last decade. Active learning aims to achieve high accuracy using as little labeled data as possible. However, as far as we know, active learning has not been applied to community detection to improve the performance of discovering the community structure of complex networks. In this paper, we propose an active semi-supervised community detection algorithm with label propagation. First, we transform a given complex network into a weighted network, select some informative nodes using the weighted shortest path method, and label those nodes for community detection. Second, we use the labeled nodes to expand the labeled node set by propagating their labels according to an adaptive threshold. Third, we deal with the rest of the unlabeled nodes. Finally, we demonstrate our community detection algorithm on three real networks and one synthetic network. Experimental results show that our active semi-supervised method achieves better performance than several other community detection algorithms.

Mingwei Leng, Yukai Yao, Jianjun Cheng, Weiming Lv, Xiaoyun Chen
Computing the Split Points for Learning Decision Tree in MapReduce

The explosive growth of data brings more and more challenges and opportunities to data mining. In data mining, learning decision trees is a common method, in which determining split points is the key problem. Existing methods of calculating split points in the distributed setting on large data either (1) cause high communication overhead or (2) are not universal for different levels of skewness in the data distribution. In this paper, we study the properties of Gini impurity, a measure for determining split points, and design new algorithms for calculating split points in MapReduce. Empirical evaluation demonstrates that our method outperforms existing state-of-the-art techniques in communication cost and universality.

Mingdong Zhu, Derong Shen, Ge Yu, Yue Kou, Tiezheng Nie
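
For readers unfamiliar with the measure, here is a single-machine version of split-point selection by Gini impurity; distributing exactly this computation in MapReduce is what the paper optimizes.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Scan candidate thresholds on one numeric attribute and pick the
    one minimizing the weighted Gini impurity of the two sides."""
    order = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(1, len(order)):
        if order[i - 1][0] == order[i][0]:
            continue                      # equal values: not a valid cut
        left = [l for _, l in order[:i]]
        right = [l for _, l in order[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(order)
        threshold = (order[i - 1][0] + order[i][0]) / 2.0
        best = min(best, (score, threshold))
    return best                           # (weighted impurity, threshold)

print(best_split([1, 2, 8, 9], ["a", "a", "b", "b"]))  # splits at 5.0
```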
FP-Rank: An Effective Ranking Approach Based on Frequent Pattern Analysis

Ranking documents in terms of their relevance to a given query is fundamental to many real-life applications such as document retrieval and recommendation systems. Extensive studies in this area have focused on developing efficient ranking models. While ranking models are usually trained on given training datasets, besides the model training algorithms, the quality of the document features selected for model training also plays a very important role in model performance. The main objective of this paper is to present an approach to discover "significant" document features for the learning-to-rank (LTR) problem. We conduct a systematic exploration of frequent pattern-based ranking. First, we formally analyze the effectiveness of frequent patterns for ranking. Combined features, which constitute a large portion of frequent patterns, perform better than single features in capturing the rich underlying semantics of the documents and hence provide good feature candidates for ranking. Based on our analysis, we propose a new ranking approach called FP-Rank. Essentially, FP-Rank adopts frequent pattern mining algorithms to mine frequent patterns, and a new pattern selection algorithm is then adopted to select a set of patterns with high overall significance and low redundancy. Our experiments on real datasets confirm that, by incorporating effective frequent patterns to train a ranking model such as RankSVM, the performance of the ranking model can be substantially improved.

Yuanfeng Song, Kenneth Leung, Qiong Fang, Wilfred Ng

Applications

A Hybrid Framework for Product Normalization in Online Shopping

The explosive growth of products in both variety and quantity is obvious evidence of the boom in C2C (Customer-to-Customer) e-commerce. Product normalization, which determines whether products refer to the same underlying entity, is a fundamental data management task in the C2C market. However, product normalization in the C2C market is challenging because the data is noisy and lacks a uniform schema. In this paper, we propose a hybrid framework which achieves product normalization through schema integration and data cleaning. In the framework, a graph-based method is proposed to integrate the schemas. Missing data is filled in and incorrect data is repaired using evidence extracted from surrounding information, such as the title and textual description. We distinguish products by clustering on the product similarity matrix, which is learned through logistic regression. We conduct experiments on real-world data, and the experimental results confirm the effectiveness of our design in comparison with existing methods.

Li Wang, Rong Zhang, Chaofeng Sha, Xiaofeng He, Aoying Zhou
Staffing Open Collaborative Projects Based on the Degree of Acquaintance

We consider the team formation problem in open collaborative projects in large community settings such as the Open Source Software (OSS) community. Given a query specifying a set of required skills for an open project and an upper bound on team size, the goal is to find a team that maximizes the Degree of Acquaintance (DoA) and covers all the required skills in the query. We define the DoA in terms of team graph connectivity and edge weights, corresponding to the local clustering coefficient of each team member and the strength of social ties between team members, respectively. We perform a statistical analysis on historical data to show the importance of connectivity and social tie strength to the overall productivity of teams in open projects. We show that the defined problem is NP-hard and present three algorithms, namely PSTA, STA, and NFA, to solve it. We evaluate the algorithms on a dataset from the OSS community. The results show the effectiveness of the proposed algorithms in finding well-acquainted teams satisfying a given query.

Mohammad Y. Allaho, Wang-Chien Lee, De-Nian Yang
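
A toy reading of the DoA definition above: the average local clustering coefficient of team members in the team-induced subgraph plus the average tie strength. The combination rule and the edge-weight keying are assumptions made for illustration, not the paper's exact formula.

```python
def local_clustering(adj, u):
    """Fraction of u's neighbor pairs that are themselves connected."""
    nbrs = list(adj[u])
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for i, a in enumerate(nbrs) for b in nbrs[i + 1:]
                if b in adj[a])
    return 2.0 * links / (len(nbrs) * (len(nbrs) - 1))

def degree_of_acquaintance(team, adj, weight):
    """Toy DoA: mean local clustering inside the team plus mean tie
    strength; `weight` maps sorted member pairs to tie strength."""
    sub = {u: {v for v in adj[u] if v in team} for u in team}
    cc = sum(local_clustering(sub, u) for u in team) / len(team)
    ties = [weight[tuple(sorted((u, v)))]
            for u in team for v in sub[u] if u < v]
    return cc + (sum(ties) / len(ties) if ties else 0.0)

adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
w = {("a", "b"): 0.9, ("a", "c"): 0.7, ("b", "c"): 0.8}
print(degree_of_acquaintance({"a", "b", "c"}, adj, w))
```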

Industrial Papers

Who Will Follow Your Shop? Exploiting Multiple Information Sources in Finding Followers

WuXianGouXiang is an O2O (offline-to-online and vice versa) mobile application that recommends nearby coupons and deals to users, through which users can also follow the shops they are interested in. If the potential followers of a shop can be discovered, the merchant's targeted advertising can be more effective and the recommendations for users will also improve. In this paper, we propose to predict the link relations between users and shops based on following behavior. In order to better model the characteristics of the shops, we first adopt topic modeling to analyze the semantics of their descriptions, and then propose a novel approach, named INtent Induced Topic Search (INITS), to update the hidden topics of shops with and without a description. In addition, we leverage user logs and search engine results to obtain the similarity between users and shops. Then we adopt the latent factor model to calculate the similarity between users and shops, in which we use the multiple information sources to regularize the factorization. The experimental results demonstrate that the proposed approach is effective for detecting followers of shops and that the INITS model is useful for shop topic inference.

Liang Wu, Alvin Chin, Guandong Xu, Liang Du, Xia Wang, Kangjian Meng, Yonggang Guo, Yuanchun Zhou
Performance of Serializable Snapshot Isolation on Multicore Servers

Snapshot isolation (SI) is a widely studied concurrency control approach, with great impact in practice within platforms such as Oracle or SQL Server. Berenson et al. showed, though, that SI does not guarantee serializable execution; in certain situations, data consistency can be violated through concurrency between correct applications. Recently, variants of SI have been proposed that keep the key properties, such as (often) allowing concurrency between reads and updates, and that also guarantee that every execution will be serializable. We have had the opportunity to use three implementations of two different algorithms of this type, all based on the InnoDB open source infrastructure. We measure the performance attained by these implementations on high-end hardware with a substantial number of cores. We explore the impact of the differences in algorithm, and also of the low-level implementation decisions.

Hyungsoo Jung, Hyuck Han, Alan Fekete, Uwe Röhm, Heon Y. Yeom
A Hybrid Approach for Relational Similarity Measurement

Relational similarity measurement between word pairs is important in many natural language processing tasks such as information extraction and information retrieval. This paper proposes a hybrid approach to relational similarity measurement based on various aspects, including term co-occurrence, lexico-syntactic patterns, and their combinations. In this approach, we first extract two relation-term sets from sentences of Wikipedia documents in which the two words co-occur, and compute the semantic relatedness score of each word pair in the two relation-term sets. Second, we model the semantic relatedness value of two words, together with their frequencies, as a point in three-dimensional space. Afterward, we apply DBSCAN, the classic density-based spatial clustering algorithm, to group these 3D points. We finally calculate the similarity based on the clusters. We evaluate this hybrid approach using the well-known 374 SAT analogy questions. The experimental results show that our approach can significantly reduce the computational time for measuring relational similarity while achieving a relatively high score of 52.9% compared to the state of the art.

Zhao Lu, Zhixian Yan
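
To illustrate the clustering step, here is a minimal run of DBSCAN on 3D points of the kind described; the feature values are invented and the eps/min_samples settings are arbitrary stand-ins for the paper's parameters.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Each row is one word pair: (relatedness score 1, relatedness score 2,
# raw frequency). The exact three dimensions are assumptions.
points = np.array([
    [0.91, 0.88, 14.0],
    [0.89, 0.90, 13.0],
    [0.12, 0.15, 2.0],
    [0.10, 0.14, 3.0],
])
points[:, 2] /= points[:, 2].max()   # scale frequency into [0, 1]

labels = DBSCAN(eps=0.2, min_samples=2).fit_predict(points)
print(labels)                        # e.g. [0 0 1 1]; -1 would mean noise
```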

Demo Papers I: Data Mining

Subspace MOA: Subspace Stream Clustering Evaluation Using the MOA Framework

Most available static data are becoming more and more high-dimensional. Therefore, subspace clustering, which aims at finding clusters not only within the full dimensionality but also within subgroups of dimensions, has gained significant importance. Recently, the OpenSubspace framework was proposed to evaluate and explore subspace clustering algorithms in WEKA, with a rich body of state-of-the-art subspace clustering algorithms and measures. In parallel, the MOA (Massive Online Analysis) framework was developed, also on top of WEKA, to provide algorithms and evaluation methods for mining tasks on evolving data streams over the full space only.

Similar to static data, most streaming data sources are becoming high-dimensional, and tracking their evolving clusters is also becoming important and challenging. In this demonstrator we present, to the best of our knowledge, the first subspace clustering evaluation framework over data streams, called Subspace MOA. Our demonstrator follows the online-offline model used in most data stream clustering algorithms. In the online phase, users can select one of the three most famous summarization techniques to form the micro-clusters. In the offline phase, one of five subspace clustering algorithms can be selected. The framework is supported by a subspace stream generator, a visualization interface to present the evolving clusters over different subspaces, and various subspace clustering evaluation measures.

Marwan Hassani, Yunsu Kim, Thomas Seidl
Symbolic Trajectories in SECONDO: Pattern Matching and Rewriting

In this paper, we introduce a novel data model for representing symbolic trajectories, along with a pattern language enabling both the matching and the rewriting of trajectories. We illustrate in particular the trajectory data type and two operations for querying symbolic trajectories inside the database system Secondo. As an important application of our theory, the classification and depiction of a set of real trajectories according to several criteria is demonstrated.

Fabio Valdés, Maria Luisa Damiani, Ralf Hartmut Güting
ReTweet_p: Modeling and Predicting Tweets Spread Using an Extended Susceptible-Infected-Susceptible Epidemic Model

Retweeting is one of the most commonly used tools on Twitter. It offers an easy yet powerful way to propagate interesting tweets one has read to one's followers without auditing. Understanding and predicting the retweeting extent of tweets is valuable and important for a number of tasks such as hot topic detection, personalized message recommendation, fake information prevention, etc. Through an analysis of the similarities and differences between epidemic spread and tweet spread, we extend the traditional Susceptible-Infected-Susceptible (SIS) epidemic model into a model of tweet spread, and build a system called ReTweet_p to predict tweets' future retweeting trends based on the model. Experiments on the Chinese micro-blog Tencent show that the proposed model is superior to traditional prediction methods.

Yiping Li, Zhuonan Feng, Hao Wang, Shoubin Kong, Ling Feng
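
For background, here is one deterministic step of the classic SIS model that the paper extends: susceptible users become "infected" (retweet) in proportion to their contact with infected users, and infected users recover back to susceptible. The parameter values are illustrative only.

```python
def sis_step(S, I, beta, gamma, N):
    """One mean-field SIS step: beta = infection rate, gamma = recovery rate."""
    new_inf = beta * S * I / N   # susceptibles exposed to retweeting users
    rec = gamma * I              # users who stop retweeting
    return S - new_inf + rec, I + new_inf - rec

S, I, N = 9900.0, 100.0, 10000.0   # followers; 100 have already retweeted
for step in range(5):
    S, I = sis_step(S, I, beta=0.4, gamma=0.1, N=N)
    print(f"step {step + 1}: infected ~ {I:.0f}")
```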
TwiCube: A Real-Time Twitter Off-Line Community Analysis Tool

As a micro-blogging service, Twitter differs from other social network services in two ways: 1) the absence of mutual consent in establishing follow links and 2) being a mixture of news media and social network. A key question to ask in better understanding Twitter user behavior is which part of a user’s Twitter network reflects one’s real-life social network. TwiCube is an online tool that employs a novel algorithm capable of identifying a user’s real-life social community, which we call the user’s off-line community, purely from examining the link structure among the user’s followers and followees. Based on the identified off-line community, TwiCube provides a summary of the user’s interests, tweeting habits and neighborhood popularity analysis. Evaluations from real Twitter users demonstrate that our off-line community detection approach achieves high precision and recall in most cases.

Juan Du, Wei Xie, Cheng Li, Feida Zhu, Ee-Peng Lim

Demo Papers II: Database Applications

TaskCardFinder: An Aviation Enterprise Search Engine for Bilingual MRO Task Cards

While conventional search engines demonstrate the power of delivering information to one's fingertips, the complexity of enterprise information raises a number of challenges for enterprise search engines in the nature of unstructured content, task relevance, result presentation, and multiple languages. We show how we tackle these challenges in the aviation field through the development of an enterprise search engine for English-Chinese MRO (Maintenance, Repair and Overhaul) task cards, called TaskCardFinder. It enables technicians and planners to quickly find bilingual task cards related to a specific service request coming from airlines. Several context-awareness features are demonstrated, including context-aware preference search and recommendation, recall search by context, keyword suggestion, navigational and analysis-oriented search, bilingual search support, and dynamic result presentation. A user study was done at an international aviation MRO company. The system is demonstrated in the video at www.youtube.com/watch?v=vj7u_VfRFZw.

Qingwei Liu, Hao Wang, Tangjian Deng, Ling Feng
EntityManager: An Entity-Based Dirty Data Management System

Dirty data exist in many systems, and efficient and effective management of dirty data is in demand. Since data cleaning may result in the loss of useful data and in new dirty data, we attempt to manage dirty data without cleaning and to retrieve query results according to users' quality requirements. Since the entity is the unit for understanding objects in the world, and much dirty data arises from different descriptions of the same real-world entity, we propose EntityManager, a dirty data management system that takes the entity as the basic unit and keeps conflicts in the data as uncertain attributes. Even though the query language is SQL, queries in our system have different semantics on dirty data. In the demonstration, we will show a new philosophy for managing dirty data around entities. We will present our prototype, which allows loading dirty data and querying it according to users' requirements.

Hongzhi Wang, Xueli Liu, Jianzhong Li, Xing Tong, Long Yang, Yakun Li
RelRec: A Graph-Based Triggering Object Relationship Recommender System

Numerous services (email, motion sensors, etc.) are emerging and tend to function more comprehensively. What comes with this is increasing attention to collaboration between them. For example, IFTTT (IF This Then That) enables people to set triggering relationships between various services to be automatically executed in the cloud. RelRec is a triggering object relationship recommender system that builds a bipartite graph representing the relationships between services. We propose an algorithm to rate relationships by similarity, and we diversify the results with a modified classic method from graph theory.

Yuwen Dai, Guangyao Li, Ruoyu Li
IndoorDB: Extending Oracle to Support Indoor Moving Objects Management

Managing moving objects in indoor space has been a research focus in recent years, as most people live and work in indoor spaces, e.g., working in offices or living in apartments. In this paper, we present an extension of Oracle, named IndoorDB, to support indoor moving object management in a practical way. The extension is developed as a PL/SQL package and can be integrated into Oracle to offer new data types and operations for indoor location-based queries such as indoor navigation, hot spot detection, KNN, range queries, and so on. After an overview of the general features of IndoorDB, we discuss its architecture and implementation. Finally, a case study demonstrating IndoorDB is presented.

Qianyuan Li, Peiquan Jin, Lei Zhao, Shouhong Wan, Lihua Yue
HITCleaner: A Light-Weight Online Data Cleaning System

Data quality is essential in many applications. To reduce the harm of low-quality data, data cleaning is one effective solution. However, existing data cleaning systems clean data only in certain specific aspects and require relatively complex input. To clean data with complex quality problems for various kinds of users, we developed HITCleaner, a light-weight online data cleaning system that can handle various types of data quality problems. HITCleaner provides users an elegant interface to upload dirty data and download cleaned data. It also permits users to clean data flexibly with various parameters and components. In this demonstration, we present a tour of HITCleaner, highlighting a few of its key features. We will demonstrate examples of data cleaning. In particular, we will show the flexibility of HITCleaner for cleaning data.

Hongzhi Wang, Jianzhong Li, Ran Huo, Li Jia, Lian Jin, Xueying Men, Hui Xie
Backmatter
Metadata
Title
Database Systems for Advanced Applications
Edited by
Weiyi Meng
Ling Feng
Stéphane Bressan
Werner Winiwarter
Wei Song
Copyright year
2013
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-37450-0
Print ISBN
978-3-642-37449-4
DOI
https://doi.org/10.1007/978-3-642-37450-0