2014 | Book

Database and Expert Systems Applications

25th International Conference, DEXA 2014, Munich, Germany, September 1-4, 2014. Proceedings, Part II

Editors: Hendrik Decker, Lenka Lhotská, Sebastian Link, Marcus Spies, Roland R. Wagner

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science

About this book

This two-volume set, LNCS 8644 and LNCS 8645, constitutes the refereed proceedings of the 25th International Conference on Database and Expert Systems Applications, DEXA 2014, held in Munich, Germany, September 1-4, 2014. The 37 revised full papers, presented together with 46 short papers and 2 keynote talks, were carefully reviewed and selected from 159 submissions. The papers discuss a range of topics including: data quality; social web; XML keyword search; skyline queries; graph algorithms; information retrieval; XML; security; semantic web; classification and clustering; queries; social computing; similarity search; ranking; data mining; big data; approximations; privacy; data exchange; data integration; web semantics; repositories; partitioning; and business applications.

Table of Contents

Frontmatter

Social Computing

A Unified Semi-supervised Framework for Author Disambiguation in Academic Social Network

This paper addresses the author disambiguation problem in academic social networks, namely the synonym problem ("multiple names refer to one person") and the polysemy problem ("one name refers to multiple persons"). A unified semi-supervised framework is proposed to deal with both problems. First, the framework uses a semi-supervised approach to solve the cold-start problem in author disambiguation. Second, a robust training-data generation method based on multi-aspect similarity indicators is used, and a support vector machine is employed to model different kinds of feature combinations. Third, a self-taught procedure is proposed to resolve ambiguity in coauthor information and boost the performance of the other models. The proposed framework is verified on a large-scale real-world dataset and obtains promising results.

Peng Wang, Jianyu Zhao, Kai Huang, Baowen Xu
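
The abstract does not spell out the SVM-based modelling of feature combinations; the short Python sketch below (using scikit-learn) only illustrates the common pairwise set-up for this kind of disambiguation task. The feature choices, record fields and toy data are assumptions for illustration, not details from the paper.

# Hypothetical sketch of pairwise author disambiguation with an SVM.
# Feature choices and data are illustrative, not taken from the paper.
from sklearn.svm import SVC

def pair_features(a, b):
    # Multi-aspect similarity indicators between two author records.
    return [
        len(set(a["coauthors"]) & set(b["coauthors"])),    # shared coauthors
        len(set(a["venues"]) & set(b["venues"])),           # shared venues
        len(set(a["title_words"]) & set(b["title_words"])), # shared title words
    ]

r1 = {"coauthors": {"A. Chen"}, "venues": {"DEXA"}, "title_words": {"author", "network"}}
r2 = {"coauthors": {"A. Chen"}, "venues": {"DEXA"}, "title_words": {"author", "graph"}}
r3 = {"coauthors": {"K. Lee"},  "venues": {"VLDB"}, "title_words": {"storage"}}

# Labelled pairs: 1 = same person, 0 = different persons.
pairs = [(r1, r2, 1), (r1, r3, 0), (r2, r3, 0), (r1, r1, 1)]
X = [pair_features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

model = SVC(kernel="linear").fit(X, y)
print(model.predict([pair_features(r1, r2)]))  # expected: [1] (same person)
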
Designing Incentives for Community-Based Mobile Crowdsourcing Service Architecture

Good strategies for designing social media are important for their success, but current designs are usually ad hoc, relying on human intuition. In this paper, we present an overview of three community-based mobile crowdsourcing services that we have developed as case studies. In community-based mobile crowdsourcing services, people voluntarily contribute to help other people anytime and anywhere using mobile phones. The required task is usually trivial, so people can perform it with minimal effort and low cognitive load. This approach differs from traditional ones because service architecture designers need to consider the tradeoff among several types of incentives when designing the basic architecture. We then extract six insights from our experiences to show that motivating people is the most important factor in designing a mobile crowdsourcing service architecture. The design strategies of community-based mobile crowdsourcing services explicitly consider the tradeoff among multiple incentives. This differs significantly from the design of traditional crowdsourcing services, whose designers usually consider only a few incentives when designing their respective social media. The insights are valuable lessons learned while designing and operating the case studies and are essential to successful design strategies for building future, more complex crowdsourcing services.

Mizuki Sakamoto, Hairihan Tong, Yefeng Liu, Tatsuo Nakajima, Sayaka Akioka
Arguing Prism: An Argumentation Based Approach for Collaborative Classification in Distributed Environments

This paper focuses on collaborative classification, concerning how multiple classifiers learned from distributed data repositories can reach a consensus. Interpretable classifiers are favored in this context because they can be used to identify the key features influencing the classification outcome. To address this problem, we present Arguing Prism, an argumentation-based approach for collaborative classification. The proposed approach integrates ideas from modular classification rule induction and multi-agent dialogue games. In particular, argumentation is used to provide an interpretable classification paradigm in distributed environments, rather than voting mechanisms. The experimental results reveal that Arguing Prism performs better than voting schemes over individual classifier agents. Moreover, interpretable classification can be achieved without losing much classification performance when compared with ensemble classification paradigms. Further experimental results show that Arguing Prism outperforms comparable classification paradigms in the presence of inconsistent data, due to the advantages offered by argumentation.

Zhiyong Hao, Li Yao, Bin Liu, Yanjuan Wang

Similarity Search

Rank Aggregation of Candidate Sets for Efficient Similarity Search

Many current applications need to organize data with respect to mutual similarity between data objects. Generic similarity retrieval in large data collections is a tough task that has been drawing researchers' attention for two decades. A typical general strategy to retrieve the most similar objects to a given example is to access and then refine a candidate set of objects; the overall search costs (and search time) then typically correlate with the candidate set size. We propose a generic approach that combines several independent indexes by aggregating their candidate sets in such a way that the resulting candidate set can be one or two orders of magnitude smaller (while keeping the answer quality). This achievement comes at the expense of higher computational costs of the ranking algorithm, but experiments on two real-life and one artificial dataset indicate that the overall gain can be significant.

David Novak, Pavel Zezula
Predicting Pair Similarities for Near-Duplicate Detection in High Dimensional Spaces

The problem of near-duplicate detection consists in finding those elements within a data set which are closest to a new input element, according to a given distance function and a given closeness threshold. Solving such a problem for high-dimensional data sets is computationally expensive, since the amount of computation required to assess the similarity between any two elements increases with the number of dimensions. As a motivating example, an image or video sharing website would take advantage of detecting near-duplicates whenever new multimedia content is uploaded. Among different approaches, near-duplicate detection in high-dimensional data sets has been effectively addressed by SimPair LSH [11]. Built on top of Locality Sensitive Hashing (LSH), SimPair LSH computes and stores a small set of near-duplicate pairs in advance, and uses them to prune the candidate set generated by LSH for a given new element. In this paper, we develop an algorithm to predict a lower bound on the number of elements pruned by SimPair LSH from the candidate set generated by LSH. Since the computational overhead introduced by SimPair LSH to compute near-duplicate pairs in advance is rewarded by the possibility of using that information to prune the candidate set, predicting the number of pruned points is crucial. The pruning prediction has been evaluated through experiments over three real-world data sets. We also performed further experiments on SimPair LSH, confirming that it consistently outperforms LSH with respect to memory space and running time.

Marco Fisichella, Andrea Ceroni, Fan Deng, Wolfgang Nejdl
Fast Phonetic Similarity Search over Large Repositories

Analysis of unstructured data may be inefficient in the presence of spelling errors. Existing approaches use string similarity methods to search for valid words within a text, with a supporting dictionary. However, they are not rich enough to encode phonetic information to assist the search. In this paper, we present a novel approach for efficiently performing phonetic similarity search over large data sources, which uses a data structure called PhoneticMap to encode language-specific phonetic information. We validate our approach through an experiment over a data set using a Portuguese variant of a well-known repository, to automatically correct words with spelling errors.

Hegler Tissot, Gabriel Peschl, Marcos Didonet Del Fabro

Ranking

Named Entity Oriented Related News Ranking

To support the gathering of information from various viewpoints, we focus on descriptions of named entities (persons, organizations, locations) in news articles. We propose methods to rank news articles based on analyzing differences in descriptions of named entities. We extend the stakeholder mining proposed by Ogawa et al. and extract descriptions of named entities in articles. Then, four ranking measures (relatedness, diversity, difference of polarity, and difference of detailedness) are calculated by analyzing the presence or absence of named entities, the coverage of topics, and the polarity of descriptions. We carry out a user study and experiments to validate our methods.

Keisuke Kiritoshi, Qiang Ma
Semantic Path Ranking Scheme for Relational Keyword Queries

Existing works on keyword search over relational databases typically do not consider users' search intention for a query and return many answers which often overwhelm users. We observe that a database is in fact a repository of real world objects that interact with each other via relationships. In this work, we identify four types of semantic paths between objects and design an algorithm called pathRank to compute and rank the results of keyword queries. The answers are grouped by the types of semantic paths, which reflect different query interpretations, and are annotated to facilitate user understanding.

Zhong Zeng, Zhifeng Bao, Gillian Dobbie, Mong Li Lee, Tok Wang Ling
Fraud Indicators Applied to Legal Entities: An Empirical Ranking Approach

Legal persons (i.e., entities such as corporations, companies, partnerships, firms, associations, and foundations) may commit financial crimes or employ fraudulent activities like money laundering, tax fraud, or bankruptcy fraud. Therefore, in the Netherlands legal persons are automatically screened for misuse based on a set of so-called risk indicators. These indicators, which are based on data obtained from, among others, the Netherlands Chamber of Commerce, the Dutch police, and the Dutch tax authority, encompass information about certain suspicious behaviours and past insolvencies or convictions (criminal records). In order to determine whether there is an increased risk of fraud, we have devised a number of scoring functions to give a legal person a score on each risk indicator based on the registered information about the legal person and its representatives. These individual scores are subsequently combined and weighted into a total risk score that indicates whether a legal person is likely to commit fraud based on all risk indicators. This contribution reports on our two ranking approaches: one based on the empirical probabilities of the indicators and the other based on the information entropy rate of the empirical probabilities.

Susan van den Braak, Mortaza S. Bargh, Sunil Choenni
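
As a rough illustration of how per-indicator scores can be combined into a total risk score, consider the following Python sketch; the indicator names, scores and weights are invented for illustration and are not the paper's scoring functions.

# Hypothetical sketch of combining per-indicator scores into a total risk score.
indicator_scores = {            # score in [0, 1] per risk indicator
    "past_insolvencies": 0.8,
    "criminal_records_of_representatives": 0.4,
    "frequent_address_changes": 0.6,
}

weights = {                     # e.g. derived from empirical probabilities or entropy
    "past_insolvencies": 0.5,
    "criminal_records_of_representatives": 0.3,
    "frequent_address_changes": 0.2,
}

total_risk = sum(weights[k] * indicator_scores[k] for k in indicator_scores)
print(f"total risk score: {total_risk:.2f}")   # 0.8*0.5 + 0.4*0.3 + 0.6*0.2 = 0.64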

Data Mining

A Probabilistic Approach to Detect Local Dependencies in Streams

Given m source streams (X1, X2, ..., Xm) and one target data stream Y, at any time window w, we want to find out which source stream has the strongest dependency on the current target stream value. Existing solutions fail in several important dependency cases, such as not-similar-but-frequent patterns, signals with multiple lags, and single-point dependencies. To reveal these hard-to-detect local patterns in streams, a statistical-model-based framework is developed, together with an incremental update algorithm. Using the framework, a new scoring function based on conditional probability is defined to effectively capture the local dependencies between any source stream and the target stream. Immediate real-life applications include quickly identifying the causal streams with respect to a Key Performance Indicator (KPI) in a complex production system, and detecting locally correlated stocks for an interesting event in the financial system. We apply this framework to two real data sets to demonstrate its advantages compared with the Principal Component Analysis (PCA) based method [16] and a naive local Pearson implementation.

Qiyang Duan, MingXi Wu, Peng Wang, Wei Wang, Yu Cao
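
The conditional-probability score itself is not defined in the abstract; the sketch below only illustrates the general idea of scoring each source stream by how predictable the current target value is given that source's recent window. The histogram binning and the toy streams are assumptions, not the paper's model.

# Illustrative sketch: score each source stream by an estimated conditional
# probability P(target bin | source window bin), using simple histogram binning.
from collections import defaultdict

def bin_value(x, width=1.0):
    return int(x // width)

def dependency_score(source, target, w):
    """Estimate P(target[t] bin | source window bin) for the latest time t."""
    counts = defaultdict(lambda: defaultdict(int))
    for t in range(w, len(target)):
        src_bin = tuple(bin_value(v) for v in source[t - w:t])
        counts[src_bin][bin_value(target[t])] += 1
    t = len(target) - 1
    src_bin = tuple(bin_value(v) for v in source[t - w:t])
    seen = counts[src_bin]
    total = sum(seen.values())
    return seen[bin_value(target[t])] / total if total else 0.0

Y  = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
X1 = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]   # mirrors the target
X2 = [7] * len(Y)                            # constant, uninformative stream

for name, src in [("X1", X1), ("X2", X2)]:
    print(name, round(dependency_score(src, Y, w=2), 2))   # X1 1.0, X2 0.3 on this toy history
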
Inferring Topic-Level Influence from Network Data

Existing influence analysis research has largely focused on maximizing influence spread in the whole network, or on inferring the "hidden" network from a list of observations. There is little work on topic-level influence analysis. Although some works try to address this problem, their methods depend on a known social network structure and do not consider the temporal factor, which plays an important role in determining the degree of influence. In this paper, we take the temporal factor into account to infer the influential strength between users at the topic level. Our approach does not require the underlying network structure to be known. We propose a guided hierarchical LDA approach to automatically identify topics without using any structural information. We then construct the topic-level influence network incorporating the temporal factor to infer the influential strength among the users for each topic. Experimental results on two real-world datasets demonstrate the effectiveness of our method. Further, we show that the proposed topic-level influence network can improve the precision of user behavior prediction and is useful for influence maximization.

Enliang Xu, Wynne Hsu, Mong Li Lee, Dhaval Patel
A Synergy of Artificial Bee Colony and Genetic Algorithms to Determine the Parameters of the Σ-Gram Distance

In a previous work we presented the Σ-gram distance, which computes the similarity between two sequences. This distance includes parameters that we calculated by means of an optimization process using the artificial bee colony algorithm. In another work we showed how population-based bio-inspired algorithms can be sped up by applying a method that uses a pre-initialization stage to yield an optimal initial population. In this paper we apply this pre-initialization method to the artificial bee colony algorithm to calculate the parameters of the Σ-gram distance. We show through experiments how this pre-initialization method can substantially speed up the optimization process.

Muhammad Marwan Muhammad Fuad
Web Application Relations Mapping

Web applications are often developed and modified so fast that architects, developers, and managers lose control over the quality of such software products. Reverse engineering techniques focused on different aspects of software implementation might help keep comprehension of a web application's implementation at the level required for fulfilling change requests. We focus our effort on the reconstruction of implicit associations among GUI data items and back-end database entities. Our approach is based on association analysis and mining over a synchronized log of GUI events and their related SQL statements. Recovered dependencies between the GUI and back-end databases can be utilized advantageously in an automated design of web application data flow testing.

Radek Mařík, Zdeněk Kouba, Michal Pantůček
Towards Practical Anomaly-Based Intrusion Detection by Outlier Mining on TCP Packets

An intrusion detection system (IDS) is an important part of the security of large networks like the Internet. With increasing amounts of data being transmitted from one subnetwork to another every day, the system needs to identify intrusions in such large datasets in an effective and timely manner. The application of knowledge discovery therefore comes in handy to identify unusual accesses or attacks. Improving an IDS's performance and accuracy is one of the major challenges in network security research today. In this paper, we propose a practical anomaly-based IDS using outlier mining of the readily available basic Transmission Control Protocol (TCP) header information as well as other easily derivable attributes. We use a two-step approach of k-means clustering and a one-class support vector machine (SVM) to model the normal sessions present in the MIT DARPA '99 dataset. We then feed the testing set to the resulting model to predict the attack sessions.

Prajowal Manandhar, Zeyar Aung
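
The two-step modelling of normal sessions can be read in more than one way; the sketch below shows one plausible combination (cluster normal traffic with k-means, then fit a one-class SVM per cluster) using scikit-learn on synthetic data, not the DARPA '99 features or the authors' exact pipeline.

# Hypothetical sketch: model "normal" sessions with k-means + per-cluster one-class SVMs,
# then flag test sessions that the matching SVM rejects as anomalies.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Synthetic "normal" TCP-session features (e.g. duration, bytes, packet count).
normal = np.vstack([
    rng.normal(loc=[1.0, 200.0, 10.0], scale=[0.2, 20.0, 2.0], size=(200, 3)),
    rng.normal(loc=[5.0, 50.0, 3.0],  scale=[0.5, 10.0, 1.0], size=(200, 3)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(normal)
svms = {}
for c in range(kmeans.n_clusters):
    members = normal[kmeans.labels_ == c]
    svms[c] = OneClassSVM(nu=0.05, gamma="scale").fit(members)

def is_attack(session):
    cluster = kmeans.predict(session.reshape(1, -1))[0]
    return svms[cluster].predict(session.reshape(1, -1))[0] == -1  # -1 means outlier

print(is_attack(np.array([1.0, 210.0, 11.0])))    # close to normal traffic -> likely False
print(is_attack(np.array([50.0, 9000.0, 500.0]))) # far from anything seen -> likely True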

Big Data

Survey on Big Data and Decision Support Benchmarks

Benchmarking is a common practice for the evaluation of database computer systems. By executing certain benchmarks, manufacturers and researchers are able to highlight the characteristics of a certain system and are also able to rank the system against the rest. Big Data, on the other hand, is currently a hot topic. It concerns dealing efficiently with information that is challenging to handle due to volume, velocity or variety. As more and more platforms are proposed to deal with Big Data, it becomes important to have benchmarks that can be used to evaluate the performance characteristics of such platforms. At the same time, Decision Support applications are related to Big Data, as they need to efficiently deal with huge datasets. In this paper we describe benchmarks representing Decision Support Systems (TPC-H, SSB, TPC-DS) and benchmarks for the Big Data class (YCSB, BigBench, and BigFrame), in order to help users choose the most appropriate one for their needs. We also characterize the relationship between Big Data benchmarks and Decision Support benchmarks.

Melyssa Barata, Jorge Bernardino, Pedro Furtado
Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

Processing XML queries over big XML data using MapReduce has been studied in recent years. However, the existing works focus on partitioning XML documents and distributing XML fragments to different compute nodes. This may introduce high overhead in transferring XML fragments from one node to another during MapReduce execution. Motivated by the structural join based XML query processing approach, which uses only the relevant inverted lists to process queries in order to reduce I/O cost, we propose a novel technique that uses MapReduce to distribute the labels in inverted lists across a computing cluster, so that structural joins can be performed in parallel to process queries. We also propose an optimization technique to reduce the computing space in our framework and improve the performance of query processing. Last, we conduct experiments to validate our algorithms.

Huayu Wu
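
The structural join itself (independent of how MapReduce distributes the inverted lists) can be illustrated with the classic containment test over region-encoded labels; the following Python sketch is a minimal generic version, not the paper's parallel algorithm.

# Minimal sketch of a structural (ancestor-descendant) join over region-encoded labels.
# Each element is labelled (start, end); a contains d iff a.start < d.start and d.end < a.end.
def structural_join(ancestors, descendants):
    """Both input lists are sorted by start position, as in typical inverted lists."""
    results, stack, ai = [], [], 0
    for d in descendants:
        while ai < len(ancestors) and ancestors[ai][0] < d[0]:
            stack.append(ancestors[ai])
            ai += 1
        stack = [a for a in stack if a[1] > d[1]]   # keep only labels still enclosing d
        results.extend((a, d) for a in stack)
    return results

# //section//figure over a toy document: section labels joined with figure labels.
sections = [(1, 20), (5, 12), (25, 30)]
figures = [(7, 8), (15, 16), (26, 27)]
print(structural_join(sections, figures))
# [((1, 20), (7, 8)), ((5, 12), (7, 8)), ((1, 20), (15, 16)), ((25, 30), (26, 27))]
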
Optimization Strategies for Column Materialization in Parallel Execution of Queries

All parallel query processing frameworks need to determine the optimality norms for column materialization. We investigate the performance trade-offs of alternative column materialization strategies. We propose a common parallel query processing approach that encapsulates varying column materialization strategies within exchange nodes in query execution plans. Our experimental observations confirm the theoretically deduced trade-offs, which suggest that the optimality norms depend on the scale of the cluster, the data transmissions required for a query, and the predicate selectivities involved. Lastly, we have applied a probit statistical model to the experimental data in order to establish a system-dependent ad hoc performance estimation method that can be used to select the optimal materialization strategy at runtime.

Chi Ku, Yanchen Liu, Masood Mortazavi, Fang Cao, Mengmeng Chen, Guangyu Shi
Evaluating Cassandra Scalability with YCSB

NoSQL data stores appeared to fill a gap in the database market: that of highly scalable data storage that can be used for simple storage and retrieval of key-indexed data while allowing easy data distribution over a possibly large number of servers. Cassandra has been pinpointed as one of the most efficient and scalable among currently existing NoSQL engines. Scalability of these engines means that adding nodes allows more requests to be served with the same performance, and that more nodes can reduce the execution time of requests. However, we will see that adding nodes does not always result in a performance increase, and we investigate how the workload, the database size and the level of concurrency are related to the achieved scaling level. We overview the Cassandra data store engine and then evaluate experimentally how it behaves with respect to scaling and request time speedup. We use the YCSB (Yahoo! Cloud Serving Benchmark) for these experiments.

Veronika Abramova, Jorge Bernardino, Pedro Furtado

Approximations

An Improved Method for Efficient PageRank Estimation

PageRank is a link analysis method for estimating the importance of nodes in a graph, and has been successfully applied in a wide range of applications. However, its computational complexity is known to be high. Besides, in many applications only a small number of nodes are of interest. To address this problem, several methods for estimating the PageRank score of a target node without accessing the whole graph have been proposed. In particular, Chen et al. proposed an approach where, given a target node, a subgraph containing the target is induced to locally compute the PageRank score. Nevertheless, its computation is still time consuming, because a number of iterative processes are required when constructing a subgraph for subsequent PageRank estimation. To make it more efficient, we propose an improved approach in which a subgraph is recursively expanded by solving a linear system without any iterative computation. To assess the efficiency of the proposed scheme, we conduct a set of experimental evaluations. The results reveal that our proposed scheme can estimate the PageRank score more efficiently than the existing approach while maintaining the estimation accuracy.

Yuta Sakakura, Yuto Yamaguchi, Toshiyuki Amagasa, Hiroyuki Kitagawa
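
The idea of replacing iterative PageRank computation by solving a linear system can be illustrated on a toy subgraph; the Python sketch below uses the standard formulation (I - d P^T) pi = ((1 - d)/n) 1 and is not the authors' recursive expansion scheme.

# Generic sketch: exact PageRank of a small (sub)graph by solving a linear system
# instead of iterating until convergence.
import numpy as np

edges = [(0, 1), (1, 2), (2, 0), (0, 2)]   # toy directed (sub)graph
n, d = 3, 0.85

# Column-stochastic transition matrix P^T restricted to the subgraph.
out_deg = np.zeros(n)
for u, _ in edges:
    out_deg[u] += 1
PT = np.zeros((n, n))
for u, v in edges:
    PT[v, u] = 1.0 / out_deg[u]

pi = np.linalg.solve(np.eye(n) - d * PT, (1 - d) / n * np.ones(n))
print(pi, pi.sum())   # node scores; they sum to 1 when there are no dangling nodes
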
An Enhanced Genetic Algorithm for Web Service Location-Allocation

Network latency has a significant impact on the response time of web services. Thus, the proper choice of network locations for the deployment of web services is of major importance for their performance. In this paper, we present an enhanced genetic algorithm with a self-adaptive feature and a memory filter to solve the location-allocation problem for web services. A simulated experiment is conducted using the WS-DREAM dataset with 8 different complexities. The results show that our approach is able to efficiently compute good solutions for the location-allocation problem.

Hai Huang, Hui Ma, Mengjie Zhang
Ascertaining Spam Web Pages Based on Ant Colony Optimization Algorithm

Web spam troubles both Internet users and search engine companies, because it seriously damages the reliability of search engines, harms Web users, and degrades the quality of Web information. This paper discusses a Web spam detection method inspired by the Ant Colony Optimization (ACO) algorithm. The approach consists of two stages: preprocessing and Web spam detection. In the preprocessing stage, the class-imbalance problem is solved by using a clustering technique, and an optimal feature subset is culled by chi-square statistics. The dataset is also discretized based on the information entropy method. These steps make spam detection in the second stage more efficient and easier. In the detection stage, a spam detection model is built based on the ant colony optimization algorithm. Experimental results on WEBSPAM-UK2006 reveal that our approach can achieve the same or even better results with a smaller number of features.

Shou-Hong Tang, Yan Zhu, Fan Yang, Qing Xu
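
The chi-square feature selection used in the preprocessing stage corresponds to a standard technique; a minimal scikit-learn sketch on synthetic data (not the WEBSPAM-UK2006 features) is shown below.

# Sketch of the preprocessing idea only: pick a feature subset by chi-square statistics.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(1)
X = rng.integers(0, 10, size=(100, 8)).astype(float)   # 8 non-negative synthetic features
y = (X[:, 0] + X[:, 3] > 10).astype(int)               # label depends on features 0 and 3

selector = SelectKBest(chi2, k=2).fit(X, y)
print(selector.get_support(indices=True))               # indices of the selected features
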
Reducing Redundant Information in Search Results Employing Approximation Algorithms

It is widely accepted that there are many Web documents that contain identical or near-identical information. Modern search engines have developed duplicate detection algorithms to eliminate this problem in the search results, but difficulties still remain, mainly because the structure and the content of the results could not be changed. In this work we propose an effective methodology for removing redundant information from search results. Using previous methodologies, we extract from the search results a set of composite documents called SuperTexts and then, by applying novel approximation algorithms, we select the SuperTexts that best reduce the redundant information. The final results are then ranked according to their relevance to the initial query. We give some complexity results and experimentally evaluate the proposed algorithms.

Christos Makris, Yannis Plegas, Yannis C. Stamatiou, Elias C. Stavropoulos, Athanasios K. Tsakalidis

Privacy

Dynamic Privacy Policy Management in Services-Based Interactions

Technology advancements have enabled the distribution and sharing of patient personal health data over several data sources. Each data source is potentially managed by a different organization, which exposes its data as a Web service. Using such Web services, the dynamic composition of atomic data type properties, coupled with the context in which the data is accessed, may expose sensitive data in ways that do not comply with the users' preferences at the time of data collection. Thus, providing uniform access policies to such data can lead to privacy problems. Some fairly recent research has focused on providing solutions for dynamic privacy policy management. This paper advances these techniques and fills some gaps in the existing works, in particular by dynamically incorporating user access context into the privacy policy decision and its enforcement. We provide a formal model definition of the proposed approach and a preliminary evaluation of the model.

Nariman Ammar, Zaki Malik, Elisa Bertino, Abdelmounaam Rezgui
Ephemeral UUID for Protecting User Privacy in Mobile Advertisements

Recently, a universally unique identifier (UUID) for targeted ad services has been supported by a smartphone OS; it can be reset and/or halted by the user. However, when a user resets the UUID, all targeted ad libraries are initialized, which means that the user cannot control ad libraries one at a time. As the user interests managed under the same UUID are easily exchanged between one ad service provider and another, users are concerned about privacy violations. In addition, as the UUID can be tapped on an unencrypted network, a replay attack using the tapped UUID violates the user's privacy. In this paper, we propose a privacy-enhanced UUID that is generated by each ad service provider. As the UUID is encrypted together with time information on each access, the value of "Enc(Time, UUID)" changes frequently; we therefore call it an ephemeral UUID. The ephemeral UUID is shared by the same ad library across applications, but it cannot be shared with other ad libraries. Our UUID can be reset by the user and cannot be extracted on the network.

Keisuke Takemori, Toshiki Matsui, Hideaki Kawabata, Ayumu Kubota
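
The value "Enc(Time, UUID)" can be produced with any authenticated cipher; the Python sketch below uses AES-GCM from the third-party cryptography package, with a key handling and field layout chosen only for illustration rather than taken from the paper.

# Hedged sketch of the Enc(Time, UUID) idea: the ad provider encrypts the stable UUID
# together with a timestamp, so the value seen on the wire changes on every access.
import os, time, uuid
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

provider_key = AESGCM.generate_key(bit_length=128)   # held by the ad service provider
stable_uuid = uuid.uuid4()

def ephemeral_uuid(key, stable):
    aesgcm = AESGCM(key)
    nonce = os.urandom(12)
    payload = int(time.time()).to_bytes(8, "big") + stable.bytes
    return nonce + aesgcm.encrypt(nonce, payload, None)   # differs on every call

def recover(key, token):
    aesgcm = AESGCM(key)
    payload = aesgcm.decrypt(token[:12], token[12:], None)
    return int.from_bytes(payload[:8], "big"), uuid.UUID(bytes=payload[8:])

t1, t2 = ephemeral_uuid(provider_key, stable_uuid), ephemeral_uuid(provider_key, stable_uuid)
print(t1 != t2)                                     # True: network observers see different values
print(recover(provider_key, t1)[1] == stable_uuid)  # True: the provider can still link them
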
Voronoi-Based Spatial Cloaking Algorithm over Road Network

Privacy preservation has recently received considerable attention in location-based services. A large number of location cloaking algorithms have been proposed for protecting the location privacy of mobile users over road networks. However, most existing cloaking approaches traverse the entire road network repetitively, which has a significant negative impact on communication and anonymization efficiency. To address this problem, we propose a network Voronoi-based location anonymization algorithm. We employ the network Voronoi diagram to pre-divide the road network into network Voronoi cells. The problem of anonymization over the global road network is thus reduced to finding cloaking sets in local network Voronoi cells. Experimental results validate the efficiency and effectiveness of the proposed algorithm.

Xiao Pan, Lei Wu, Zhaojun Hu, Zheng Huo

Data Exchange

Unique Solutions in Data Exchange

Data exchange is the problem of transforming data structured according to a source schema into data structured according to a target schema, via a mapping specified by rules in the form of source-to-target tuple generating dependencies. In this context, given a source instance and a mapping, there might be more than one valid target instance that satisfies the mapping. This issue contradicts the main goal of exchanging data, namely to have a materialised target instance that can be used to answer queries over the target schema without reference to the original source instance. In this paper we introduce and solve the novel problem of definability abduction, which aims at finding extensions to the initial schema mappings that guarantee the uniqueness of the materialised target instance. We consider several semantic criteria to select reasonable extensions and provide provably sound and complete algorithms to generate these additions. We also provide a complexity analysis for different data exchange settings, including those with source and target dependencies.

Nhung Ngo, Enrico Franconi
Translatable Updates of Selection Views under Constant Complement

Given a lossless view associating a source relation with a set of target relations defined by selection queries over the source, we study how updates of the target relations can be consistently and univocally propagated to the underlying source relation. We consider a setting where some of the attributes in the schema are interpreted over some specific domain (e.g., the reals or the integers) whose data values can be compared beyond equality, by means of special predicates (e.g., smaller/greater than) and functions (e.g., addition and subtraction). The source schema is constrained by conditional domain constraints, which restrict the values that are admissible for the interpreted attributes whenever a certain condition is satisfied by the values taken by the non-interpreted ones.

We show how to decide whether insertions, deletions and replacements, as well as sequences of insertions and deletions, can be univocally propagated through lossless selection views. In general, a lossy view, which does not preserve the whole informative content of the source, can always be turned into a lossless one by means of a view complement, which provides the missing information. For lossy selection views, we show how to find complements that provide the smallest amount of information needed to achieve losslessness, so as to maximise the number of updates that can be propagated under the so-called constant complement principle, prescribing that the complement be invariant during update propagation.

Enrico Franconi, Paolo Guagliardo

Data Integration

Data Integration by Conceptual Diagrams

We present a visual language based approach and tool to perform data integration at the conceptual level, aiming to reduce the complexity of this task when integrating numerous and complex data sources. The visual language provides iconic operators to manipulate the constructs of the conceptual data schemas of database sources, in order to specify how to merge and map them into a reconciled schema. The proposed tool allows not only generating the relational reconciled schema, but also automatically generating the metadata and inference mechanisms that guarantee the loading and periodic refresh of data from the source databases. Finally, we evaluated CoDIL through a usability study.

Loredana Caruccio, Vincenzo Deufemia, Mara Moscariello, Giuseppe Polese
DatalogBlocks: Relational Logic Integration Patterns

Although most business application data is stored in relational databases, the programming languages in integration middleware systems (which connect applications) are not relational data-centric. To avoid unnecessary data shipments and achieve faster computation, some middleware system vendors consider "pushing down" integration operations closer to the database systems.

We address the opposite case, which is "moving up" relational logic programming to implement the integration semantics within a standard middleware system. These semantics can be described by the well-known Enterprise Integration Patterns. For declarative and more efficient middleware pipeline processing, we combine these patterns with Datalog+ and discuss their expressiveness and practical realization by example.

Daniel Ritter, Jan Bross
Secure Data Integration: A Formal Concept Analysis Based Approach

Integrating and sharing information across disparate data sources entails several challenges: autonomous data objects are split across multiple sources, are often controlled by different security paradigms, and are owned by different organizations. To offer a secure unique access point to these sources, we propose a two-step approach based on Formal Concept Analysis. First, it derives a global vision of the local access control policies. Second, it generates a mediated schema and the GAV/LAV mapping relations, while preserving local source properties such as security.

Mokhtar Sellami, Mohamed Mohsen Gammoudi, Mohand Said Hacid

Web Semantics

A Lateral Thinking Framework for Semantic Modelling of Emergencies in Smart Cities

Manual definition of models for emergency management scenarios is a demanding activity due to the huge number of different situations to be considered. It requires knowledge related to the crisis and emergency domains, to the context (e.g., a specific city and its current regulations), and to modelling techniques. In this paper, we propose to tackle this problem from a lateral thinking perspective and, following this line, we present a framework supporting the automatic creation of conceptual models of emergency management scenarios by means of semantic techniques. In particular, this framework relies on an ontology and on a set of semantic rules to manage, respectively, the domain and contextual knowledge, and on the design patterns approach to support the modelling activity. An experimental software implementation of the framework, based on SPARQL and applied to emergency scenarios in smart cities, demonstrates the viability of the approach.

Antonio De Nicola, Michele Melchiori, Maria Luisa Villani
An Approach for Automatic Query Expansion Based on NLP and Semantics

Nowadays, there is a huge amount of digital data stored in repositories that are queried by search systems that rely on keyword-based interfaces. Therefore, the retrieval of information from repositories has become an important issue. Organizations usually implement architectures based on relational databases that do not consider the syntax and semantics of the data. To solve this problem, they perform complex Extract, Transform and Load (ETL) processes from relational repositories to triple stores. However, most organizations do not carry out this migration due to lack of time, money and knowledge.

In this paper we present a methodology that performs automatic query expansion based on natural language processing and semantics to improve information retrieval from relational database repositories. We have integrated it into an existing system in a real Media Group organization and tested it to analyze its effectiveness. The results obtained are promising and show the interest of the proposal.

María G. Buey, Ángel Luis Garrido, Sergio Ilarri
Mapping Construction Compliant with Schema Semantics

A dominant characteristic of autonomous information sources is their heterogeneity, in terms of data formats and schemas. We need schema and data mappings between such sources in order to query them in a uniform and systematic manner. Guiding the discovery of mappings with automatic tools is one of the fundamental unsolved challenges of data interoperability. In this work we consider the problem of discovering mappings for schemas of autonomous sources that are gradually revealed. Using an overlay of peer databases as an example setting, we present a solution that discovers mappings which can be adapted to new and gradually revealed schema information. Mapping discovery is schema-centric and incorporates new semantics as they are unveiled.

Verena Kantere
Web Materialization Formulation: Modelling Feasible Solutions

The performance of multi-domain web search applications is typically hindered by the availability and accessibility of the web data sources. In this paper we consider web data materialization as a solution. The web data services are modelled via binding schema patterns (access patterns), thereby defining input and output dependencies between the participating data sources. Web materialization is formulated as a set of interdependent blocks, each being a deciding factor in formulating an obtainable materialization. In this work, consideration is given to the feasibility of the proposed set of web sources for the given materialization task. A model for analysing feasible materialization solutions in terms of reachability and bound is established. To demonstrate the effectiveness of this feasibility analysis model, an empirical study is performed on a set of materialization tasks ranging in their schema dependency complexity.

Srđan Zagorac, Russel Pears

Repositories

Efficient R-Tree Based Indexing for Cloud Storage System with Dual-Port Servers

Cloud storage systems such as Amazon's Dynamo and Google's GFS pose new challenges to the community to support efficient query processing for various applications. In this paper we propose RT-HCN, a distributed indexing scheme for multi-dimensional query processing in data centers, the infrastructure on which cloud systems are built. RT-HCN is a two-layer indexing scheme, which integrates an HCN-based routing protocol and R-Tree based indexing technology, and is proportionally distributed on every server. Based on the characteristics of HCN, we design a special index publishing rule and query processing algorithms to guarantee efficient data management for the whole network. We prove theoretically that RT-HCN is both query-efficient and space-efficient, such that each server maintains only a tolerable number of indices while a large number of users can concurrently process queries with low routing cost. We compare our design with RT-CAN, a similar design for traditional P2P networks. Experiments validate the efficiency of our proposed scheme and depict its potential implementation in data centers.

Fan Li, Wanchao Liang, Xiaofeng Gao, Bin Yao, Guihai Chen
Design and Evaluation of a Storage Framework for Supply Chain Management

Radio-frequency identification (RFID) technology is widely available and actively used in business areas such as supply chain management. We propose a storage framework for a supply chain management system to administer the movement and time history of a product with an RFID tag. The storage structure is adapted to handle movement and time history data based on an encoding scheme for multidimensional datasets. Three types of encoded values compactly keep all the routing path and temporal information of a product from factory shipment, and the information can be quickly traced by using these encoded values. This paper describes our storage framework for supply chain management and evaluates a database constructed based on the encoding scheme.

Tatsuo Tsuji, Ryota Shimono, Ken Higuchi
On Using Requirements Throughout the Life Cycle of Data Repository

Requirements engineering aims at providing a requirement specification with desirable properties such as completeness or accuracy. In the life cycle of a Data Repository ($\mathcal{D}\mathcal{R}$), user requirements are usually assumed to be homogeneous and are used mainly to define the conceptual model of a $\mathcal{D}\mathcal{R}$. In this paper, we study the interest of the requirements in the other phases of the life cycle of a $\mathcal{D}\mathcal{R}$. We propose a generic model based on ontologies to unify the vocabularies and requirements languages used. We then extend this model using the formal method B to check the consistency of the requirements with respect to the integrity constraints defined on the logical schema. Finally, we propose to select optimization structures of a $\mathcal{D}\mathcal{R}$ using the user requirements instead of SQL queries. Several experiments on the Star Schema Benchmark (SSB) confirm the interest of our proposition.

Stéphane Jean, Idir Ait-Sadoune, Ladjel Bellatreche, Ilyès Boukhari

Partitioning

Inter-Data-Center Large-Scale Database Replication Optimization – A Workload Driven Partitioning Approach

Inter-data-center asynchronous middleware replication between active-active databases has become essential for achieving continuous business availability. Near real-time replication latency is expected despite intermittent peaks in transaction volumes. Database tables are divided for replication across multiple parallel replication consistency groups, each having a maximum throughput capacity, but doing so can break transaction integrity, since it is often not known which tables can be updated by a common transaction. Independent replication also requires balancing resource utilization and latency objectives. Our work provides a method to optimize replication latencies while minimizing transaction splits among a minimum number of parallel replication consistency groups. We present a two-staged approach: log-based workload discovery and analysis, and history-based database partitioning. Experimental results from a real banking batch workload and a benchmark OLTP workload demonstrate the effectiveness of our solution, even for partitioning thousands of database tables in very large workloads.

Hong Min, Zhen Gao, Xiao Li, Jie Huang, Yi Jin, Serge Bourbonnais, Miao Zheng, Gene Fuh
A First Attempt to Computing Generic Set Partitions: Delegation to an SQL Query Engine

Partitions are a very common and useful way of organizing data in data engineering and data mining. However, partitions currently lack efficient and generic data management functionalities. This paper proposes advances in the understanding of this problem, as well as elements for solving it. We formulate the task as efficiently processing, evaluating and optimizing queries over set partitions in the setting of relational databases. We first demonstrate that there is no trivial relational modeling for managing collections of partitions. We formally motivate a relational encoding and show that one cannot express all the operators of the partition lattice and the set-theoretic operations as queries of the relational algebra. We provide multiple pieces of evidence of the inefficiency of FO (first-order) queries, and our experimental results reinforce this evidence. We claim that there is a strong need for the design of a dedicated system to manage set partitions, or at least for supplementing an existing data management system to which both data persistence and query processing could be delegated.

Frédéric Dumonceaux, Guillaume Raschia, Marc Gelgon
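
A common relational encoding stores a partition as (element, block) rows, so that lattice operations such as the meet become joins; the SQLite sketch below illustrates this encoding with invented table names, and is not necessarily the encoding the paper motivates.

# Sketch: a partition as (element, block) rows; the meet (common refinement) of two
# partitions is obtained with a join delegated to the SQL engine.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE p(elem TEXT, block TEXT);
    CREATE TABLE q(elem TEXT, block TEXT);
    INSERT INTO p VALUES ('a','1'),('b','1'),('c','2'),('d','2');
    INSERT INTO q VALUES ('a','x'),('b','y'),('c','y'),('d','y');
""")

# Two elements share a block of the meet iff they share a block in both p and q.
meet = conn.execute("""
    SELECT p.elem, p.block || '/' || q.block AS block
    FROM p JOIN q ON p.elem = q.elem
    ORDER BY p.elem
""").fetchall()
print(meet)   # [('a', '1/x'), ('b', '1/y'), ('c', '2/y'), ('d', '2/y')]
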
Algebra-Based Approach for Incremental Data Warehouse Partitioning

Horizontal Data Partitioning is an optimization technique well suited to optimizing star-join queries in Relational Data Warehouses. Most works focus on the static selection of a fragmentation schema. However, due to the evolution of data warehouses and the ad hoc nature of queries, the development of incremental algorithms for fragmentation schema selection has become a necessity. In this work, we present a Partitioning Algebra containing all the operators needed to update a schema when a new query arrives. To identify queries that should trigger a schema update, we introduce the notion of query profiling.

Rima Bouchakri, Ladjel Bellatreche, Zoé Faget

Business Applications

Multi-dimensional Sentiment Analysis for Large-Scale E-commerce Reviews

E-commerce reviews reveal customers' attitudes towards products, which is very helpful for customers who want to know other people's opinions on products of interest. Meanwhile, producers are able to learn the public sentiment on their products sold on e-commerce platforms. Generally, e-commerce reviews involve many aspects of products, e.g., appearance, quality, price, logistics, and so on. Therefore, sentiment analysis on e-commerce reviews has to cope with these different aspects. In this paper, we define each of these aspects as a dimension of the product, and present a multi-dimensional sentiment analysis approach for e-commerce reviews. In particular, we employ a sentiment lexicon expanding mechanism to remove word ambiguity among different dimensions, and propose an algorithm for sentiment analysis on e-commerce reviews based on rules and a dimensional sentiment lexicon. We conduct experiments on a large-scale dataset involving over 28 million reviews, and compare our approach with the traditional approach that does not consider dimensions of reviews. The results show that the multi-dimensional approach reaches a precision of 95.5% on average, and outperforms the traditional approach in terms of precision, recall, and F-measure.

Lizhou Zheng, Peiquan Jin, Jie Zhao, Lihua Yue
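
Dimension-wise lexicon scoring can be sketched very simply in Python; the lexicon entries, dimensions and counting rule below are invented for illustration and are not the paper's rules or expanded lexicon.

# Hedged sketch of per-dimension lexicon-based sentiment scoring for a review.
lexicon = {
    "quality":   {"sturdy": 1, "broke": -1, "flimsy": -1},
    "price":     {"cheap": 1, "bargain": 1, "overpriced": -1},
    "logistics": {"fast": 1, "delayed": -1, "lost": -1},
}

def dimension_sentiment(review):
    words = review.lower().split()
    scores = {}
    for dim, terms in lexicon.items():
        hits = [terms[w] for w in words if w in terms]
        scores[dim] = sum(hits) / len(hits) if hits else 0.0
    return scores

print(dimension_sentiment("Sturdy build and a real bargain but delivery was delayed"))
# {'quality': 1.0, 'price': 1.0, 'logistics': -1.0}
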
The Entity-Flow Perspective in Business Process Models

The research reported in this paper is grounded on the principles of the artifact-centric approach and stresses the importance of the entity-flow perspective in business process models with the help of a notation named ACTA. ACTA process models are based on a number of entity-flow patterns, which help readers understand the operation of the process before looking at the task descriptions. In particular, these patterns emphasize choice situations such as those related to the grouping or matching of input entities.

Giorgio Bruno
Projections of Abstract Interorganizational Business Processes

Distributed interorganizational business processes are constituted by cooperating private processes of different partners. These private processes must be aligned to fulfill the overall goal of the (virtual) interorganizational process. Process views are considered as an excellent concept for modeling such interorganizational processes by providing a good balance between the information needs and privacy requirements of the partners. We follow a top-down development approach, where first an abstract global process is designed. Then each task of the global process is assigned to one of the partners. Finally, a partitioning procedure is applied to generate views of the global process for each partner. The partners can then develop or adopt their private processes based on their views. We present a novel algorithm for the automatic partitioning of any block-structured process with any arbitrary assignment of partners and have proven its correctness.

Julius Köpke, Johann Eder, Markus Künstner
Backmatter
Metadata

Title: Database and Expert Systems Applications
Editors: Hendrik Decker, Lenka Lhotská, Sebastian Link, Marcus Spies, Roland R. Wagner
Copyright Year: 2014
Publisher: Springer International Publishing
Electronic ISBN: 978-3-319-10085-2
Print ISBN: 978-3-319-10084-5
DOI: https://doi.org/10.1007/978-3-319-10085-2
