
2015 | Book

Database and Expert Systems Applications

26th International Conference, DEXA 2015, Valencia, Spain, September 1-4, 2015, Proceedings, Part II

Edited by: Qiming Chen, Abdelkader Hameurlain, Farouk Toumani, Roland Wagner, Hendrik Decker

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

This two-volume set (LNCS 9261 and LNCS 9262) constitutes the refereed proceedings of the 26th International Conference on Database and Expert Systems Applications, DEXA 2015, held in Valencia, Spain, September 1-4, 2015.

The 40 revised full papers presented together with 32 short papers and 2 keynote talks were carefully reviewed and selected from 125 submissions. The papers discuss a range of topics including: temporal, spatial and high dimensional databases; semantic Web and ontologies; modeling, linked open data; NoSQL, NewSQL, data integration; uncertain data and inconsistency tolerance; database system architecture; data mining, query processing and optimization; indexing and decision support systems; modeling, extraction, social networks; knowledge management and consistency; mobility, privacy and security; data streams, Web services; distributed, parallel and cloud databases; information retrieval; XML and semi-structured data; data partitioning, indexing; data mining, applications; WWW and databases; data management algorithms.

These volumes also include accepted papers of the 8th International Conference on Data Management in Cloud, Grid and P2P Systems, Globe 2015, held in Valencia, Spain, September 2, 2015. The 8 full papers presented were carefully reviewed and selected from 13 submissions. The papers discuss a range of topics including: MapReduce framework: load balancing, optimization and classification; security, data privacy and consistency; query rewriting and streaming.

Inhaltsverzeichnis

Frontmatter

Knowledge Management and Consistency

Frontmatter
A Logic Based Approach for Restoring Consistency in P2P Deductive Databases

This paper stems from the work in [10], in which the declarative semantics of a P2P system is defined in terms of minimal weak models. Under this semantics each peer uses its mapping rules to import minimal sets of mapping atoms allowing it to satisfy its local integrity constraints. This behavior proves useful in real-world P2P systems, in which peers often use the available import mechanisms to extract knowledge from the rest of the system only if this knowledge is strictly needed to repair an inconsistent local database. An inconsistent peer, in its interaction with different peers, thus imports just the information allowing it to restore consistency, that is, minimal sets of atoms allowing the peer to enrich its knowledge so as to resolve inconsistency anomalies. The paper extends previous work by proposing a rewriting technique that allows modeling a P2P system, $${\mathcal{{PS}}}$$, as a unique logic program whose minimal models correspond to the minimal weak models of $${\mathcal{{PS}}}$$.

Luciano Caroprese, Ester Zumpano
Expert System with Web Interface Based on Logic of Plausible Reasoning

The paper presents an expert system based on Logic of Plausible Reasoning (LPR). This formalism reflects human ways of knowledge representation and reasoning. The knowledge is modeled using several kinds of formulas representing statements, hierarchies, similarities, dependencies and implications. Several types of inference patterns are defined. Knowledge uncertainty can be modeled. The paper is structured as follows. Research related to LPR is presented. Next, the formalism is introduced and a Web-based application, which was developed for this research, is described. Finally, a case study is presented – a prototype expert system which recommends a material and a technology for a casting process.

Grzegorz Legień, Bartłomiej Śnieżyński, Dorota Wilk-Kołodziejczyk, Stanisława Kluska-Nawarecka, Edward Nawarecki, Krzysztof Jaśkowiec
Extending Knowledge-Based Profile Matching in the Human Resources Domain

In the Human Resources domain the accurate matching between job positions and job applicant profiles is crucial for job seekers and recruiters. The use of recruitment taxonomies has proven to be of significant advantage in this area by enabling semantic matching and reasoning. Hence, the development of Knowledge Bases (KB) where curricula vitae and job offers can be uploaded and queried in order to obtain the best matches, by both applicants and recruiters, is highly important. We introduce an approach to improve the matching of profiles, starting by expressing job and applicant profiles as filters representing skills and competencies. Filters are used to calculate the similarity between concepts in the subsumption hierarchy of a KB. This is enhanced by adding weights and aggregates on filters. Moreover, we present an approach to evaluate over-qualification and introduce blow-up operators that transform certain role relations such that matching of filters can be applied.

Alejandra Lorena Paoletti, Jorge Martinez-Gil, Klaus-Dieter Schewe
Sensitive Business Process Modeling for Knowledge Management

Modern organizations are characterized by collaborative, highly dynamic, complex and knowledge-intensive processes, and are increasingly aware of the need to effectively identify, preserve, share and use the knowledge mobilized by their business processes. Business process modeling has thus become a primary concern for any organization seeking to improve the management of its individual and collective knowledge assets. This paper proposes a new meta-model for sensitive business process modeling for knowledge management, called BPM4KI (Business Process Meta-Model for Knowledge Identification), based on COOP, a core ontology of organizations' processes. The aim of this meta-model is to help identify and localize the crucial knowledge that is mobilized and created by these processes. It is illustrated through an application to a medical process in the context of the organization for the protection of motor-disabled people of Sfax, Tunisia (ASHMS).

Mariam Ben Hassen, Mohamed Turki, Faïez Gargouri

Mobility, Privacy and Security

Frontmatter
Partial Order Preserving Encryption Search Trees

As Internet services expand and proliferate, service users' data show an increase in volume as well as geographical dispersion, mainly due to the large number of personalized services that users often access today on a daily basis. This fact, however, presents a user privacy and data security challenge for service providers: how to protect their users' data from unauthorized access. In this paper we present a new tree-based data structure for storing encrypted information in order to support fast search, update, and delete operations on the encrypted data. The data structure relies on exposing limited ordering information about the data in order to locate it quickly. After showing that a totally order preserving encryption scheme is not secure, we describe a new tree data structure and assess its security and efficiency.

Kyriakos Ispoglou, Christos Makris, Yannis C. Stamatiou, Elias C. Stavropoulos, Athanasios K. Tsakalidis, Vasileios Iosifidis
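
As an illustration of the limited-order idea in this abstract, here is a toy sketch of a client-driven search tree over encrypted values. It is an assumption-laden illustration, not the paper's construction: the XOR "cipher" is a placeholder for a real semantically secure scheme, and the client is assumed to drive each traversal so the server only observes which branches are taken.

```python
# Toy sketch: a server-side tree of ciphertexts in which ordering is
# revealed only along root-to-leaf paths, never as a total order.
import os

KEY = os.urandom(16)

def enc(v: int) -> bytes:
    # Toy XOR 'encryption', a placeholder only; a real system would use
    # a semantically secure cipher.
    raw = v.to_bytes(8, "big")
    return bytes(b ^ KEY[i % len(KEY)] for i, b in enumerate(raw))

def dec(c: bytes) -> int:
    raw = bytes(b ^ KEY[i % len(KEY)] for i, b in enumerate(c))
    return int.from_bytes(raw, "big")

class Node:
    def __init__(self, ct):
        self.ct, self.left, self.right = ct, None, None

class PartialOrderTree:
    def __init__(self):
        self.root = None

    def insert(self, value):
        if self.root is None:
            self.root = Node(enc(value))
            return
        cur = self.root
        while True:
            # The key-holding client decrypts cur.ct to pick a branch;
            # the server only sees which child pointer is followed.
            go_left = value < dec(cur.ct)
            child = cur.left if go_left else cur.right
            if child is None:
                if go_left:
                    cur.left = Node(enc(value))
                else:
                    cur.right = Node(enc(value))
                return
            cur = child

    def contains(self, value):
        cur = self.root
        while cur is not None:
            v = dec(cur.ct)        # client-side comparison
            if value == v:
                return True
            cur = cur.left if value < v else cur.right
        return False

tree = PartialOrderTree()
for v in [42, 7, 99, 23]:
    tree.insert(v)
assert tree.contains(23) and not tree.contains(5)
```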
mobiSurround: An Auditory User Interface for Geo-Service Delivery

This paper describes original research carried out in the area of Location-Based Services (LBS) with an emphasis on Auditory User Interfaces (AUI) for content delivery. Previous work in this area has focused on accurately determining spatial interactions and informing the user mainly by means of the visual modality. mobiSurround is new research that builds upon these principles with a focus on multimodal content delivery and navigation, and in particular the development of an AUI. This AUI enables the delivery of rich media content and natural directions using audio. This novel approach provides a hands-free method for navigating a space while experiencing rich media content dynamically constructed using techniques such as phrase synthesis, algorithmic music and 3D soundscaping. This paper outlines the innovative ideas employed in the design and development of the AUI that provides an overall immersive user experience.

Keith Gardiner, Charlie Cullen, James D. Carswell
A Diversity-Seeking Mobile News App Based on Difference Analysis of News Articles

To support the efficient gathering of diverse information about a news event, we propose a diversity-seeking news app for smart mobile devices. First, extending our previous work on entity-oriented news analysis, we propose three measures for searching and ranking news articles from the perspectives of difference in opinions, difference in details, and difference in factor coverage. Then, using these measures, we develop a news app on mobile devices that helps users acquire diverse reports for improving news understanding. One notable feature of our system is a context-aware re-ranking method that enhances the diversity of news reports presented to users by considering their access history. The experimental results demonstrate the efficiency of our methods.

Keisuke Kiritoshi, Qiang Ma
KUR-Algorithm: From Position to Trajectory Privacy Protection in Location-Based Applications

Some obfuscation techniques may fail to protect user privacy because of the moving context as well as the background knowledge of adversaries. In this paper, we propose a novel scheme to protect user privacy at the level of both user position and user trajectory. Furthermore, we present the kUR-algorithm, which is context-aware and can be employed either as an independent method or as a supportive technique, to give high-level user privacy protection against privacy disclosure and privacy leaks. Last but not least, we analyse other potential privacy problems which usually emerge as outliers and show how well our proposed solution overcomes these scenarios.

Trong Nhan Phan, Josef Küng, Tran Khanh Dang

Data Streams, Web Services

Frontmatter
Candidate Pruning Technique for Skyline Computation Over Frequent Update Streams

Skyline query processing reveals a set of preferable results based on the competitiveness of many criteria among all data objects. This is a very useful query for multi-attribute decision making. Moreover, monitoring and tracing skylines over time-series data are also important, not only for real-time applications (e.g., environmental monitoring) but also for historical time-series analysis (e.g., sports archives, historical stock data). In these applications, between consecutive snapshots a large fraction of the fixed set of observed objects (e.g., weather stations) can change their values, possibly changing the previous skyline completely. Without a dedicated technique, recomputing the skyline from scratch is unavoidable and can even outperform some traditional skyline update methods. In this paper, we propose an efficient method to compute skyline sets over data update streams. Our proposed method uses bounding boxes to summarize consecutive data updates of each data object. This technique enables pruning that identifies a smaller set of candidates in skyline computation, resulting in faster total computation time. We conduct experiments on both synthetic and real-life datasets. The results show that our proposed method runs significantly faster than the baseline across various parameter settings.

Kamalas Udomlamlert, Takahiro Hara, Shojiro Nishio
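
The bounding-box pruning idea lends itself to a compact sketch. The following is a minimal illustration under assumed simplifications (two dimensions, minimization on all attributes, a plain dictionary of buffered updates); it is not the authors' algorithm, only the pruning test it builds on.

```python
# Each object's recent updates are summarized by a box; an object whose
# best corner is dominated by another object's worst corner can never be
# in the skyline, so it is pruned before any exact dominance tests.

def dominates(p, q):
    """p dominates q if p <= q on all attributes and < on at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def bounding_box(updates):
    """Best (lower-left) and worst (upper-right) corners of the updates."""
    dims = range(len(updates[0]))
    best = tuple(min(u[i] for u in updates) for i in dims)
    worst = tuple(max(u[i] for u in updates) for i in dims)
    return best, worst

def skyline_candidates(boxes):
    """Keep objects whose best corner no other worst corner dominates."""
    out = []
    for oid, (best, _) in boxes.items():
        if not any(dominates(worst2, best)
                   for oid2, (_, worst2) in boxes.items() if oid2 != oid):
            out.append(oid)
    return out

# Buffered updates per object (e.g., per weather station).
updates = {
    "s1": [(1, 9), (2, 8)],
    "s2": [(8, 10), (9, 9)],  # pruned: s1's worst corner (2, 9) dominates s2's best (8, 9)
    "s3": [(3, 2), (4, 3)],
}
boxes = {oid: bounding_box(us) for oid, us in updates.items()}
print(skyline_candidates(boxes))  # ['s1', 's3']
```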
Mining Frequent Closed Flows Based on Approximate Support with a Sliding Window over Packet Streams

Due to the varying and dynamic characteristics of network traffic, the analysis of traffic flows is of paramount importance for network security. In this context, the main challenge consists in mining the traffic flows with high accuracy and limited memory consumption. In this respect, we introduce a novel algorithm, which mines the approximate closed frequent patterns over a stream of packets within a sliding window model. The latter is based on a relaxation rate parameter as well as an approximate support concept. Our experimental results show the robustness and efficiency of our new algorithm against those in the literature.

Imen Brahmi, Hanen Brahmi, Sadok Ben Yahia
D-FOPA: A Dynamic Final Object Pruning Algorithm to Efficiently Produce Skyline Points Over Data Streams

Emerging technologies that support sensor networks are making available large volumes of sensed data. Commonly, sensed data entries are characterized by several sensed attributes and can be related to static attributes; both sensed and static attributes can be represented using Vertically Partitioned Tables (VPTs). Skyline-based ranking techniques provide the basis to distinguish the entries that best meet a user condition, and allow for the pruning of the space of potential answers. We tackle the problem of efficiently computing the skyline over sensed and static data represented as VPTs. We propose an algorithm named D-FOPA (Dynamic Final Object Pruning Algorithm), a rank-based approach able to dynamically adjust the skyline by processing changes in the values of sensed attributes. We conducted an empirical study on datasets of synthetic sensed data, and the results suggest that D-FOPA is not only able to scale up to large datasets, but also reduces the average execution time and number of comparisons of state-of-the-art approaches by up to one order of magnitude.

Stephanie Alibrandi, Sofia Bravo, Marlene Goncalves, Maria-Esther Vidal
GraphEvol: A Graph Evolution Technique for Web Service Composition

Web service composition can be thought of as the combination of reusable functionality modules available over the network to create applications that accomplish more complex tasks, and Evolutionary Computation (EC) techniques have been applied with success to this problem. Genetic Programming (GP) is a traditionally employed EC technique in this domain, and it encodes solutions as trees instead of their natural Directed Acyclic Graph (DAG) form. This complicates the enforcement of dependencies between service nodes, which is much easier to accomplish in a DAG. To overcome this we propose GraphEvol, an evolutionary technique that uses DAGs directly to represent and evolve Web service composition solutions. GraphEvol is analogous to GP, but it implements the mutation and crossover operators differently. Experiments were carried out comparing GraphEvol with GP for a series of composition tasks, with results showing that GraphEvol solutions either match or surpass the quality of those obtained using GP, at the same time relying on a more intuitive representation.

Alexandre Sawczuk da Silva, Hui Ma, Mengjie Zhang

Distributed, Parallel and Cloud Databases

Frontmatter
Can Data Integration Quality Be Enhanced on Multi-cloud Using SLA?

This paper identifies trends and open issues regarding the use of SLA in data integration solutions on multi-cloud environments. It therefore presents results of a Systematic Mapping [3] that analyzes the way SLA, data integration and multi-cloud environments are correlated in existing works. The main result is a classification scheme consisting of facets and dimensions, namely (i) data integration environment (cloud; data warehouse; federated database; multi-cloud); (ii) data integration description (knowledge; metadata; schema); and (iii) data quality (confidentiality; privacy; security; SLA; data protection; data provenance). The proposed classification scheme is used to organize a collection of representative papers and discuss the numerical analysis of research trends in the domain.

Daniel A. S. Carvalho, Plácido A. Souza Neto, Genoveva Vargas-Solar, Nadia Bennani, Chirine Ghedira
An Efficient Gear-Shifting Power-Proportional Distributed File System

Recently, power-aware distributed file systems for efficient big data processing have increasingly moved toward power proportional designs. However, inefficient gear-shifting in such systems is an important issue that can seriously degrade their performance. To address this issue, we propose and evaluate an efficient gear-shifting power proportional distributed file system. The proposed system utilizes flexible data placement that reduces the amount of reflected data and has an architecture that improves the metadata management to achieve high-efficiency gear-shifting. Extensive empirical experiments using actual machines based on HDFS demonstrated that the proposed system gains up to $$22\,\%$$ better throughput-per-watt performance. Moreover, a suitable metadata management setting corresponding to the amount of data updated while in low gear is found from the experimental results.

Hieu Hanh Le, Satoshi Hikida, Haruo Yokota
Highly Efficient Parallel Framework: A Divide-and-Conquer Approach

Coupling a database and a parallel-programming framework reduces the I/O overhead between them. However, there will be serious issues such as memory bandwidth limitations, load imbalances, and race conditions. Existing frameworks such as MapReduce do not resolve these problems because they adopt flat parallelization, i.e., partitioning a task without regard to its structure. In this paper, we propose a recursive divide-and-conquer-based method for spatial databases which supports high-throughput machine learning. Our approach uses a tree-based task structure, which improves the reference locality, and load balancing is realized by setting the grain size of tasks dynamically. Race conditions are also avoided. We applied our method to the task of learning a hierarchical Poisson mixture model. The results show that our approach achieves strong scalability and robustness against load-imbalanced datasets.

Takaya Kawakatsu, Akira Kinoshita, Atsuhiro Takasu, Jun Adachi
Ontology-Driven Data Partitioning and Recovery for Flexible Query Answering

Flexible Query Answering helps users find information relevant to their queries even if no exactly matching answers can be found in a database system. However, relaxing query conditions at runtime is inherently slow and does not scale as the data set grows. In this paper we propose a method to partition the data using an ontology that semantically guides the query relaxation. Moreover, if several different partitioning strategies are applied in parallel, a lookup table is maintained in order to recover the ontology-driven partitioning in case of data loss or server failure. We tested the performance of the partitioning and recovery strategy with a distributed SAP HANA database.

Lena Wiese

Information Retrieval

Frontmatter
Detecting Near-Duplicate Documents Using Sentence Level Features

In Web search engines, digital libraries and other types of online information services, duplicates and near-duplicates may cause severe problems if unaddressed. Typical problems include more space needed than necessary, longer indexing time and redundant results presented to users. In this paper, we propose a method of detecting near-duplicate documents. Two sentence-level features, the number of terms and the terms at particular positions, are used in the method. A suffix tree is used to match sentence blocks very efficiently. Experiments are carried out to compare our method with two other representative methods and show that our method is effective and efficient. It has potential to be used in practice.

Jinbo Feng, Shengli Wu
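
To make the two sentence-level features concrete, here is a hypothetical fingerprinting sketch. It substitutes difflib's SequenceMatcher for the paper's suffix-tree block matching; the fingerprint layout and the threshold are assumptions for illustration.

```python
# Each sentence is mapped to (term count, first term, last term); two
# documents are flagged as near-duplicates when their fingerprint
# sequences match closely enough. Small in-sentence edits often leave
# the fingerprint unchanged, which is what makes the feature cheap yet
# robust.

from difflib import SequenceMatcher

def sentence_fingerprints(text):
    fps = []
    for sent in text.split("."):
        terms = sent.split()
        if terms:
            fps.append((len(terms), terms[0].lower(), terms[-1].lower()))
    return fps

def near_duplicate(a, b, threshold=0.8):
    fa, fb = sentence_fingerprints(a), sentence_fingerprints(b)
    return SequenceMatcher(None, fa, fb).ratio() >= threshold

doc1 = "The cat sat on the mat. It was a sunny day. The dog barked."
doc2 = "The cat sat on the mat. It was a rainy day. The dog barked."
# 'sunny' -> 'rainy' changes neither the term count nor the boundary
# terms of the middle sentence, so all fingerprints still match.
print(near_duplicate(doc1, doc2))  # True
```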
A Dwell Time-Based Technique for Personalised Ranking Model

The aim of a Personalised Ranking Model (PRM) is to filter the top-k set of documents from a number of relevant documents matching the search query. Dwell times of previously clicked results have been shown to be valuable for estimating documents' relevance. The indexing structure of the dwell time is an important parameter. We propose a dwell time-based scoring scheme called Dwell-tf-idf to index text and non-text data, based on which search results are ranked. The effectiveness of incorporating the proposed Dwell-tf-idf scheme into the ranking process is validated by a controlled experiment which shows a significant improvement in the search results within the top-k rank.

Safiya Al-Sharji, Martin Beer, Elizabeth Uruchurtu
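
The abstract does not give the exact weighting, so the following is a hedged sketch of how a dwell-time signal could modulate a plain tf-idf score; the saturation constant and the multiplicative combination are assumptions, not the authors' formula.

```python
# Documents the user previously dwelled on longer get a bounded boost
# over their plain tf-idf score.

import math

def tf_idf(term_freq, doc_len, num_docs, doc_freq):
    tf = term_freq / doc_len
    idf = math.log(num_docs / (1 + doc_freq))
    return tf * idf

def dwell_weight(dwell_seconds, saturation=30.0):
    # Longer dwell implies relevance; the signal saturates so one very
    # long session does not dominate the ranking.
    return 1.0 + min(dwell_seconds, saturation) / saturation

def dwell_tf_idf(term_freq, doc_len, num_docs, doc_freq, dwell_seconds):
    return tf_idf(term_freq, doc_len, num_docs, doc_freq) * dwell_weight(dwell_seconds)

# A 25s dwell outranks an otherwise identical document with a 2s bounce.
print(dwell_tf_idf(3, 100, 10_000, 50, 25.0) >
      dwell_tf_idf(3, 100, 10_000, 50, 2.0))   # True
```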
An Evaluation of Diversification Techniques

Diversification is a method of improving user satisfaction by increasing the variety of information shown to users. Due to the lack of a precise definition of information variety, many diversification techniques have been proposed. These techniques, however, have rarely been compared and analyzed under the same setting, rendering the 'right' choice for a particular application very difficult. Addressing this problem, this paper presents a benchmark that offers a comprehensive empirical study of the performance of diversification techniques. Specifically, we integrate several state-of-the-art diversification algorithms in a comparable manner, and measure distinct characteristics of these algorithms under various settings. We then provide an in-depth analysis of the benchmark results, obtained using both real and synthetic data. We believe that the findings from the benchmark will serve as a practical guideline for potential applications.

Duong Chi Thang, Nguyen Thanh Tam, Nguyen Quoc Viet Hung, Karl Aberer

XML and Semi-structured Data

Frontmatter
TOIX: Temporal Object Indexing for XML Documents

Managing temporal data is in increasing demand in many significant areas, such as scientific and financial applications. This paper proposes a new index (TOIX), designed to provide more efficient evaluation of temporal queries on XML documents. The index relies on mapping twigs into temporal objects and then using these objects to answer the query. An improvement of the naive algorithm using a B-tree is also presented. Finally, a set of experiments was conducted, showing that our proposed algorithm outperforms state-of-the-art indexing algorithms in certain cases.

Rasha Bin-Thalab, Neamat El-Tazi
Expressing and Processing Path-Centric XML Queries

A family of practical queries, which aim to return or manipulate paths as first class objects, cannot be expressed by XPath or XQuery FLWOR expressions. In this paper, we propose a seamless extension to XQuery FLWOR to elegantly express path-centric queries. We further investigate the expression and processing of intra-path aggregation, an analytical operation in path-centric queries.

Huayu Wu, Dongxu Shao, Ruiming Tang, Tok Wang Ling, Stéphane Bressan
A Logical Framework for XML Reference Specification

In this paper we focus on a (as much as possible) simple logic, called $${\mathsf {XHyb}}$$, expressive enough to allow the specification of the most common integrity constraints in XML documents. In particular we will deal with constraints on ID and IDREF(S) attributes, which are the common way of logically connecting parts of XML documents, besides the usual containment relation of XML elements.

C. Combi, A. Masini, B. Oliboni, M. Zorzi
XQuery Testing from XML Schema Based Random Test Cases

In this paper we present the elements of an XQuery testing tool which makes it possible to automatically test XQuery programs. The tool is able to systematically generate XML instances (i.e., test cases) from a given XML schema. The number and type of instances are defined by the human tester. These instances are used to execute the given XQuery program. In addition, the tool makes it possible to provide a user-defined property to be tested against the output of the XQuery program. The property can be specified as a Boolean XQuery function. The tool is implemented as an oracle able to report whether the XQuery program passes the test, that is, whether all the test cases satisfy the property, as well as the number of test cases used for testing. If the XQuery program fails the test, the tool shows counterexamples found in the test cases. The tool has been implemented as an XQuery library, which allows it to be used from any XQuery interpreter.

Jesús M. Almendros-Jiménez, Antonio Becerra-Terón

Data Partitioning, Indexing

Frontmatter
Grid-File: Towards to a Flash Efficient Multi-dimensional Index

Spatial indexes are of great importance for multidimensional query processing. Traditional data structures have been optimized for magnetic disks in the storage layer. In recent years flash solid state disks have been widely adopted as a result of their exceptional features. However, the specifics of flash memory (asymmetric read/write speeds, erase-before-update, wear-out) introduce new challenges. Algorithms and data structures designed for magnetic disks experience reduced performance on flash. Most research efforts on flash-aware spatial indexes concern the R-tree and its variants. Distinguishing ourselves from previous works, we investigate the performance of the Grid File on flash and highlight constraints and opportunities towards a flash-efficient Grid File. We conducted experiments on mainstream and high-performance SSD devices, and the Grid File outperforms the R$$^*$$-tree in all cases.

Athanasios Fevgas, Panayiotis Bozanis
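
For readers unfamiliar with the structure under test, a minimal textbook-style Grid File lookup might look as follows (linear scales per dimension, a directory of cells, one cell touched per point query); the flash-specific tuning studied in the paper is not modeled here.

```python
# A point query locates exactly one directory cell via binary search on
# the linear scales, so on flash it translates to a single bucket read.

import bisect

class GridFile:
    def __init__(self, x_scale, y_scale):
        # Linear scales: the split points of each dimension.
        self.xs, self.ys = x_scale, y_scale
        cols, rows = len(x_scale) + 1, len(y_scale) + 1
        self.directory = [[[] for _ in range(rows)] for _ in range(cols)]

    def _cell(self, x, y):
        return bisect.bisect(self.xs, x), bisect.bisect(self.ys, y)

    def insert(self, x, y, payload):
        i, j = self._cell(x, y)
        self.directory[i][j].append(((x, y), payload))

    def point_query(self, x, y):
        i, j = self._cell(x, y)
        return [p for (pt, p) in self.directory[i][j] if pt == (x, y)]

gf = GridFile(x_scale=[10, 20], y_scale=[50])
gf.insert(5, 40, "a")
gf.insert(15, 60, "b")
print(gf.point_query(15, 60))   # ['b']
```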
Supporting Fluctuating Transactional Workload

This work deals with fluctuating workloads, as in social applications where users interact with each other in a temporary fashion. The data on which a user group focuses form a bundle and can cause a peak if the frequency of interactions as well as the number of users is high. To manage such a situation, one solution is to partition the data and/or to move them to a more powerful machine while ensuring consistency and effectiveness. However, two problems arise: how to partition the data in an efficient way, and how to determine which part of the data to move so that related data are located on a single site. To achieve this goal, we track the formation of bundles and their evolution and measure their related load, for two reasons: (1) to be able to partition data based on how they are required by user interactions; and (2) to assess whether a machine is still capable of executing the transactions linked to a bundle with a bounded latency. The main gain of our approach is to minimize the number of machines used while maintaining low latency at a low cost.

Ibrahima Gueye, Idrissa Sarr, Hubert Naacke, Joseph Ndong
Indexing Multi-dimensional Data in Modular Data Centers

Providing efficient multi-dimensional indexing is critically important for improving the overall performance of a cloud storage system. To achieve an efficient querying service, the indexing scheme should guarantee lower routing cost and fewer false positives. In this paper, we propose RB-Index, a distributed multi-dimensional indexing scheme for modular data centers with a Bcube topology. RB-Index is a two-layer indexing scheme which integrates a Bcube-based routing protocol and R-tree-based indexing technology. In its lower layer, each server in the network indexes its local data with an R-tree, while in the upper layer the global index is distributed across different servers in the network. Based on the characteristics of Bcube, we build several indexing spaces and propose a way to map servers into them. The dimensions of these indexing spaces are dynamically selected according to both the data distribution and the query habits. Index construction and query algorithms are also introduced. We simulate a three-level Bcube to evaluate the efficiency of our indexing scheme and compare the performance of RB-Index with RT-CAN, a similar design for P2P networks.

Libo Gao, Yatao Zhang, Xiaofeng Gao, Guihai Chen

Data Mining IV, Applications

Frontmatter
A Modified Tripartite Model for Document Representation in Internet Sociology

Seven years ago Peter Mika (Yahoo! Research) proposed a tripartite model of actors, concepts and instances for document representation in the study of social networks. We propose a modified model, where instead of document authors we consider textual mentions of persons and institutions as actors. This representation proves to be more appropriate for the solution of a range of Internet Sociology tasks. In the paper we describe experiments with the modified model and provide some background on the tools that can be used to build it. The model is tested on the experimental corpora of Russian news (educational domain). The research reflects the pilot study findings.

Mikhail Alexandrov, Vera Danilova, Xavier Blanco
Improving Financial Time Series Prediction Through Output Classification by a Neural Network Ensemble

One topic of great interest in the literature is time series prediction. This kind of prediction, however, does not always have to provide the exact future values: in some cases, knowing only the future tendency of the time series is enough. In this paper, we propose a neural network ensemble that receives as input the last values of a time series and returns not its future values but a prediction that indicates whether the next value will rise or fall. We perform exhaustive experiments to analyze our method using time series extracted from the North American stock market, and evaluate the hit rate and the profit that could be obtained by performing the operations recommended by the system. Evaluation results show capital increases of up to 56 %.

Felipe Giacomel, Adriano C. M. Pereira, Renata Galante
Mining Strongly Correlated Intervals with Hypergraphs

Correlation is an important statistical measure for estimating dependencies between numerical attributes in multivariate datasets. Previous correlation discovery algorithms are mostly dedicated to finding piecewise correlations between the attributes. Other research efforts, such as correlation preserving discretization, can find strongly correlated intervals through a discretization process while preserving correlation. However, discretization-based methods suffer from some fundamental problems, such as information loss and crisp boundaries. In this paper, we propose a novel method to discover strongly correlated intervals from numerical datasets without using discretization. We propose a hypergraph model to capture the underlying correlation structure in multivariate numerical data and a corresponding algorithm to discover strongly correlated intervals from the hypergraph model. Strongly correlated intervals can be found even when the corresponding attributes are only weakly correlated or uncorrelated. Experimental results on a health social network dataset show the effectiveness of our algorithm.

Hao Wang, Dejing Dou, Yue Fang, Yongli Zhang

WWW and Databases

Frontmatter
Diagnosis Model for an Organization Based on Social Network Analysis

Human resources are one of the most important elements of an organization. Previous studies focused on analyzing the character of members based on sociological instruments, e.g. questionnaires and interviews, to diagnose an organization. However, answering detailed reports is a time-consuming process, has potential for concealment, and struggles to reveal organization-level problems. We propose a multi-facet diagnosis model based on both sociological questionnaires and social network analysis to reveal relationships among organization members. We present a lightweight questionnaire guided by psychological studies and propose SNA-based algorithms to detect specific members, such as those at risk and leaders, which are meaningful when diagnosing an organization. Experimental results show that the proposed method covers the core of psychological diagnosis results and can identify specific members.

Dongwook Park, Soungwoong Yoon, Sanghoon Lee
Integration Method for Complex Queries Based on Hyponymy Relations

Users often find it difficult to build suitable Web search queries even if they can explain the target of their search. In this situation, users must create complex queries using Boolean search operators such as “AND” and “NOT” by trial and error. However, complex queries often narrow down search results to a great extent. Therefore, we consider that users would like to replace their queries with simple queries that can retrieve the same target as complex queries. In this paper, we propose an integration method for complex queries based on hyponymy relations. We retrieve the same target as a complex query using a keyword that represents a narrower concept of one of the keywords in a complex query. Using this method, users can use understandable, simple queries and obtain many Web search results that have the same target as the complex query.

Daisuke Kitayama, Takuma Matsumoto
An Expertise-Based Framework for Supporting Enterprise Applications Development

Currently, Web mashups are becoming more and more popular in organizations and enterprises aiming to implement applications based on third-party software components. These components may offer sophisticated functionalities and access to highly valuable data sources through Web APIs. However, developing a Web mashup may require rather specialized knowledge about specific Web APIs, their technological features and how to integrate them. In a large organization, the knowledge required to implement a mashup may be available but distributed among different developers who are not easy to identify and assess. To this purpose, we propose a framework and a software tool for searching for experts inside the organization who possess valuable knowledge about specific Web APIs and the way to integrate them meaningfully. Retrieved experts are ranked based on: (i) their expertise level on the specific request, and (ii) their social distance from the developer who issued the request. The approach integrates knowledge both internal and external to the organization, represented as linked data. We include a preliminary evaluation based on an implementation of the framework.

Devis Bianchini, Valeria De Antonellis, Michele Melchiori

Data Management Algorithms

Frontmatter
A Linear Program for Holistic Matching: Assessment on Schema Matching Benchmark

Schema matching is a key task in several applications such as data integration and ontology engineering. All application fields require the matching of several schemas, also known as "holistic matching", but the difficulty of the problem has drawn much more attention to pairwise schema matching than to the holistic variant. In this paper, we propose a new approach for holistic matching. We suggest modelling the problem with techniques borrowed from the combinatorial optimization field. We propose a linear program, named LP4HM, which extends the maximum-weighted graph matching problem with different linear constraints. The latter encompass matching setup constraints, especially cardinality and threshold constraints, and schema structural constraints, especially superclass/subclass and coherence constraints. The matching quality of LP4HM is evaluated on a recent benchmark dedicated to assessing schema matching tools. Experiments show competitive results compared to other tools, in particular for recall and HSR quality measures.

Alain Berro, Imen Megdiche, Olivier Teste
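
For reference, the maximum-weighted graph matching core that the abstract says LP4HM extends can be written as a linear program like the one below; the threshold constraint is shown as an example, while LP4HM's structural (superclass/subclass, coherence) constraints are not reproduced here.

```latex
\begin{align*}
\max\quad & \sum_{(i,j)\in E} w_{ij}\,x_{ij} \\
\text{s.t.}\quad
 & \textstyle\sum_{j} x_{ij} \le 1 \quad \forall i && \text{(cardinality, schema 1)}\\
 & \textstyle\sum_{i} x_{ij} \le 1 \quad \forall j && \text{(cardinality, schema 2)}\\
 & x_{ij} = 0 \quad \text{whenever } w_{ij} < \tau && \text{(similarity threshold)}\\
 & 0 \le x_{ij} \le 1 && \text{(LP relaxation)}
\end{align*}
```

Here $x_{ij}$ selects a correspondence between element $i$ of one schema and element $j$ of the other, and $w_{ij}$ is their similarity.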
An External Memory Algorithm for All-Pairs Regular Path Problem

In this paper, we consider solving the all-pairs regular path problem on large graphs efficiently. Let G be a graph and r be a regular path query, and consider finding the answers of r on G. If G is so small that it fits in main memory, it suffices to load the entire graph G into main memory and traverse G to find paths matching r. However, if G is too large and cannot fit in main memory, we need another approach. In this paper, we propose an external memory algorithm for solving the all-pairs regular path problem on large graphs. Our algorithm finds the answers matching r by scanning the node list of G sequentially, which avoids random accesses to disk and thus makes regular path query processing I/O efficient.

Nobutaka Suzuki, Kosetsu Ikeda, Yeondae Kwon
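
The product-style evaluation that underlies regular path queries can be illustrated in memory as below; the sketch assumes the query is given as an NFA and omits the paper's actual contribution, the disk-resident sequential scan.

```python
# Pairs (graph node, NFA state) are expanded until fixpoint; (u, v) is an
# answer when an accepting state is reached at v starting from (u, start).

from collections import deque

def regular_path_answers(nodes, edges, nfa, start, accepting):
    """nodes: node list; edges: {u: [(label, v)]}; nfa: {(state, label): state}."""
    answers = set()
    for u in nodes:                  # the paper scans this list on disk
        seen = {(u, start)}
        queue = deque([(u, start)])
        while queue:
            node, state = queue.popleft()
            if state in accepting:
                answers.add((u, node))
            for label, nxt in edges.get(node, []):
                nstate = nfa.get((state, label))
                if nstate is not None and (nxt, nstate) not in seen:
                    seen.add((nxt, nstate))
                    queue.append((nxt, nstate))
    return answers

# Query r = a.b* over a small labeled graph.
nodes = [1, 2, 3]
edges = {1: [("a", 2)], 2: [("b", 3)], 3: [("b", 3)]}
nfa = {(0, "a"): 1, (1, "b"): 1}     # NFA for a.b*; state 1 accepts
print(sorted(regular_path_answers(nodes, edges, nfa, start=0, accepting={1})))
# [(1, 2), (1, 3)]
```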

Special Section Globe 2015 – MapReduce Framework: Load Balancing, Optimization and Classification

Frontmatter
An Efficient Solution for Processing Skewed MapReduce Jobs

Although MapReduce has been praised for its high scalability and fault tolerance, it has also been criticized on some points, in particular its poor performance in the case of data skew. There are important cases where a high percentage of the processing on the reduce side is done by a few nodes, or even one node, while the others remain idle. There have been some attempts to address the problem of data skew, but only for specific cases. In particular, no solution has been proposed for the cases where most of the intermediate values correspond to a single key, or where the number of keys is smaller than the number of reduce workers.

In this paper, we propose FP-Hadoop, a system that makes the reduce side of MapReduce more parallel and efficiently deals with the problem of data skew in the reduce side. In FP-Hadoop, there is a new phase, called intermediate reduce (IR), in which blocks of intermediate values, constructed dynamically, are processed by intermediate reduce workers in parallel, by using a scheduling strategy. By using the IR phase, even if all intermediate values belong to only one key, the main part of the reducing work can be done in parallel, using the computing resources of all available workers. We implemented a prototype of FP-Hadoop, and conducted extensive experiments over synthetic and real datasets. We achieved excellent performance gains compared to native Hadoop, e.g. more than 10 times in reduce time and 5 times in total execution time.

Reza Akbarinia, Miguel Liroz-Gistau, Divyakant Agrawal, Patrick Valduriez
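
The effect of an intermediate reduce phase is easy to simulate when the reduce function is associative, as in the assumed sum below; this toy mimics FP-Hadoop's idea with a thread pool, not its actual Hadoop implementation.

```python
# The values of a single hot key are cut into blocks that several
# workers reduce in parallel; a final reduce merges the partial results.

from multiprocessing.dummy import Pool   # thread pool, standing in for workers

def intermediate_reduce(block):
    return sum(block)                    # per-block partial aggregation

def final_reduce(partials):
    return sum(partials)

values = list(range(1_000_000))          # all values belong to ONE skewed key
block_size = 100_000
blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]

with Pool(4) as pool:                    # 4 'intermediate reduce workers'
    partials = pool.map(intermediate_reduce, blocks)

assert final_reduce(partials) == sum(values)
```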
MapReduce-DBMS: An Integration Model for Big Data Management and Optimization

The data volume and the multitude of sources pose an exponential number of technical and application challenges. In the past, Big Data solutions were presented as a replacement for parallel database management systems. However, Big Data solutions can be seen as a complement to an RDBMS for analytical applications, because different problems require the complex analysis capabilities provided by both technologies. The aim of this work is to integrate a Big Data solution and a classic DBMS with the goal of query optimization. We propose a model for processing OLAP queries, and we then validate the proposed optimized model through experiments showing the gain in execution cost.

Dhouha Jemal, Rim Faiz, Ahcène Boukorca, Ladjel Bellatreche
A Travel Time Prediction Algorithm Using Rule-Based Classification on MapReduce

Recently, the amount of trajectory data has been rapidly increasing with the popularity of LBS and the development of mobile technology. Thus, the analysis of trajectory patterns in large amounts of trajectory data has attracted much interest. To improve the quality of trajectory-based services, it is essential to predict an exact travel time for a given query on road networks. One of the typical schemes for travel time prediction is rule-based classification, which can ensure high accuracy. However, the existing scheme is inadequate for processing massive data because it was designed without consideration of distributed computing environments. To solve this problem, this paper proposes a travel time prediction algorithm using rule-based classification on MapReduce for large amounts of trajectory data. First, our algorithm generates classification rules based on actual traffic statistics and measures adequate velocity classes for each road segment. Second, our algorithm generates a distributed index by using a grid-based map partitioning method. Our algorithm can reduce the query processing cost because it only retrieves the grid cells which contain the query region, instead of the entire road network. Furthermore, it can reduce the query processing time by estimating the travel time for each segment of a given query in parallel. Finally, our performance analysis shows that our scheme performs more accurate travel time prediction than the existing algorithms.

HyunJo Lee, Seungtae Hong, Hyung Jin Kim, Jae-Woo Chang

Special Section Globe 2015 – Security, Data Privacy and Consistency

Frontmatter
Numerical SQL Value Expressions Over Encrypted Cloud Databases

Cloud databases often need client-side encryption. Encryption, however, impairs queries, especially those with numerical SQL value expressions. A fully homomorphic encryption scheme could suffice, but known schemes remain impractical. Partially homomorphic encryption suffices for specific expressions only. The additively homomorphic Paillier scheme appears the most practical. We propose homomorphic encryption for standard SQL expressions over a practical domain of positive values. The scheme uses a version of Paillier's formulae and auxiliary tables at the cloud that are conceptually the traditional mathematical tables. They tabulate encrypted log and antilog functions, and some others, over the domain. The choice of functions is extensible. We rewrite expressions with any number of SQL operators '*', '/', '^' and of standard aggregate functions so that they compute over encrypted data using the tables and Paillier's formulae only. All calculations occur at the cloud. We present our scheme and show its security, variants and practicality.

Sushil Jajodia, Witold Litwin, Thomas Schwarz
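
The two building blocks the abstract combines, Paillier's additive homomorphism and tabulated log/antilog functions, can be demonstrated with a deliberately insecure toy implementation (tiny primes; the tables are kept in plaintext here for brevity, whereas the paper tabulates them encrypted at the cloud).

```python
# (1) Additive homomorphism: Enc(a) * Enc(b) mod n^2 = Enc(a + b).
# (2) Multiplication over a small positive domain via scaled log tables:
#     a * b = antilog(log a + log b), so '+' (which Paillier supports)
#     can stand in for '*'.

import math, random

def keygen(p=1789, q=1867):              # toy primes; real keys are 2048-bit
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)                 # valid because we use g = n + 1
    return (n, n + 1), (lam, mu)

def encrypt(pk, m):
    n, g = pk
    r = random.randrange(1, n)
    return pow(g, m, n * n) * pow(r, n, n * n) % (n * n)

def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n

pk, sk = keygen()

# (1) Sums are computed directly on ciphertexts.
c = encrypt(pk, 20) * encrypt(pk, 22) % (pk[0] ** 2)
assert decrypt(pk, sk, c) == 42

# (2) Products via log/antilog tables over the domain 1..999.
SCALE = 10**6
log_table = {v: round(math.log(v) * SCALE) for v in range(1, 1000)}
antilog = {round(math.log(v) * SCALE): v for v in range(1, 1000)}

def mul_via_logs(a, b):
    s = log_table[a] + log_table[b]      # an addition Paillier can perform
    return antilog.get(s) or round(math.exp(s / SCALE))

assert mul_via_logs(6, 7) == 42
```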
A Privacy-Aware Framework for Decentralized Online Social Networks

Online social networks based on a single service provider suffer several drawbacks, first of all the privacy issues arising from the delegation of user data to a single entity. Distributed online social networks (DOSN) have been recently proposed as an alternative solution allowing users to keep control of their private data. However, the lack of a centralized entity introduces new problems, like the need of defining proper privacy policies for data access and of guaranteeing the availability of user’s data when the user disconnects from the social network. This paper introduces a privacy-aware support for DOSN enabling users to define a set of privacy policies which describe who is entitled to access the data in their social profile. These policies are exploited by the DOSN support to decide the re-allocation of the profile when the user disconnects from the social network. The proposed approach is validated through a set of simulations performed on real traces logged from Facebook.

Andrea De Salve, Paolo Mori, Laura Ricci
CaLibRe: A Better Consistency-Latency Tradeoff for Quorum Based Replication Systems

In multi-writer, multi-reader systems, data consistency is ensured by the number of replica nodes contacted during read and write operations. Contacting a sufficient number of nodes in order to ensure data consistency comes with a communication cost and a risk to data availability. In this paper, we describe an enhancement of a consistency protocol called LibRe, which ensures consistency by contacting a minimum number of replica nodes. Porting the idea of achieving consistent reads with the help of registry information from the original protocol, the enhancement integrates and distributes the registry inside the storage system in order to achieve better performance.

We propose an initial implementation of the model inside the Cassandra distributed data store, and the performance of the LibRe incarnation is benchmarked against Cassandra's native consistency options ONE, ALL and QUORUM. The test results show that with the LibRe protocol an application experiences a similar number of stale reads compared to the strong consistency options offered by Cassandra, while achieving lower latency and similar availability.

Sathiya Prabhu Kumar, Sylvain Lefebvre, Raja Chiky, Eric Gressier-Soudan
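
For context, the baseline quorum rule that protocols like LibRe try to improve on is simple arithmetic: a read quorum of size R and a write quorum of size W over N replicas are guaranteed to overlap only when R + W > N. A one-line helper makes the tradeoff explicit; LibRe's registry mechanism itself is not modeled here.

```python
def quorum_is_consistent(n: int, r: int, w: int) -> bool:
    """True iff every read quorum intersects every write quorum."""
    return r + w > n

N = 3
print(quorum_is_consistent(N, r=1, w=1))  # False: ONE/ONE may read stale data
print(quorum_is_consistent(N, r=2, w=2))  # True:  QUORUM/QUORUM always overlaps
print(quorum_is_consistent(N, r=1, w=3))  # True:  a ONE read after an ALL write
```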

Special Section Globe 2015 – Query Rewriting and Streaming

QTor: A Flexible Publish/Subscribe Peer-to-Peer Organization Based on Query Rewriting

Peer-to-peer publish/subscribe architectures are an interesting support for scalable distributed data stream applications. Most approaches, often based on brokers, have a static organization which does not adapt well to different configurations of the participants' capacities. We present QTor (Query Torrent), a generic organization that enables dynamic adaptation, providing a continuum from centralized to fully decentralized solutions. Based on query rewriting and equivalence, QTor proposes a definition of communities and their relations that decouples the logical and physical aspects of the problem, while efficiently reducing organizational and functional costs.

Sébastien Dufromentel, Sylvie Cazalens, François Lesueur, Philippe Lamarre
Model for Performance Analysis of Distributed Stream Processing Applications

Nowadays, a lot of data is produced every second and needs to be processed immediately. Processing such unbounded streams of data is often done in a distributed environment in order to achieve high throughput. A key challenge is to predict the performance-related characteristics of such applications. Knowledge of these properties is essential for decisions about the amount of computational resources needed, how the computations should be spread across the distributed environment, etc.

In this paper, we propose a model to represent such streaming applications with respect to their performance-related properties. We present a conversion of the model to Colored Petri Nets (CPNs), which is used for performance analysis of the original application. The behavior of the proposed model and its conversion to CPNs are validated through experiments. Our prediction achieved nearly 100 % precision on the maximum delays of real stream processing applications.

Filip Nalepa, Michal Batko, Pavel Zezula
Backmatter
Metadata
Title
Database and Expert Systems Applications
Edited by
Qiming Chen
Abdelkader Hameurlain
Farouk Toumani
Roland Wagner
Hendrik Decker
Copyright year
2015
Electronic ISBN
978-3-319-22852-5
Print ISBN
978-3-319-22851-8
DOI
https://doi.org/10.1007/978-3-319-22852-5