
2012 | Book

Database and Expert Systems Applications

23rd International Conference, DEXA 2012, Vienna, Austria, September 3-6, 2012. Proceedings, Part II

Edited by: Stephen W. Liddle, Klaus-Dieter Schewe, A Min Tjoa, Xiaofang Zhou

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

This two-volume set (LNCS 7446 and LNCS 7447) constitutes the refereed proceedings of the 23rd International Conference on Database and Expert Systems Applications, DEXA 2012, held in Vienna, Austria, September 3-6, 2012.
The 49 revised full papers presented together with 37 short papers and 2 keynote talks were carefully reviewed and selected from 179 submissions. These papers discuss a range of topics including: database query processing, in particular XML queries; labelling of XML documents; computational efficiency; data extraction; personalization, preferences, and ranking; security and privacy; database schema evaluation and evolution; semantic Web; privacy and provenance; data mining; data streaming; distributed systems; searching and query answering; structuring, compression and optimization; failure, fault analysis, and uncertainty; prediction, extraction, and annotation; ranking and personalization; database partitioning and performance measurement; recommendation and prediction systems; business processes; social networking.

Table of Contents

Frontmatter

Query Processing I

Consistent Query Answering Using Relational Databases through Argumentation

This paper introduces a framework that integrates a reasoner based on defeasible argumentation with a large information repository backed by one or several relational databases. In our scenario, we assume that the databases involved are updated by external, independent applications, possibly introducing inconsistencies in a particular database or leading to inconsistency among the subset of databases that refer to the same data. Argumentation reasoning contributes the possibility of obtaining consistent answers from an information repository with these properties. We present the Database Integration for Defeasible Logic Programming (DBI-DeLP) framework, which enables commonsense reasoning based on Defeasible Logic Programming (DeLP) by extending the system's capabilities to handle large amounts of data and providing consistent answers to queries posed to it.

Cristhian A. D. Deagustini, Santiago E. Fulladoza Dalibón, Sebastián Gottifredi, Marcelo A. Falappa, Guillermo R. Simari
Analytics-Driven Lossless Data Compression for Rapid In-situ Indexing, Storing, and Querying

The analysis of scientific simulations is highly data-intensive and is becoming an increasingly important challenge. Peta-scale data sets require the use of light-weight query-driven analysis methods, as opposed to heavy-weight schemes that optimize for speed at the expense of size. This paper takes a step toward query processing over losslessly compressed scientific data. We propose a co-designed double-precision compression and indexing methodology for range queries by performing unique-value-based binning on the most significant bytes of double precision data (sign, exponent, and most significant mantissa bits), and inverting the resulting metadata to produce an inverted index over a reduced data representation. Without the inverted index, our method matches or improves compression ratios over both general-purpose and floating-point compression utilities. The inverted index is light-weight, and the overall storage requirement for both reduced column and index is less than 135%, whereas existing DBMS technologies can require 200-400%. As a proof of concept, we evaluate univariate range queries that additionally return column values, a critical component of data analytics, against state-of-the-art bitmap indexing technology, showing multi-fold query performance improvements.
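To make the binning idea concrete, here is a minimal Python sketch (not the authors' implementation; the two-byte bin width and the dictionary-based index are assumptions, and a real system would recheck exact values in the boundary bins):

```python
import struct
from collections import defaultdict

def bin_key(value, n_bytes=2):
    """The n most significant bytes of the IEEE-754 big-endian encoding:
    sign, exponent, and the leading mantissa bits."""
    return struct.pack(">d", value)[:n_bytes]

def build_index(column, n_bytes=2):
    """Inverted index: unique high-byte bin -> row ids falling in it."""
    index = defaultdict(list)
    for row_id, value in enumerate(column):
        index[bin_key(value, n_bytes)].append(row_id)
    return index

column = [0.152, 0.157, 3.7, 1.0e9]
index = build_index(column)
lo, hi = bin_key(0.1), bin_key(0.2)   # non-negative values keep byte order
hits = [rid for key, rids in index.items() if lo <= key <= hi for rid in rids]
print(hits)  # [0, 1]; boundary bins would need an exact-value recheck
```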

John Jenkins, Isha Arkatkar, Sriram Lakshminarasimhan, Neil Shah, Eric R. Schendel, Stephane Ethier, Choong-Seock Chang, Jacqueline H. Chen, Hemanth Kolla, Scott Klasky, Robert Ross, Nagiza F. Samatova

Prediction, Extraction, and Annotation

Prediction of Web User Behavior by Discovering Temporal Relational Rules from Web Log Data

The Web has become a very popular and interactive medium in our lives. With the rapid development and proliferation of e-commerce and Web-based information systems, web mining has become an essential tool for discovering specific information on the Web, and many web mining techniques have been proposed. In this paper, an approach based on temporal interval relational rule mining is applied to discover knowledge from web log data. In contrast to previous web mining techniques, our approach takes the timestamp attribute of web log data into account. First, temporal intervals of web page accesses are formed by folding over a periodicity; then relational rules are discovered subject to the constraints of these temporal intervals. In our experiments, we analyze the resulting relational rules and the effect of the important parameters used in the mining approach.
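A minimal sketch of the folding step, assuming a daily periodicity and hypothetical log entries; the slot width and tuple layout are illustrative only:

```python
from datetime import datetime

# Hypothetical web log entries: (page, ISO timestamp of the access).
log = [("index.html", "2012-09-03T08:15:00"),
       ("news.html",  "2012-09-04T08:40:00")]

def fold_daily(ts, slot_minutes=60):
    """Fold a timestamp over a 24-hour periodicity into a slot,
    discarding the date: accesses at the same hour on different
    days land in the same temporal interval."""
    t = datetime.fromisoformat(ts)
    minute_of_day = t.hour * 60 + t.minute
    start = minute_of_day - minute_of_day % slot_minutes
    return (start, start + slot_minutes)   # minutes-of-day interval

intervals = sorted({(page, fold_daily(ts)) for page, ts in log})
print(intervals)
# [('index.html', (480, 540)), ('news.html', (480, 540))]
```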

Xiuming Yu, Meijing Li, Incheon Paik, Keun Ho Ryu
A Hybrid Approach to Text Categorization Applied to Semantic Annotation

In this paper, a hybrid approach is presented as a new technique for text categorization, based on machine learning techniques such as the Vector Space Model combined with n-grams. Given some content, this technique chooses, from a set of categories, the one that best matches it. The technique is used in FLERSA, an annotation tool for web content. The hybrid approach gives FLERSA the capability to define semantic annotations automatically, determining the concepts that the content of a web document deals with.

José Luis Navarro-Galindo, José Samos, M. José Muñoz-Alférez
An Unsupervised Framework for Topological Relations Extraction from Geographic Documents

In this paper, we face the problem of extracting spatial relationships between geographical entities mentioned in textual documents. This is part of a research project which aims at geo-referencing document contents, hence making the realization of a Geographical Information Retrieval system possible. The driving factor of this research is the huge number of Web documents which mention geographic places and relate them spatially. Several approaches have been proposed for the extraction of spatial relationships. However, they all assume the availability of either a large set of manually annotated documents or complex hand-crafted rules; in both cases, a rather tedious and time-consuming activity is required of domain experts. We propose an alternative approach based on the combined use of a spatial ontology, which defines the topological relationships (classes) to be identified within text, and a nearest-prototype classifier, which helps to recognize instances of the topological relationships. This approach is unsupervised, so it does not need annotated data. Moreover, it is based on an ontology, which avoids the hand-crafting of ad hoc rules. Experimental results on real datasets show the viability of this approach.
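The nearest-prototype rule itself is simple; a toy Python sketch, where the two-dimensional feature vectors and prototype values are invented stand-ins for whatever representation the paper derives from the ontology:

```python
import numpy as np

# Hypothetical prototype vectors for topological relation classes,
# e.g., embeddings of seed phrases taken from a spatial ontology.
prototypes = {
    "contains": np.array([0.9, 0.1]),
    "adjacent": np.array([0.1, 0.8]),
}

def classify(vector):
    """Nearest-prototype rule: assign the class whose prototype is
    closest to the candidate relation's feature vector."""
    return min(prototypes,
               key=lambda c: np.linalg.norm(prototypes[c] - vector))

print(classify(np.array([0.2, 0.7])))  # 'adjacent'
```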

Corrado Loglisci, Dino Ienco, Mathieu Roche, Maguelonne Teisseire, Donato Malerba

Failure, Fault Analysis, and Uncertainty

Combination of Machine-Learning Algorithms for Fault Prediction in High-Precision Foundries

Foundry is one of the activities that has contributed to the evolution of society; however, the manufacturing process is still carried out in the same manner as it was many years ago. Consequently, several defects may appear in castings once the production process is finished. One of the most difficult defects to detect is the microshrinkage: tiny porosities that appear inside the casting. Another important aspect that foundries have to control is the set of attributes that measure the ability of the casting to withstand loads and tensions, also called mechanical properties. Both cases need specialised staff and expensive machines to test the castings and, in the second case, also destructive inspections that render the casting invalid. The solution is to model the foundry process and apply machine learning techniques to foresee the state of the casting before its production. In this paper we extend our previous research and propose a general method to foresee all these defects by building a meta-classifier that combines different methods, without the need to select the best algorithm for each defect or the available data. Finally, we compare the results obtained, showing that the new approach achieves better accuracy and error rates when foretelling microshrinkages and the values of mechanical properties.
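The paper's precise combination scheme is not described here; as one standard way to build such a meta-classifier, a soft-voting ensemble over heterogeneous learners can be sketched with scikit-learn (synthetic data; the chosen base learners are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Stand-in data for casting features -> defect present (1) or not (0).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Combine heterogeneous base learners into one meta-classifier, so no
# single "best" algorithm has to be picked per defect type.
meta = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB()),
                ("dt", DecisionTreeClassifier(random_state=0))],
    voting="soft")  # average the predicted class probabilities
print(meta.fit(X[:250], y[:250]).score(X[250:], y[250:]))
```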

Javier Nieves, Igor Santos, Pablo G. Bringas
A Framework for Conditioning Uncertain Relational Data

We propose a framework for representing conditioned probabilistic relational data. In this framework the existence of tuples in possible worlds is determined by Boolean expressions composed from elementary events. The probability of a possible world is computed from the probabilities associated with these elementary events. In addition, a set of global constraints conditions the database. Conditioning is the formalization of the process of adding knowledge to a database. Some worlds may be impossible given the constraints and the probabilities of possible worlds are accordingly re-defined. The new constraints can come from the observation of the existence or non-existence of a tuple, from the knowledge of a specific rule, such as the existence of an exclusive set of tuples, or from the knowledge of a general rule, such as a functional dependency. We are therefore interested in computing a concise representation of the possible worlds and their respective probabilities after the addition of new constraints, namely an equivalent probabilistic database instance without constraints after conditioning. We devise and present a general algorithm for this computation. Unfortunately, the general problem involves the simplification of general Boolean expressions and is NP-hard. We therefore identify specific practical families of constraints for which we devise and present efficient algorithms.
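A brute-force illustration of conditioning, assuming two independent elementary events and hypothetical tuple conditions: enumerate the possible worlds, drop those violating the constraint, and renormalize:

```python
from itertools import product

# Hypothetical independent elementary events and their probabilities.
events = {"e1": 0.7, "e2": 0.4}
# Tuple existence as Boolean conditions over an event assignment w.
tuples = {"t1": lambda w: w["e1"],
          "t2": lambda w: w["e1"] and not w["e2"]}

def condition(constraint):
    """Enumerate possible worlds, drop those violating the constraint,
    and renormalize the probabilities of the surviving worlds."""
    dist = {}
    for bits in product([True, False], repeat=len(events)):
        w = dict(zip(events, bits))
        p = 1.0
        for e, pe in events.items():
            p *= pe if w[e] else 1.0 - pe
        if constraint(w):
            key = frozenset(t for t, holds in tuples.items() if holds(w))
            dist[key] = dist.get(key, 0.0) + p
    total = sum(dist.values())
    return {k: v / total for k, v in dist.items()}

# Condition on observing that tuple t1 exists:
print(condition(lambda w: tuples["t1"](w)))
# world {t1, t2} gets probability 0.6, world {t1} gets 0.4
```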

Ruiming Tang, Reynold Cheng, Huayu Wu, Stéphane Bressan
Cause Analysis of New Incidents by Using Failure Knowledge Database

Root cause analysis of failed projects and incidents is an important and necessary step in working out measures to prevent their recurrence. In this paper, to better analyze the causes of failed projects and incidents, we propose a novel topic-document-cause (TDC) model that reveals the corresponding relationships among topics, documents, and causes. We use the JST failure knowledge base to construct a TDC model with machine learning methods such as LDA and perceptrons. The experimental results show that our approach performs better at discovering the causes of failures of projects and incidents.

Yuki Awano, Qiang Ma, Masatoshi Yoshikawa

Ranking and Personalization

Modeling and Querying Context-Aware Personal Information Spaces

Personal information management (PIM) is the practice and analysis of the activities people perform to acquire, organize, maintain, and retrieve information for everyday use. Personal information is characterized by its heterogeneity, dispersion, and redundancy. In this paper, we focus on the development of a user-based model that helps users create a data model suited to their needs. The meta-model that we propose allows users to create personal information and organize it according to different points of view (ontologies) and different contexts. Contextual queries are defined to allow users to retrieve their personal information depending on context valuation.

Rania Khéfifi, Pascal Poizat, Fatiha Saïs
Ontology-Based Recommendation Algorithms for Personalized Education

This paper presents recommendation algorithms that personalize course and curriculum content for individual students, within the broader scope of the Pervasive Cyberinfrastructure for Personalizing Learning and Instructional Support (PERCEPOLIS). The context considered in making recommendations includes the academic background, interests, and computing environment of the student, as well as past recommendations made to students with similar profiles. Context provision, interpretation, and management are the services that facilitate consideration of this information. Context is modeled through a two-level hierarchy of generic and domain ontologies, respectively, which reduces the reasoning search space. Imprecise query support increases the flexibility of the recommendation engine by allowing the interpretation of context provided in terms equivalent, but not necessarily identical, to the system's database access terms. The relevance of the recommendations is increased by using both individual and collaborative filtering. Correct operation of the algorithms has been verified through prototyping.

Amir Bahmani, Sahra Sedigh, Ali Hurson
Towards Quantitative Constraints Ranking in Data Clustering

Expressing expert expectations in the form of boolean constraints during the data clustering process is a promising way to improve the quality of the generated clusters. However, in some real problems, an explosion in the volume of the processed data and their related constraints overwhelms the expert. In this paper, we aim to explicitly formulate the expert's preferences for supervising the clustering mechanism by injecting their degree of interest in constraints using scoring functions. To this end, we introduce our algorithm $\mathcal{SHAQAR}$ for quantitative ranking of constraints during data clustering. An intensive experimental evaluation, carried out on OLAP query logs collected from a financial data warehouse, shows that $\mathcal{SHAQAR}$ outperforms the pioneering algorithms, i.e., Klein et al.'s algorithm.

Eya Ben Ahmed, Ahlem Nabli, Faïez Gargouri
A Topic-Oriented Analysis of Information Diffusion in a Blogosphere

A blogosphere is a representative online social network established through blog users and their relationships. Understanding information diffusion is very important in developing successful business strategies for a blogosphere. In this paper, we discuss how to predict information diffusion in a blogosphere. In reality, documents diffused over a blogosphere deal with information on different topics. However, previous studies of information diffusion did not consider the information topic in their analysis, which leads to low accuracy in predictions. In this paper, we propose a topic-oriented model to accurately predict information diffusion in a blogosphere. We also define four primary factors associated with topic-oriented diffusion, and propose a method to assign a diffusion probability between blog users for each topic by using regression analysis based on these four factors. Finally, we show the effectiveness of the proposed model through a series of experiments.
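A minimal sketch of the per-topic regression step, with invented factor values; the actual four factors and the regression variant used by the authors are not specified here:

```python
from sklearn.linear_model import LinearRegression

# Hypothetical training rows: four per-topic factors for a pair of
# bloggers (e.g., interest overlap, past interactions, ...) and the
# observed diffusion frequency for that topic.
X = [[0.9, 0.7, 0.3, 0.5],
     [0.1, 0.0, 0.2, 0.4],
     [0.8, 0.6, 0.5, 0.9]]
y = [0.85, 0.05, 0.90]

model = LinearRegression().fit(X, y)
# Predicted diffusion probability for a new blogger pair on this topic,
# clamped to [0, 1] since plain linear regression is unbounded.
p = model.predict([[0.7, 0.5, 0.4, 0.6]])[0]
print(min(max(p, 0.0), 1.0))
```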

Kyu-Hwang Kang, Seung-Hwan Lim, Sang-Wook Kim, Min-Hee Jang, Byeong-Soo Jeong

Searching I

Trip Tweets Search by Considering Spatio-temporal Continuity of User Behavior

A large number of tweets about user experiences such as trips appear on Twitter. These tweets are fragmented pieces of information and are not easy to share with other people as a whole experience. In this paper, we propose a novel method to find and organize such fragmented tweets at the level of user experiences. The notable feature of our method is that we find and organize tweets related to a certain trip experience by considering the spatio-temporal continuity of traveling behavior. First, we construct a co-occurrence dictionary that takes the spatio-temporal continuity into account; i.e., the co-occurrence ratio of two terms varies across time scopes and regions. Then, we use this dictionary to calculate the relatedness of a tweet to the trip experience from three aspects: content relatedness, temporal relatedness, and context relatedness. Tweets with high relatedness scores are returned as search results. The experimental results show that our method performs better than conventional keyword-based methods.

Keisuke Hasegawa, Qiang Ma, Masatoshi Yoshikawa
Incremental Cosine Computations for Search and Exploration of Tag Spaces

Tags are often used to describe user-generated content on the Web. However, the available Web applications do not deal with new tag information incrementally, which negatively influences their scalability. Since the cosine similarity between tags represented as co-occurrence vectors is an important aspect of these frameworks, we propose two approaches for an incremental computation of cosine similarities. The first approach recalculates the cosine similarity for new tag pairs and for existing tag pairs whose co-occurrences have changed. The second approach computes the cosine similarity between two tags by reusing, if available, the previous cosine similarity between these tags. Both approaches compute the same cosine values that would be obtained by a complete recalculation of the cosine similarities. The performed experiments show that our proposed approaches are between 1.2 and 23 times faster than a complete recalculation, depending on the number of co-occurrence changes and new tags.
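The first approach can be illustrated with a small cache of dot products and squared norms, so one co-occurrence change updates only the affected terms; this is a sketch of the idea, not the authors' implementation:

```python
import math
from collections import defaultdict

class IncrementalCosine:
    """Caches dot products and squared norms of tag co-occurrence
    vectors, so a single co-occurrence change costs O(#tags) cache
    updates instead of a full recomputation."""
    def __init__(self):
        self.vec = defaultdict(lambda: defaultdict(float))  # tag -> {ctx: count}
        self.sq_norm = defaultdict(float)                   # tag -> ||v||^2
        self.dot = defaultdict(float)                       # {a, b} -> <a, b>

    def update(self, tag, ctx, delta):
        old = self.vec[tag][ctx]
        self.sq_norm[tag] += 2 * old * delta + delta * delta
        for other in list(self.vec):
            if other != tag:  # only the coordinate `ctx` of `tag` moved
                key = tuple(sorted((tag, other)))
                self.dot[key] += delta * self.vec[other].get(ctx, 0.0)
        self.vec[tag][ctx] = old + delta

    def cosine(self, a, b):
        denom = math.sqrt(self.sq_norm[a] * self.sq_norm[b])
        return self.dot[tuple(sorted((a, b)))] / denom if denom else 0.0

ic = IncrementalCosine()
ic.update("jazz", "music", 3)
ic.update("blues", "music", 2)
print(ic.cosine("jazz", "blues"))  # 1.0: a single shared context
```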

Raymond Vermaas, Damir Vandic, Flavius Frasincar
Impression-Aware Video Stream Retrieval System with Temporal Color-Sentiment Analysis and Visualization

To retrieve Web video intuitively, the concept of "impression" is of great importance, because many users consider feelings and moods to be among the most significant factors motivating them to watch videos. In this paper, we propose an impression-aware video stream retrieval system for querying the visual impression of video streams by analyzing the temporal change in sentiments. As a metric of visual impression, we construct a 180-dimensional vector space called the color-impression space; each dimension corresponds to a specific adjective representing human color perception. The main feature of this system is a context-dependent query processing mechanism that generates a ranking by considering the temporal transition of each video's visual impressions on viewers' emotions. We design an impression-aware noise reduction mechanism that dynamically reduces the number of non-zero features for each item mapped in the high-dimensional color-impression space by extracting the dominant salient impression features from a video stream. This system allows users to retrieve videos by submitting emotional queries such as "Find videos whose overall impression is happy and which have several sad and cool scenes". Through this query processing mechanism, users can effectively retrieve videos without requiring detailed information about them.

Shuichi Kurabayashi, Yasushi Kiyoki

Database Partitioning and Performance

Dynamic Workload-Based Partitioning for Large-Scale Databases

Applications with very large databases, where data items are continuously appended, are becoming more and more common. Efficient workload-based data partitioning is thus one of the main requirements for offering good performance to most of those applications that have complex access patterns, e.g., scientific applications. However, the existing workload-based approaches, which are executed in a static way, cannot be applied to very large databases. In this paper, we propose DynPart, a dynamic partitioning algorithm for continuously growing databases. DynPart efficiently adapts the data partitioning to the arrival of new data elements by taking into account the affinity of new data with queries and fragments. In contrast to existing static approaches, our approach offers a constant execution time, regardless of the size of the database, while achieving very good partitioning efficiency. We validated our solution through experimentation over real-world data; the results show its effectiveness.
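A toy sketch of affinity-based placement, assuming fragments and queries are just sets of item ids; the real affinity measure in DynPart is not reproduced here:

```python
# Hypothetical fragments and workload: each fragment holds item ids,
# each query is the set of item ids it touches.
fragments = {"f1": {1, 2}, "f2": {3}}
queries = [{1, 2, 9}, {3, 9}]

def place(new_item, fragments, queries):
    """Greedy placement: put the new item in the fragment whose items
    co-occur most often with it in the query workload (its affinity)."""
    def affinity(frag_items):
        return sum(1 for q in queries
                   if new_item in q and q & frag_items)
    best = max(fragments, key=lambda f: affinity(fragments[f]))
    fragments[best].add(new_item)
    return best

print(place(9, fragments, queries))  # 'f1' (ties broken by order)
```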

Miguel Liroz-Gistau, Reza Akbarinia, Esther Pacitti, Fabio Porto, Patrick Valduriez
Dynamic Vertical Partitioning of Multimedia Databases Using Active Rules

Vertical partitioning is a design technique widely employed in relational databases to reduce the number of irrelevant attributes accessed by queries. Currently, due to the popularity of multimedia applications on the Internet, the need for partitioning techniques in multimedia databases has arisen, in order to exploit their potential advantages for query optimization. In multimedia databases, the attributes tend to be very large multimedia objects, so reducing the number of accesses to irrelevant objects implies a considerable cost saving in query execution. Nevertheless, the use of vertical partitioning techniques in multimedia databases raises two problems: 1) most vertical partitioning algorithms only take into account alphanumeric data, and 2) the partitioning process is carried out in a static way. In order to address these problems, we propose an active system called DYMOND, which performs dynamic vertical partitioning in multimedia databases to improve query performance. Experimental results on benchmark multimedia databases confirm the validity of our system.

Lisbeth Rodríguez, Xiaoou Li
RTDW-bench: Benchmark for Testing Refreshing Performance of Real-Time Data Warehouse

In this paper we propose a benchmark, called RTDW-bench, for testing the performance of a real-time data warehouse (RTDW). The benchmark is based on TPC-H. In particular, RTDW-bench makes it possible to verify whether an already deployed RTDW is able to handle a transaction stream of a given arrival rate without delays. The benchmark also includes an algorithm for finding the maximum stream arrival rate that an RTDW can handle without delays. The applicability of the proposed benchmark was verified in an RTDW implemented in Oracle 11g.
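One generic way to search for the maximum sustainable arrival rate (the paper's own algorithm is not reproduced here) is exponential probing followed by binary search over a trial-running predicate:

```python
def max_sustainable_rate(handles, lo=1.0, tol=1.0):
    """Largest arrival rate a system sustains without delays:
    exponential probing, then binary search. `handles(rate)` runs a
    trial stream and returns True when no delay was observed."""
    hi = lo
    while handles(hi):          # grow until the system falls behind
        lo, hi = hi, hi * 2
    while hi - lo > tol:        # narrow down the breaking point
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if handles(mid) else (lo, mid)
    return lo

# Example with a fake system that copes below 750 tx/s:
print(max_sustainable_rate(lambda rate: rate < 750))
```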

Jacek Jedrzejczak, Tomasz Koszlajda, Robert Wrembel
Middleware and Language for Sensor Streams

This paper describes a new stream-based approach to sensor network programming: a language plus middleware. Since many data processing aspects of sensor networks can be expressed intuitively as streams, commands, and events, we apply such constructs to program individual nodes. Arbitrarily defined groups of nodes can be programmed with the same language and as a single program. We show that the language is indeed able to express different intentions, so that we can flexibly configure how the sensor network will process data and how data will flow; that it is able to generate efficient code with small memory footprints; and that it provides easy integration with platform code.

Pedro Furtado

Semantic Web

Statistical Analysis of the owl:sameAs Network for Aligning Concepts in the Linking Open Data Cloud

The massively distributed publication of linked data has brought to the attention of the scientific community both the limitations of classic data integration methods and the opportunity to push the boundaries of the field by experimenting with this collective enterprise that is the linking open data cloud. While reusing existing ontologies is the preferred choice, exploiting ontology alignments is still a required step for easing the burden of integrating heterogeneous data sets. Alignments, even between the most used vocabularies, are still poorly supported in today's systems, whereas links between instances are the most widely used means for bridging the gap between different data sets. In this paper we provide an account of our statistical and qualitative analysis of the network of instance-level equivalences in the Linking Open Data Cloud (i.e., the sameAs network) in order to automatically compute alignments at the conceptual level. Moreover, we explore the effect of ontological information when adapting classical Jaccard methods to the ontology alignment task. Automating this task will make it possible to achieve a clearer conceptual description of the data at the cloud level, while improving the level of integration between datasets.
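The conceptual-alignment step can be illustrated with plain Jaccard over concept extensions, assuming sameAs-linked instances have already been collapsed to canonical identifiers (hypothetical data):

```python
# Hypothetical extensions: concept -> set of instances, where instances
# connected by owl:sameAs links were collapsed into one identifier.
extensions = {
    "foaf:Person": {"alice", "bob", "carol"},
    "dbo:Person":  {"alice", "bob", "dave"},
    "dbo:Place":   {"vienna"},
}

def jaccard(a, b):
    """Similarity of two concepts as the overlap of their extensions."""
    return len(a & b) / len(a | b) if a | b else 0.0

pairs = [(c1, c2, jaccard(extensions[c1], extensions[c2]))
         for c1 in extensions for c2 in extensions if c1 < c2]
aligned = [(c1, c2, s) for c1, c2, s in pairs if s >= 0.5]
print(aligned)  # [('dbo:Person', 'foaf:Person', 0.5)]
```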

Gianluca Correndo, Antonio Penta, Nicholas Gibbins, Nigel Shadbolt
Paragraph Tables: A Storage Scheme Based on RDF Document Structure

Efficient query processing for RDF graphs is essential, because RDF is one of the most important frameworks supporting the semantic web and linked data. The performance of query processing depends on the storage layout. So far, a number of storage schemes for RDF graphs have been proposed. However, most approaches must frequently perform costly join operations, because they decompose an RDF graph into a set of triples, store them separately, and need to connect them again to reconstruct a graph that matches the query graph, a process that requires join operations. In this paper, we propose a storage scheme that stores RDF graphs as they are connected, without decomposition. We focus on RDF documents, where adjacent triples are highly related and often describe the same resource. We therefore define a set of adjacent triples that refer to the same resource as an RDF paragraph. Our approach constructs the table layout based on these RDF paragraphs. We evaluate the performance of our approach through experiments and demonstrate that it outperforms other approaches in query performance in most cases.

Akiyoshi Matono, Isao Kojima

Data Mining II

Continuously Mining Sliding Window Trend Clusters in a Sensor Network

Trend cluster discovery retrieves areas of spatially close sensors which measure a numeric random field having a prominent data trend along a time horizon. We propose a computation-preserving algorithm which employs an incremental learning strategy to continuously maintain sliding window trend clusters across a sensor network. Our proposal reduces the amount of data to be processed and, as a consequence, saves computation time. An empirical study proves the effectiveness of the proposed algorithm in keeping the computation cost of detecting sliding window trend clusters under control.

Annalisa Appice, Donato Malerba, Anna Ciampi
Generic Subsequence Matching Framework: Modularity, Flexibility, Efficiency

Subsequence matching has proven to be an ideal approach for solving many problems related to the fields of data mining and similarity retrieval. It has been shown that almost any data class (audio, image, biometrics, signals) is or can be represented by some kind of time series or string of symbols, which can serve as input for various subsequence matching approaches. The variety of data types, specific tasks, and their solutions is so wide that properly comparing and combining them for a particular task can be very complicated and time-consuming. In this work, we present a new generic Subsequence Matching Framework (SMF) that tries to overcome this problem with a uniform framework that simplifies and speeds up the design, development, and evaluation of subsequence matching systems. We identify several relatively separate subtasks that are solved differently across the literature; SMF makes it possible to combine them in a straightforward manner, achieving new quality and efficiency. The strictly modular architecture and openness of SMF also enable the incorporation of efficient solutions from different fields, for instance, advanced metric-based indexes.

David Novak, Petr Volny, Pavel Zezula

Distributed Systems

R-Proxy Framework for In-DB Data-Parallel Analytics

R is a powerful programming environment for data analysis. However, R is a main-memory-based functional programming environment, so when dealing with big data, data movement and memory swapping become the major performance bottleneck. As a result, executing a big-data-intensive R program can be many orders of magnitude less efficient than processing an SQL query for the same analytic task directly inside the database. Although there exist a number of "parallel-R" solutions, pushing R operations down to the parallel database layer, while retaining the natural R interface and the virtual R analytics flow, remains a very competitive alternative.

This has motivated us to develop the R-Vertica framework to scale out R applications through in-DB, data-parallel analytics. In order to extend the R programming environment to the space of parallel query processing transparently to R users, we introduce the notion of an R Proxy: an R object whose instance is maintained in the parallel database as partitioned data sets and whose schema (header) is retained in the memory-based R environment. A function (such as aggregation) applied to a proxy is pushed down to the parallel database layer as SQL queries or procedures, with the query results automatically returned and converted to R objects. By providing transparent two-way mappings between several major types of R objects and database tables or query results, the R environment and the underlying parallel database are seamlessly integrated. R object proxies may be created from database table schemas, from in-DB operations, or from operations that persist R objects to the database. The instances of R proxies can be retrieved into regular R objects using SQL queries. With this framework, an R application is expressed as an analytics flow in which R objects bear small data and R proxies represent, but do not bear, big data. The big data are manipulated, or flow, underneath the in-memory R environment in terms of in-DB, data-parallel operations.

We have implemented the proposed approach and used it to integrate several large-scale R applications with the multi-node Vertica parallel database system. Our experience illustrates the unique features and efficiency of this R-Vertica framework.
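A toy Python analogue of the proxy idea, with SQLite standing in for the parallel Vertica backend: only the schema lives in client memory, and aggregation is pushed down as SQL:

```python
import sqlite3

class TableProxy:
    """Toy analogue of an R proxy: holds only the table's name and
    schema in memory and pushes aggregations down to the database."""
    def __init__(self, conn, table):
        self.conn, self.table = conn, table
        self.columns = [c[1] for c in conn.execute(
            f"PRAGMA table_info({table})")]

    def mean(self, column):
        # The computation runs in the database; only a scalar returns.
        (value,) = self.conn.execute(
            f"SELECT AVG({column}) FROM {self.table}").fetchone()
        return value

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x REAL)")
conn.executemany("INSERT INTO t VALUES (?)", [(1.0,), (2.0,), (6.0,)])
proxy = TableProxy(conn, "t")
print(proxy.columns, proxy.mean("x"))  # ['x'] 3.0
```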

Qiming Chen, Meichun Hsu, Ren Wu, Jerry Shan
View Selection under Multiple Resource Constraints in a Distributed Context

The use of materialized views in commercial database systems and data warehousing systems is a common technique to improve query performance. Past research has investigated the view selection problem essentially in the centralized context. In this paper, we address the view selection problem in a distributed scenario. We first extend the AND-OR view graph to capture the distributed features. Then, we propose a solution using constraint programming for modeling and solving the view selection problem under multiple resource constraints in a distributed context. Finally, we experimentally evaluate the quality of the solutions in terms of cost savings and show that our approach provides better performance.

Imene Mami, Zohra Bellahsene, Remi Coletta

Web Searching and Query Answering

The Impact of Modes of Mediation on the Web Retrieval Process

This paper investigates mediation between users and Web search engines and the impact of different modes of mediation on Web search effectiveness. This involves the integration of explicit, implicit, and hybrid modes of mediation within a content-based framework, facilitated by the adoption of the Vector Space Model. The work is supported by an experimental evaluation of the impact of the different mediation modes on the document retrieval process in terms of recall and precision. The results of the experiments indicate that the mediation framework improves the quality of the retrieval process, and that the difference in the quality of the results is statistically significant.

Mandeep Pannu, Rachid Anane, Anne James
Querying a Semi-automated Data Integration System

A data integration system enables users to query a unified view of data sources through a global schema. We consider a semi-automatically built data integration system in the semantic web context, where data sources are annotated with ontologies. The global schema is also an ontology, expressed in DL-$Lite_{\cal A}$. Following the semi-automated building of the global schema in previous work, we focus here on the second part of this system, dedicated to the query answering process. We show how it can rely on either GAV or LAV mappings, which are automatically computed. We present algorithms for both cases, and we discuss the properties of this semi-automatically built data integration system.

Cheikh Niang, Béatrice Bouchou, Moussa Lo, Yacine Sam

Recommendation and Prediction Systems

A New Approach for Data Sharing and Recommendation in Social Web

The social Web is a set of social relations that link people through the World Wide Web. Typical social Web applications, including social media and social network services, have already become the mainstream of web applications. User orientation and user-generated content are pivotal characteristics of the social Web. Given the massive amount of unstructured user-generated data, data sharing and recommendation approaches play a more important role than information retrieval approaches for data diffusion. In this paper, we analyze the disadvantages of current data sharing and recommendation methods and propose an automatic group mining approach based on user preferences, which leads to broad data diffusion and improves the sociability between users. Intuitively, the essential idea of our approach is that users who have the same preferences towards a set of interesting topics can be gathered together as a Common Preferences Group (CPG). To evaluate the efficiency of the CPG mining algorithm and the accuracy of data recommendation based on our approach, our experiments use a dataset collected from the most popular image sharing site, Flickr. The experimental results demonstrate the superiority of our new approach for data sharing and recommendation in the social Web.

Dawen Jia, Cheng Zeng, Wenhui Nie, Zhihao Li, Zhiyong Peng
A Framework for Time-Aware Recommendations

Recently, recommendation systems have received significant attention. However, most existing approaches focus on recommending items of potential interest to users without taking into consideration how temporal information influences the recommendations. In this paper, we argue that time-aware recommendations need to be pushed to the foreground. We introduce an extensive model for time-aware recommendations from two perspectives. From a fresh-based perspective, we propose using a suite of aging schemes so that recommendations depend mostly on fresh and novel user preferences. From a context-based perspective, we focus on providing different suggestions under different temporal specifications. The proposed strategies are experimentally evaluated using real movie ratings.
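One possible aging scheme of the kind described, sketched with hypothetical ratings: exponentially decay each rating's contribution by its age, so a recent moderate rating can outrank an old high one:

```python
import math

# Hypothetical ratings: (item, rating, unix timestamp of the rating).
ratings = [("Alien", 5.0, 1_325_376_000),   # rated in January 2012
           ("Brave", 4.0, 1_343_779_200)]   # rated in August 2012

def aged_score(item, now, half_life_days=90.0):
    """Decay each rating's contribution exponentially with its age,
    so fresh preferences dominate the recommendation score."""
    score = 0.0
    for it, r, ts in ratings:
        if it == item:
            age_days = (now - ts) / 86_400
            score += r * math.exp(-math.log(2) * age_days / half_life_days)
    return score

now = 1_346_630_400  # early September 2012
# The recent 4.0 outweighs the stale 5.0:
for item in ("Alien", "Brave"):
    print(item, round(aged_score(item, now), 2))
```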

Kostas Stefanidis, Irene Ntoutsi, Kjetil Nørvåg, Hans-Peter Kriegel
A Hybrid Time-Series Link Prediction Framework for Large Social Network

With the fast growth of Web 2.0, social networking sites such as Facebook, Twitter, and LinkedIn are becoming increasingly popular. Link prediction, which identifies the future existence of links among entities in a social network so that user experiences can be improved, is an important and heavily discussed task in social network analysis. In this paper, we propose a hybrid time-series link prediction framework called DynamicNet for large social networks. Compared to existing work, our framework not only takes timing into consideration by using a time-series link prediction model but also combines the strengths of topological-pattern and probabilistic relational model (PRM) approaches. We evaluated our framework on three known corpora, and the favorable results indicate that our proposed approach is feasible.

Jia Zhu, Qing Xie, Eun Jung Chin

Query Processing II

A Comparison of Top-k Temporal Keyword Querying over Versioned Text Collections

As the web evolves over time, the amount of versioned text collections increases rapidly. Most web search engines answer a query by ranking all known documents at the (current) time the query is posed. There are applications, however (for example, customer behavior analysis and crime investigation), that need to efficiently query these sources as of some past time, that is, to retrieve the results as if the user were posing the query at a past time instant, thus accessing data known as of that time. Ranking and searching over versioned documents considers not only keyword constraints but also the time dimension, most commonly a time point or time range of interest. In this paper, we deal with top-k query evaluation with both keyword and temporal constraints over versioned textual documents. In addition to considering previous solutions, we propose novel data organization and indexing solutions: the first partitions data along ranking positions, while the other maintains the full ranking order through the use of a multiversion ordered list. We present an experimental comparison for both time-point and time-interval constraints. For time-interval constraints, different querying definitions, such as aggregation functions and consistent top-k queries, are evaluated. Experimental evaluations on large real-world datasets demonstrate the advantages of the newly proposed data organization and indexing approaches.

Wenyu Huo, Vassilis J. Tsotras
An Efficient SQL Rewrite Approach for Temporal Coalescing in the Teradata RDBMS

The importance of temporal data management is manifested by considerable attention from the database research community. This importance is becoming even more evident with the recent increasing support of temporal features in major commercial database systems. Among these systems, Teradata offers native support for a wide range of temporal analytics. In this paper, we address the problem of temporal coalescing in the Teradata RDBMS. Temporal coalescing is a key temporal query processing operation, which merges adjacent or overlapping timestamps of value-equivalent rows. Of the existing approaches to implementing temporal coalescing, pursuing an SQL-based approach is perhaps the most feasible and the most easily applicable. Along this direction, we propose an efficient SQL rewrite approach to implement temporal coalescing in the Teradata RDBMS by leveraging runtime conditional partitioning, a Teradata enhancement to ANSI ordered analytic functions, that makes it possible to express the coalescing semantics in an optimized, join-free, single-scan SQL query. We evaluated our proposed approach on a system running Teradata 14.0 with a performance study that demonstrates its efficiency.
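The merge semantics can be sketched outside SQL; a minimal Python analogue of coalescing value-equivalent rows with half-open periods (the paper expresses this as a single-scan SQL rewrite, not reproduced here):

```python
# Hypothetical rows: (value, start, end) with half-open [start, end)
# periods; coalescing merges adjacent or overlapping periods of
# value-equivalent rows.
rows = [("A", 1, 5), ("A", 5, 8), ("B", 2, 4), ("A", 10, 12)]

def coalesce(rows):
    out = []
    # Single scan over rows ordered by value, then start time.
    for value, start, end in sorted(rows):
        if out and out[-1][0] == value and start <= out[-1][2]:
            prev = out[-1]
            out[-1] = (value, prev[1], max(prev[2], end))  # extend period
        else:
            out.append((value, start, end))
    return out

print(coalesce(rows))
# [('A', 1, 8), ('A', 10, 12), ('B', 2, 4)]
```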

Mohammed Al-Kateb, Ahmad Ghazal, Alain Crolotte
HIP: Information Passing for Optimizing Join-Intensive Data Processing Workloads on Hadoop

Hadoop-based data processing platforms translate join-intensive queries into multiple "jobs" (MapReduce cycles). Such multi-job workflows lead to a significant amount of data movement through the disk, network, and memory fabric of a Hadoop cluster, which can negatively impact performance and scalability. Consequently, techniques that minimize the sizes of intermediate results are useful in this context. In this paper, we present an information passing technique (HIP) that can minimize the size of intermediate data on Hadoop-based data processing platforms.

Seokyong Hong, Kemafor Anyanwu

Query Processing III

All-Visible-k-Nearest-Neighbor Queries

The All-k-Nearest-Neighbor (AkNN) operation is common in many applications such as GIS and data analysis/mining. In this paper, for the first time, we study a novel variant of AkNN queries, namely the All-Visible-k-Nearest-Neighbor (AVkNN) query, which takes into account the impact of obstacles on the visibility of objects. Given a data set P, a query set Q, and an obstacle set O, an AVkNN query retrieves for each point/object in Q its visible k nearest neighbors in P. We formalize the AVkNN query, and then propose efficient algorithms for AVkNN retrieval, assuming that P, Q, and O are indexed by conventional data-partitioning indexes (e.g., R-trees). Our approaches employ pruning techniques and introduce a new pruning metric called VMDIST. Extensive experiments using both real and synthetic datasets demonstrate the effectiveness of our presented pruning techniques and the performance of our proposed algorithms.

Yafei Wang, Yunjun Gao, Lu Chen, Gang Chen, Qing Li
Algorithm for Term Linearizations of Aggregate Queries with Comparisons

We consider the problem of rewriting queries based exclusively on views. Both queries and views can contain aggregate functions and include arithmetic comparisons. To study the equivalence of a query with its rewriting query, the so-called "linearizations of a query" need to be computed. To find the linearizations of a query, the linearizations of terms from the query need to be generated. We propose an algorithm to find these term linearizations and give a bound on its time complexity.

Victor Felea, Violeta Felea
Evaluating Skyline Queries on Spatial Web Objects

Google Maps, Google Earth, and Bing Maps have stimulated the visual exploration of maps on the Web and have allowed users to query spatial web objects. A spatial web object has a geographical location and usually has an associated textual description or a link to a web document. To select the objects in response to a query, the data contained in their textual description or document must be inspected. Skyline is a query processing paradigm for queries over multiple criteria, where each criterion is equally important. In this paper we define Location-based Textual Skyline queries as those which use a spatial function and a function over descriptive text as the criteria in a Skyline query. For example, a Location-based Textual Skyline query may use the distance and the relevance of keywords in the text describing a spatial web object as the criteria to retrieve objects. We define a technique to evaluate this kind of query and present an experimental study of the proposed technique.
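The dominance test at the heart of such a skyline, sketched over hypothetical (distance, relevance) scores, lower distance and higher relevance both being preferable:

```python
# Hypothetical objects scored on two criteria: distance to the user
# (lower is better) and keyword relevance (higher is better).
objects = {"cafe": (0.2, 0.9), "bar": (0.5, 0.95), "shop": (0.6, 0.4)}

def dominates(a, b):
    """a dominates b if it is at least as close and as relevant,
    and strictly better on at least one criterion."""
    return (a[0] <= b[0] and a[1] >= b[1]) and a != b

skyline = [name for name, s in objects.items()
           if not any(dominates(t, s) for t in objects.values())]
print(skyline)  # ['cafe', 'bar']: 'shop' is dominated by 'cafe'
```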

Alfredo Regalado, Marlene Goncalves, Soraya Abad-Mota
Alternative Query Optimization for Workload Management

Systems with heavy workloads run many queries concurrently. Modern database workloads, such as those incurred by business intelligence applications, involve ad-hoc, highly complex, expensive queries. While query plans are optimized individually, the workload as a whole is not. Plans running together contend for resources, resulting in sub-optimal performance. To address this, we introduce the idea of alternative-objective query optimization. Multiple query plans are generated for the same query, each optimized for an alternative resource usage. At runtime, the workload manager can then choose the plan that works best under current conditions. This balances the system load and reduces contention, increasing overall workload throughput.

Zahid Abul-Basher, Yi Feng, Parke Godfrey, Xiaohui Yu, Mokhtar Kandil, Danny Zilio, Calisto Zuzarte

Searching II

Online Top-k Similar Time-Lagged Pattern Pair Search in Multiple Time Series

We extract relations among multiple time series in which a characteristic pattern in one time series follows a similar pattern in another; we call this a 'post-hoc relation'. To extract many post-hoc relations from a large number of time series, we investigated the problem of reducing the cost of online searching for the top-k similar time-lagged pattern pairs in multiple time series, where k is the query size. We propose an online top-k similar time-lagged pattern pair search method that manages a candidate cache in preparation for top-k pair updates and defines an upper-bound distance for each arrival time of pattern pairs. Our method also prunes dissimilar pattern pairs by using an index and the upper-bound distance. Experimental results show that our method successfully reduces the number of distance computations for a top-k similar pattern update.

Hisashi Kurasawa, Hiroshi Sato, Motonori Nakamura, Hajime Matsumura
Improving the Performance for the Range Search on Metric Spaces Using a Multi-GPU Platform

Nowadays, similarity search is becoming a field of increasing interest because these kinds of methods can be applied to different areas in science and engineering, for instance, pattern recognition and information retrieval. Such searches are carried out over metric indexes, decreasing the number of distance evaluations during the search process and thus improving its efficiency. However, in real applications processing large volumes of data, query response times can still be quite high. In this case, it is necessary to apply mechanisms to significantly reduce the average query response time. In this sense, the parallelization of metric structure processing is an interesting field of research. Modern GPU/Multi-GPU systems offer a very impressive cost/performance ratio. In this paper, we show a simple and fast implementation of a similarity search method on a Multi-GPU platform. The main contributions are the definition of a generic metric structure more suitable for GPU platforms, the efficient usage of the GPU memory system, and the implementation of the method on a Multi-GPU platform.

Roberto Uribe-Paredes, Enrique Arias, José L. Sánchez, Diego Cazorla, Pedro Valero-Lara
A Scheme of Fragment-Based Faceted Image Search

Retrieving desired images from large collections has become increasingly important. Traditionally, keyword search and content-based image retrieval have been used for image search. However, such retrieval methods scarcely consider the case when users have few terms for the desired images or few images similar to them. To address this problem, we use fragments of images extracted from the dataset as values of facets in faceted navigation. We extract parts of images as fragments, and then form several groups for each part to decide the representative images for faceted navigation. Our empirical user study, based on an example application using face image data, FUKUWARAI, shows that our proposal successfully supports users in finding desired images.

Takahiro Komamizu, Mariko Kamie, Kazuhiro Fukui, Toshiyuki Amagasa, Hiroyuki Kitagawa
Indexing Metric Spaces with Nested Forests

Searching for similar objects in a dataset, with respect to a query object and a distance, is a fundamental problem for several applications that use complex data. The main difficulties are focusing the search on as few elements as possible and further limiting the computationally expensive distance calculations between them. Here, we introduce a forest data structure for indexing and querying such data. The efficiency of our proposal is studied through experiments on real-world datasets and a comparison with previous proposals.

José Martinez, Zineddine Kouahla

Business Processes and Social Networking

Navigating in Complex Business Processes

In order to provide the information needed in knowledge-intensive business processes, large companies often establish intranet portals, which enable access to their process handbook. Especially for large business processes comprising hundreds or thousands of process steps, these portals can help to avoid time-consuming access to paper-based process documentation. However, business processes are usually presented in a rather static manner within these portals, e.g., as simple drawings or textual descriptions. Companies therefore require new ways of making large processes and process-related information better explorable for end users. This paper picks up this issue and presents a formal navigation framework based on linear algebra for navigating in large business processes.

Markus Hipp, Bela Mutschler, Manfred Reichert
Combining Information and Activities in Business Processes

This paper presents a notation, Chant, aimed at integrating the activity-centric perspective and the data-centric one in the domain of business processes. Chant process models provide a network-oriented structure in which the tasks, the choice blocks and the flow handlers are the nodes and their interconnections represent the information flows. The companion information models define the structure of the business entities: they place emphasis on mandatory attributes and associations so as to enable the readers to easily understand the effects brought about by the tasks.

Giorgio Bruno
Opinion Extraction Applied to Criteria

The success of information technologies and associated services (e.g., blogs, forums, etc.) makes it easy to express opinions on various topics on a massive scale. Recently, new techniques known as opinion mining have emerged. One of their main goals is to automatically extract a global trend from expressed opinions. While it is quite easy to get this overall assessment, a more detailed analysis reveals that opinions are expressed on more specific topics: one reviewer will acclaim a movie for its soundtrack while another will criticize it for its scenario. Opinion mining approaches have little explored this multicriteria aspect. In this paper we propose an automatic extraction of text segments related to a set of criteria. The opinion expressed in each text segment is then automatically extracted. From a small set of opinion keywords, our approach automatically builds a training set of texts from the web. A lexicon reflecting the polarity of words is then extracted from this training corpus and used to compute the polarity of the extracted text segments. Experiments show the efficiency of our approach.
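The final scoring step can be sketched as a lexicon lookup, with an invented polarity lexicon; segment extraction and lexicon learning are not shown:

```python
# Hypothetical polarity lexicon learned from a web-built corpus:
# positive weights for praising words, negative for critical ones.
lexicon = {"great": 1.0, "moving": 0.6, "dull": -0.8, "weak": -0.7}

def polarity(segment):
    """Score a criterion-specific text segment by summing the
    polarities of the lexicon words it contains."""
    return sum(lexicon.get(w, 0.0) for w in segment.lower().split())

# Segments already associated with criteria (soundtrack vs. scenario):
print(polarity("great moving soundtrack"))  # 1.6 -> positive
print(polarity("dull and weak scenario"))   # -1.5 -> negative
```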

Benjamin Duthil, François Trousset, Gérard Dray, Jacky Montmain, Pascal Poncelet
SocioPath: Bridging the Gap between Digital and Social Worlds

Every day, people use more and more digital resources (data, application systems, the Internet, etc.) for all aspects of their lives, such as financial management, private exchanges, and collaborative work. This leads to non-negligible dependences on distributed digital resources that reveal strong reliance at the social level. Users are often not aware of their real autonomy regarding the management of their digital resources. People underestimate the social dependences generated by the systems they use and the resulting potential risks. We argue that it is necessary to be aware of some key aspects of a system's architecture in order to know these dependences. This work proposes SocioPath, a generic meta-model for deriving the dependences generated by a system's architecture. It focuses on relations, such as access, control, and ownership, among the different entities of the system (digital resources, hardware, persons, etc.). Enriched with deduction rules and definitions, SocioPath reveals the dependence of a person on each entity in the system. This meta-model can be useful for evaluating a system, as a modeling tool that bridges the gap between the digital and the social worlds.

Nagham Alhadad, Philippe Lamarre, Yann Busnel, Patricia Serrano-Alvarado, Marco Biazzini, Christophe Sibertin-Blanc

Data Security, Privacy, and Organization

Detecting Privacy Violations in Multiple Views Publishing

We present a sound, data-value-dependent method for detecting privacy violations in the context of multiple-views publishing. We assume that a privacy violation takes the form of a linkage, that is, an identifier-privacy-value pair appearing in the same data record. First, we perform a theoretical study of the following security problem: given a set of views to be published, if linking two views does not violate privacy, what about three or more of them, and how many potential leaking channels are there? Then we propose a pre-processing algorithm for views which turns the multi-view violation detection problem into the single-view case. Next, we build a benchmark with a publicly available data set, the Adult Database from the UC Irvine Machine Learning Repository, and an identity data set generated using Fake Name Generator, a coherent database generator on the internet. Finally, we conduct experiments with the Cayuga complex event processing system; the results demonstrate that our approach is practical and well-suited to efficient privacy-violation detection.
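The linkage notion can be made concrete with a toy two-view join, using hypothetical views sharing a quasi-identifier attribute:

```python
# Hypothetical published views over the same underlying records.
view1 = [("alice", "zip-1010"), ("bob", "zip-1020")]     # (name, zip)
view2 = [("zip-1010", "diabetes"), ("zip-1020", "flu")]  # (zip, disease)

def linkages(v1, v2):
    """Join two views on their shared attribute and report the
    identifier-to-private-value linkages the join exposes."""
    by_key = {}
    for key, private in v2:
        by_key.setdefault(key, []).append(private)
    return [(name, p) for name, key in v1 for p in by_key.get(key, [])]

# Each pair is a potential privacy violation (a leaking channel):
print(linkages(view1, view2))
# [('alice', 'diabetes'), ('bob', 'flu')]
```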

Deming Dou, Stéphane Coulondre
Anomaly Discovery and Resolution in MySQL Access Control Policies

Managing hierarchical and fine-grained DBMS policies for a large number of users is a challenging task, and it increases the probability of introducing misconfigurations and anomalies. In this paper, we present a formal approach to discovering anomalies in database policies using Binary Decision Diagrams (BDDs), which allow finer-grained analysis and scalability. We present and formalize intra-table and inter-table redundancy anomalies using the popular MySQL database server as a case study. We also provide a mechanism for improving the performance of policy evaluation by upgrading rules from one grant table to another. We implemented our proposed approach as a tool called MySQLChecker. The experimental results show the efficiency of MySQLChecker in finding and resolving policy anomalies.

Mohamed Shehab, Saeed Al-Haj, Salil Bhagurkar, Ehab Al-Shaer
Backmatter
Metadata
Title
Database and Expert Systems Applications
Edited by
Stephen W. Liddle
Klaus-Dieter Schewe
A Min Tjoa
Xiaofang Zhou
Copyright year
2012
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-32597-7
Print ISBN
978-3-642-32596-0
DOI
https://doi.org/10.1007/978-3-642-32597-7
