
2015 | Book

Database and Expert Systems Applications

26th International Conference, DEXA 2015, Valencia, Spain, September 1-4, 2015, Proceedings, Part I

Editors: Qiming Chen, Abdelkader Hameurlain, Farouk Toumani, Roland Wagner, Hendrik Decker

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science

About this book

This two-volume set, LNCS 9261 and LNCS 9262, constitutes the refereed proceedings of the 26th International Conference on Database and Expert Systems Applications, DEXA 2015, held in Valencia, Spain, September 1-4, 2015.

The 40 revised full papers, presented together with 32 short papers and 2 keynote talks, were carefully reviewed and selected from 125 submissions. The papers discuss a range of topics including: temporal, spatial and high dimensional databases; semantic Web and ontologies; modeling, linked open data; NoSQL, NewSQL, data integration; uncertain data and inconsistency tolerance; database system architecture; data mining, query processing and optimization; indexing and decision support systems; modeling, extraction, social networks; knowledge management and consistency; mobility, privacy and security; data streams, Web services; distributed, parallel and cloud databases; information retrieval; XML and semi-structured data; data partitioning, indexing; data mining, applications; WWW and databases; data management algorithms.

These volumes also include accepted papers of the 8th International Conference on Data Management in Cloud, Grid and P2P Systems, Globe 2015, held in Valencia, Spain, September 2, 2015. The 8 full papers presented were carefully reviewed and selected from 13 submissions. The papers discuss a range of topics including: MapReduce framework: load balancing, optimization and classification; security, data privacy and consistency; query rewriting and streaming.

Table of Contents

Frontmatter

Keynote Talk

Frontmatter
Pattern Recognition in Embedded Systems: An Overview

Many Pattern Recognition (PR) tasks must be implemented with embedded systems, since interactions with humans and with the environment are among the most frequent functions they perform. Traditionally seen as an area of Artificial Intelligence (AI) that comes after the perception process, a classical PR application takes a set of physical measures obtained by sensors or communication subsystems and generates actions using actuators or provides information through a display or an I/O port.

This talk presents some of the application domains, optimization techniques and important tradeoffs associated with the computational issues and time constraints involved, along with open problems and solution methodologies.

Juan-Carlos Perez-Cortes

Temporal, Spatial and High Dimensional Databases

Frontmatter
Restricted Shortest Path in Temporal Graphs

The restricted shortest path (RSP) problem on directed networks is a well-studied problem, and has a large number of applications such as in Quality of Service routing. The problem is known to be NP-hard. In certain cases, however, the network is not static, i.e., edge parameters vary over time. In light of this, we extend the RSP problem for general networks to that for temporal networks. We present several exact algorithms for this problem, one of which uses heuristics, and is similar to the $$A^*$$ algorithm. We experimentally evaluate these algorithms by simulating them on both existing temporal networks, and synthetic ones. Furthermore, based on one of the pseudo-polynomial exact algorithms, we derive a fully polynomial time approximation scheme.

Sudip Biswas, Arnab Ganguly, Rahul Shah
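
To make the temporal setting concrete, here is a hedged Python sketch of a label-pruning search in the spirit of the problem statement. It simplifies the delay budget to an arrival deadline and assumes edges carry fixed departure/arrival times plus a cost; it is an illustration, not the authors' algorithms.

```python
from heapq import heappush, heappop

def temporal_rsp(edges, src, dst, deadline):
    """edges: iterable of (u, v, dep, arr, cost); an edge can be taken only
    if we are at u by time `dep`, and it delivers us to v at time `arr`.
    Returns the cheapest cost of reaching dst no later than `deadline`."""
    out = {}
    for u, v, dep, arr, cost in edges:
        out.setdefault(u, []).append((v, dep, arr, cost))
    best = {src: [(0, 0)]}      # node -> non-dominated (arrival, cost) labels
    heap = [(0, 0, src)]        # (cost, arrival_time, node), cheapest first
    while heap:
        cost, t, u = heappop(heap)
        if u == dst:
            return cost         # cheapest feasible journey found
        for v, dep, arr, w in out.get(u, ()):
            if dep < t or arr > deadline:
                continue        # edge already departed, or misses the deadline
            nc = cost + w
            labels = best.setdefault(v, [])
            if any(a <= arr and c <= nc for a, c in labels):
                continue        # dominated by an existing label at v
            labels[:] = [(a, c) for a, c in labels if not (arr <= a and nc <= c)]
            labels.append((arr, nc))
            heappush(heap, (nc, arr, v))
    return None                 # no journey reaches dst by the deadline
```

The (arrival, cost) labels are kept Pareto-optimal: a label is discarded when another one at the same node arrives no later and costs no more, which is what keeps such a search tractable in practice.
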
An Efficient Distributed Index for Geospatial Databases

The recent and rapid growth of GPS-enabled devices has resulted in an explosion of spatial data. There are three main challenges in managing and querying such data: the massive volume of data, the need for high insertion throughput, and the need to support real-time spatial queries. Although key–value store databases handle large-scale data effectively, they are not equipped with effective functions for supporting spatial data. To solve this problem, we propose an efficient spatial index structure based on HBase, a standard key–value store database. We first use Geohash as the rowkey in HBase to sustain high insert throughput. We then present a novel data structure, the binary Geohash rectangle-partition tree, that partitions data into subrectangles, and add these subrectangles into an R-Tree to support spatial queries. Our experiments demonstrate high scalability and improved spatial-query performance compared with a state-of-the-art system. They also show good real-time query-processing capability, with response times of less than one second.

Le Hong Van, Atsuhiro Takasu
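
The rowkey construction can be illustrated with the standard Geohash algorithm, whose prefix property is what clusters nearby points together in HBase. This is a generic sketch of the textbook encoding, not the paper's implementation:

```python
def geohash(lat, lon, precision=8):
    """Alternately bisect the longitude and latitude ranges, emitting one
    bit per step, and pack every 5 bits into a base-32 character."""
    base32 = "0123456789bcdefghjkmnpqrstuvwxyz"
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    code, bits, nbits, even = [], 0, 0, True
    while len(code) < precision:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        bits = (bits << 1) | (val >= mid)
        rng[0 if val >= mid else 1] = mid
        even, nbits = not even, nbits + 1
        if nbits == 5:           # 5 bits per base-32 character
            code.append(base32[bits])
            bits, nbits = 0, 0
    return "".join(code)

# Nearby points share rowkey prefixes, so spatially close records land in
# adjacent HBase regions: geohash(48.8584, 2.2945) starts with "u09" (Paris).
```
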
The xBR$$^+$$-tree: An Efficient Access Method for Points

Spatial indexes, such as those based on the Quadtree, are important in spatial databases for efficient execution of queries involving spatial constraints. In this paper, we present improvements of the xBR-tree (a member of the Quadtree family) with a modified internal node structure and tree building process, called the xBR$$^+$$-tree. We highlight the differences between the xBR and xBR$$^+$$-trees in the algorithms for processing single dataset queries, and we demonstrate performance results (I/O efficiency and execution time) of extensive experimentation (based on real and synthetic datasets) on the tree building process and the processing of single dataset queries, using the two structures. These results show that the two trees are comparable regarding their building performance; however, the xBR$$^+$$-tree is the overall winner regarding spatial query processing.

George Roumelis, Michael Vassilakopoulos, Thanasis Loukopoulos, Antonio Corral, Yannis Manolopoulos

Semantic Web and Ontologies

Frontmatter
Probabilistic Error Detecting in Numerical Linked Data

Linked Open Data (LOD) has become a vast repository with billions of triples available in thousands of datasets. One of the most pressing challenges in Linked Data management is detecting errors in numerical data. This study proposes a novel probabilistic framework that enables the detection of inconsistencies in numerical attributes of linked data, including not only integer, float and double values but also date values. We develop an automatic method to detect errors across multiple attributes that cannot be detected by considering a single attribute alone. Evaluations are performed on four DBpedia versions, from 3.7 to 2014, DBpedia being a central hub of the LOD cloud. Results show that our approach reaches $$96\,\%$$ precision when testing on DBpedia 2014 with threshold $$\alpha =0.9$$. We also compare the percentage distribution of errors between different DBpedia versions and analyze two basic classes of causes that lead to errors. Efficiency evaluation results confirm the scalability of our approach to large Linked Data repositories.

Huiying Li, Yuanyuan Li, Feifei Xu, Xinyu Zhong
From General to Specialized Domain: Analyzing Three Crucial Problems of Biomedical Entity Disambiguation

Entity disambiguation is the task of mapping ambiguous terms in natural-language text to their corresponding entities in a knowledge base. Most disambiguation systems focus on general-purpose knowledge bases like DBpedia but leave open the question of how those results generalize to more specialized domains. This is very important in the context of Linked Open Data, which forms an enormous resource for disambiguation. We implement a ranking-based (Learning To Rank) disambiguation system and provide a systematic evaluation of biomedical entity disambiguation with respect to three crucial and well-known properties of specialized disambiguation systems. These are (i) entity context, i.e. the way entities are described, (ii) user data, i.e. quantity and quality of externally disambiguated entities, and (iii) quantity and heterogeneity of entities to disambiguate, i.e. the number and size of different domains in a knowledge base. Our results show that (i) the choice of entity context that attains the best disambiguation results strongly depends on the amount of available user data, (ii) disambiguation results with large-scale and heterogeneous knowledge bases strongly depend on the entity context, (iii) disambiguation results are robust against a moderate amount of noise in user data and (iv) some results can be significantly improved with a federated disambiguation approach that uses different entity contexts. Our results indicate that disambiguation systems must be carefully adapted when expanding their knowledge bases with special domain entities.

Stefan Zwicklbauer, Christin Seifert, Michael Granitzer
Ontology Matching with Knowledge Rules

Ontology matching is the process of automatically determining the semantic equivalences between the concepts of two ontologies. Most ontology matching algorithms are based on two types of strategies: terminology-based strategies, which align concepts based on their names or descriptions, and structure-based strategies, which exploit concept hierarchies to find the alignment. In many domains, there is additional information about the relationships of concepts represented in various ways, such as Bayesian networks, decision trees, and association rules. We propose to use the similarities between these relationships to find more accurate alignments. We accomplish this by defining soft constraints that prefer alignments where corresponding concepts have the same local relationships encoded as knowledge rules. We use a probabilistic framework to integrate this new knowledge-based strategy with standard terminology-based and structure-based strategies. Furthermore, our method is particularly effective in identifying correspondences between complex concepts. Our method achieves better F-score than the state-of-the-art on three ontology matching domains.

Shangpu Jiang, Daniel Lowd, Dejing Dou

Modeling, Linked Open Data

Frontmatter
Detection of Sequences with Anomalous Behavior in a Workflow Process

A workflow process consists of an organized and repeatable pattern of activities that are necessary to complete a task within the dynamics of an organization. The automatic recognition of deviations from the expected behavior within the workflow of an organization is crucial to assist new employees in accomplishing their tasks. In this article, we propose a two-fold approach to this problem. First, taking the process logs as input, we automatically build a statistical model that captures regularities in the activities carried out by the employees. Second, this model is used to track the activities performed by the employees and detect deviations from the expected behavior, according to the normal workflow of the organization. An experimental evaluation with five process logs, with different levels of noise, was conducted to determine the validity of our approach.

Marcelo G. Armentano, Analía A. Amandi
An Energy Model for Detecting Community in PPI Networks

The community detection problem has been studied for years, but no agreement on a common definition of community has been reached so far. A single topology can be of different types, unipartite or bipartite/multipartite, and previous literature mostly proposes only one type of community. Since the definition of community is understood differently, the grouping results are typically applicable only to certain specified networks; if the definition changes, the grouping result is no longer “good”. In the present paper, we observe that some vertices “must be” in the same community, while others may or may not be in the same community. To perform community detection on mixed protein-protein interaction (PPI) networks, we propose a two-step energy model. First, vertices that “must be” in the same community are grouped together by their properties; second, overlapping vertices are found by their functions. In the energy model, “positive energy” is defined as the attraction generated between two vertices, and “negative energy” as the weakening of that attraction caused by other vertices. The energy between two vertices is the sum of their positive and negative energy. By computing the energy of each community, the community structure is found when the energy sum over all communities is maximized. The model is used to find community structure in PPI networks. The results show that the energy model is applicable to unipartite, bipartite or mixed PPI networks, finding vertices with similar properties/roles in the same community as well as overlapping vertices.

Yin Pang, Lin Bai, Kaili Bu
Filtering Inaccurate Entity Co-references on the Linked Open Data

The Linked Open Data (LOD) initiative relies heavily on the interconnections between different open RDF datasets where RDF links are used to connect resources. There has already been substantial research on identifying identity links between resources from different datasets, a process that is often referred to as co-reference resolution. These techniques often rely on probabilistic models or inference mechanisms to detect identity relations. However, recent studies have shown considerable inaccuracies in the LOD datasets that pertain to identity relations, e.g., owl:sameAs relations. In this paper, we propose a technique that evaluates existing identity links between LOD resources and identifies potentially erroneous links. Our work relies on the position and relevance of each resource with regards to the associated DBpedia categories modeled through two probabilistic category distribution and selection functions. Our experimental results show that our work is able to semantically distinguish inaccurate identity links even in cases when high syntactical similarity is observed between two resources.

John Cuzzola, Ebrahim Bagheri, Jelena Jovanovic
Quality Metrics for Linked Open Data

The vision of the Linked Open Data (LOD) initiative is to provide a model for publishing data and meaningfully interlinking such dispersed but related data. Despite the importance of data quality for the successful growth of the LOD, only limited attention has been paid to the quality of data prior to their publication on the LOD. This paper focuses on the systematic assessment of the quality of datasets prior to their publication on the LOD cloud. To this end, we identify important quality deficiencies that need to be avoided and/or resolved prior to the publication of a dataset. We then propose a set of metrics to measure and identify these quality deficiencies in a dataset. In this way, we enable the assessment and identification of undesirable quality characteristics of a dataset through the proposed metrics.

Behshid Behkamal, Mohsen Kahani, Ebrahim Bagheri

NoSQL, NewSQL, Data Integration

Frontmatter
A Framework of Write Optimization on Read-Optimized Out-of-Core Column-Store Databases

Column-store databases feature faster data reading and higher data compression efficiency than traditional row-based databases. However, optimizing write operations in a column-store database is a well-known challenge. Most existing work on write-performance optimization focuses on main-memory column-store databases. In this work, we investigate optimizing write operations (update and deletion) on out-of-core (OOC, or external-memory) column-store databases. We propose a general framework that works for both normal OOC storage and big data storage, such as the Hadoop Distributed File System (HDFS). On normal OOC storage, we propose an innovative data storage format called the Timestamped Binary Association Table (TBAT). Based on TBAT, a new update method, called the Asynchronous Out-of-Core Update (AOC Update), is designed to replace the traditional update. On big data storage, we further extend TBAT onto HDFS and propose the Asynchronous Map-Only Update (AMO Update) to replace the traditional update. Fast selection methods are developed in both contexts to improve data retrieval speed. Extensive experiments show a significant improvement in speed when performing write operations on TBAT in both normal and MapReduce environments.

Feng Yu, Wen-Chi Hou
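
A minimal sketch of the TBAT idea as described, with illustrative field names and API (assumptions, not the authors' format): writes become timestamped appends, and reads resolve a key to its newest version, which is what makes updates asynchronous and cheap on out-of-core column files.

```python
import itertools

class TBATColumn:
    """Append-only column: every write is a timestamped (key, value) record,
    so updates and deletes never rewrite the read-optimized base file."""
    _seq = itertools.count()   # monotonic stand-in for real timestamps

    def __init__(self):
        self.records = []                     # (timestamp, key, value)

    def write(self, key, value):
        # both inserts and updates are cheap, asynchronous-friendly appends
        self.records.append((next(self._seq), key, value))

    def delete(self, key):
        self.records.append((next(self._seq), key, None))  # tombstone

    def select(self, key):
        # scan newest-first so the latest version of the key wins
        for _, k, v in reversed(self.records):
            if k == key:
                return v
        return None

col = TBATColumn()
col.write(42, "a"); col.write(42, "b")
assert col.select(42) == "b"                  # latest version is returned
```
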
Integrating Big Data and Relational Data with a Functional SQL-like Query Language

Multistore systems have recently been proposed to provide integrated access to multiple, heterogeneous data stores through a single query engine. In particular, much attention is being paid to the integration of unstructured big data, typically stored in HDFS, with relational data. One main solution is to use a relational query engine that allows SQL-like queries to retrieve data from HDFS, which requires the system to provide a relational view of the unstructured data and hence is not always feasible. In this paper, we introduce a functional SQL-like query language that can integrate data retrieved from different data stores and take full advantage of the functionality of the underlying data processing frameworks by allowing the ad hoc usage of user-defined map/filter/reduce operators in combination with traditional SQL statements. Furthermore, the query language allows for optimization by enabling subquery rewriting so that filter conditions can be pushed inside and executed at the data store as early as possible. Our approach is validated with two data stores and a representative query that demonstrates the usability of the query language and evaluates the benefits of query optimization.

Carlyna Bondiombouy, Boyan Kolev, Oleksandra Levchenko, Patrick Valduriez
Comparative Performance Evaluation of Relational and NoSQL Databases for Spatial and Mobile Applications

Selecting the appropriate data management infrastructure is still a hard task for the designers of mobile applications with large volumes of data. Considering NoSQL needs for such applications, this paper demonstrates how the physical implementation of the database may impact query performance. Specifically, we consider the needs of mobile users that involve constant spatial data traffic, such as querying for points of interest, map visualization, zooming and panning, routing and location tracking. We define a workload and process such queries over three types of databases: relational, document-based and graph-based. Our evaluation shows that a fair comparison requires specific workloads for each mobile feature, but that is not possible using the industry’s standard benchmark tools. Overall, the paper shows that physical design must evolve to take advantage of the performance of NoSQL databases while keeping data consistency and integrity.

Pedro O. Santos, Mirella M. Moro, Clodoveu A. Davis Jr.

Uncertain Data and Inconsistency Tolerance

Frontmatter
Query Answering Explanation in Inconsistent Datalog$$+/-$$ Knowledge Bases

The paper addresses the problem of explaining Boolean Conjunctive Query (BCQ) entailment in the presence of inconsistency within the Ontology-Based Data Access (OBDA) setting, where inconsistency is handled by the intersection of closed repairs semantics (ICR) and the ontology is represented by Datalog$$+/-$$ rules. We address this problem in the case of both BCQ acceptance and failure by adopting a logical instantiation of an abstract argumentation model; that is, in order to explain why the query is accepted or failed, we look for proponent or opponent sets of arguments in favor of or against the query acceptance. We have also studied the computational complexity of the problem of finding an arbitrary explanation as well as all explanations.

Abdallah Arioua, Nouredine Tamani, Madalina Croitoru
PARTY: A Mobile System for Efficiently Assessing the Probability of Extensions in a Debate

In this paper we propose PARTY, a Probabilistic Abstract aRgumenTation sYstem that assesses the probability that a set of arguments is an extension according to a semantics. PARTY deals with five popular semantics, i.e., admissible, stable, complete, grounded, and preferred: it implements polynomial algorithms for computing the probability of the extensions for the admissible and stable semantics, and an efficient Monte-Carlo simulation algorithm for estimating the probability of the extensions for the other semantics, which have been shown to be intractable in [19, 20]. The experimental evaluation shows that PARTY is more efficient than state-of-the-art approaches and that it can be profitably executed on devices with limited computational resources.

Bettina Fazzinga, Sergio Flesca, Francesco Parisi, Adriana Pietramala
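
The Monte-Carlo idea can be sketched under an independence assumption between arguments and attacks; admissibility is used below only because its check is compact (PARTY computes admissible and stable probabilities exactly, in polynomial time). The data layout is an assumption, not the authors' code.

```python
import random

def mc_admissible_probability(args, attacks, p_arg, p_att, S, trials=10000):
    """Estimate the probability that the fixed set S is admissible when each
    argument a exists with probability p_arg[a] and each attack (a, b) with
    probability p_att[(a, b)], all independently (assumed model)."""
    hits = 0
    S = set(S)
    for _ in range(trials):
        A = {a for a in args if random.random() < p_arg[a]}
        R = {(a, b) for (a, b) in attacks
             if a in A and b in A and random.random() < p_att[(a, b)]}
        if not S <= A:
            continue                          # some member of S is absent
        if any((a, b) in R for a in S for b in S):
            continue                          # S is not conflict-free
        attackers = {a for (a, b) in R if b in S}
        if all(any((c, a) in R for c in S) for a in attackers):
            hits += 1                         # S defends all its members
    return hits / trials
```
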
Uncertain Groupings: Probabilistic Combination of Grouping Data

Probabilistic approaches for data integration have much potential [7]. We view data integration as an iterative process where data understanding gradually increases as the data scientist continuously refines their view on how to deal with learned intricacies like data conflicts. This paper presents a probabilistic approach for integrating data on groupings. We focus on a bio-informatics use case concerning homology. A bio-informatician has a large number of homology data sources to choose from. To enable querying combined knowledge contained in these sources, they need to be integrated. We validate our approach by integrating three real-world biological databases on homology in three iterations.

Brend Wanders, Maurice van Keulen, Paul van der Vet

Database System Architecture

Frontmatter
Cost-Model Oblivious Database Tuning with Reinforcement Learning

In this paper, we propose a learning approach to adaptive performance tuning of database applications. The objective is to validate the opportunity to devise a tuning strategy that does not need prior knowledge of a cost model. Instead, the cost model is learned through reinforcement learning. We instantiate our approach to the use case of index tuning. We model the execution of queries and updates as a Markov decision process whose states are database configurations, actions are configuration changes, and rewards are functions of the cost of configuration change and query and update evaluation. During the reinforcement learning process, we face two important challenges: not only the unavailability of a cost model, but also the size of the state space. To address the latter, we devise strategies to prune the state space, both in the general case and for the use case of index tuning. We empirically and comparatively evaluate our approach on a standard OLTP dataset. We show that our approach is competitive with state-of-the-art adaptive index tuning, which is dependent on a cost model.

Debabrota Basu, Qian Lin, Weidong Chen, Hoang Tam Vo, Zihong Yuan, Pierre Senellart, Stéphane Bressan
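
A hedged sketch of the cost-model-free ingredient: tabular Q-learning where the reward is a measured cost rather than an estimate from an optimizer cost model. States, actions and the `observed_cost` callback are illustrative assumptions; the paper's state-space pruning is not shown.

```python
import random
from collections import defaultdict

def q_learning_tuner(initial_config, actions, observed_cost,
                     episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """States are (hashable) index configurations; actions(s) returns the
    configuration changes applicable in s; observed_cost(s, a) applies the
    change, runs the next workload slice, and returns (measured_cost, s2).
    The reward is the negative *measured* cost, so no cost model is needed."""
    Q = defaultdict(float)
    s = initial_config
    for _ in range(episodes):
        if random.random() < eps:                 # explore
            a = random.choice(actions(s))
        else:                                     # exploit current estimates
            a = max(actions(s), key=lambda act: Q[(s, act)])
        cost, s2 = observed_cost(s, a)
        target = -cost + gamma * max(Q[(s2, act)] for act in actions(s2))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2
    return Q
```
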
Towards Making Database Systems PCM-Compliant

Phase Change Memory (PCM) is a new non-volatile memory technology that is comparable to traditional DRAM with regard to read latency, and markedly superior with regard to storage density and idle power consumption. Due to these desirable characteristics, PCM is expected to play a significant role in the next generation of computing systems. However, it also has limitations in the form of expensive writes and limited write endurance. Accordingly, recent research has investigated how database engines may be redesigned to suit DBMS deployments on the new technology.

In this paper, we address the pragmatic goal of minimally altering current implementations of database operators to make them PCM-conscious, the objective being to facilitate an easy transition to the new technology. Specifically, we target the implementations of the “workhorse” database operators: sort, hash join and group-by, and rework them to substantively improve the write performance without compromising on execution times. Concurrently, we provide simple but effective estimators of the writes incurred by the new techniques, and these estimators are leveraged for integration with the query optimizer.

Our new techniques are evaluated on TPC-H benchmark queries with regard to the following metrics: number of writes, response times and wear distribution. The experimental results indicate that the PCM-conscious operators collectively reduce the number of writes by a factor of 2 to 3, while concurrently improving the query response times by about 20 % to 30 %. When combined with the appropriate plan choices, the improvements are even higher. In essence, our algorithms provide both short-term and long-term benefits. These outcomes augur well for database engines that wish to leverage the impending transition to PCM-based computing.

Vishesh Garg, Abhimanyu Singh, Jayant R. Haritsa
Workload-Aware Self-Tuning Histograms of String Data

In this paper we extend STHoles, a very successful algorithm that uses query results to build and maintain multi-dimensional histograms of numerical data. Our contribution is the formal definition of extensions of all relevant concepts, such that they are independent of the domain of the data but subsume the STHoles concepts as their numerical specialization. At the same time, we also derive specializations for the string domain and implement these into a prototype that we use to empirically validate our approach. Our current implementation uses string prefixes as the machinery for describing string ranges. Although weaker than regular expressions, prefixes can be applied very efficiently and can capture interesting ranges in hierarchically structured string domains, such as those of filesystem pathnames and URIs. In fact, we base the empirical validation of the approach on existing, publicly available Semantic Web data, where we demonstrate convergence to accurate and efficient histograms.

Nickolas Zoulis, Effrosyni Mavroudi, Anna Lykoura, Angelos Charalambidis, Stasinos Konstantopoulos
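
The prefix machinery the paper names can be sketched in a few lines: prefix ranges are closed under intersection exactly when one prefix extends the other, which is what makes them cheap bucket descriptors (illustrative code, assumptions ours):

```python
def covers(prefix, s):
    """A prefix denotes the range of all strings that start with it."""
    return s.startswith(prefix)

def intersect(p, q):
    """Two prefix ranges overlap iff one prefix extends the other;
    the intersection is then the longer prefix (otherwise it is empty)."""
    if q.startswith(p):
        return q
    if p.startswith(q):
        return p
    return None

# e.g. histogram buckets over pathname-like strings:
assert intersect("/usr/", "/usr/local/") == "/usr/local/"
assert intersect("/usr/", "/var/") is None
```
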

Data Mining I

Frontmatter
Data Partitioning for Fast Mining of Frequent Itemsets in Massively Distributed Environments

Frequent itemset mining (FIM) is one of the fundamental cornerstones of data mining. While the problem of FIM has been thoroughly studied, few standard or improved solutions scale. This is mainly the case when (i) the amount of data tends to be very large and/or (ii) the minimum support (MinSup) threshold is very low. In this paper, we propose a highly scalable, parallel frequent itemset mining (PFIM) algorithm, namely Parallel Absolute Top Down (PATD). The PATD algorithm renders the mining process of very large databases (up to terabytes of data) simple and compact. Its mining process is made up of only one parallel job, which dramatically reduces the mining runtime, the communication cost and the energy consumption overhead in a distributed computational platform. Based on a clever and efficient data partitioning strategy, namely Item Based Data Partitioning (IBDP), the PATD algorithm mines each data partition independently, relying on an absolute minimum support (AMinSup) instead of a relative one. PATD has been extensively evaluated using real-world data sets. Our experimental results suggest that the PATD algorithm is significantly more efficient and scalable than alternative approaches.

Saber Salah, Reza Akbarinia, Florent Masseglia
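
The two ingredients named in the abstract can be sketched as follows; this is a hedged illustration, and the real IBDP placement policy is more elaborate than this anchor-based projection:

```python
def absolute_minsup(rel_minsup, db_size):
    # An absolute threshold lets each partition decide a pattern's global
    # frequency locally, without a count-exchange phase between workers.
    return int(rel_minsup * db_size)

def item_based_partition(transactions, n_parts):
    """Each sorted transaction is projected once per item: the projection
    anchored at `anchor` carries every itemset whose largest member is
    `anchor`, so those itemsets can be mined in a single partition."""
    parts = [[] for _ in range(n_parts)]
    for t in transactions:
        items = sorted(t)
        for i, anchor in enumerate(items):
            parts[hash(anchor) % n_parts].append((anchor, items[:i + 1]))
    return parts
```

When mining a partition, an itemset X is counted only in projections whose anchor equals max(X), so each itemset is counted in exactly one partition and the absolute threshold can be applied locally.
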
Does Multilevel Semantic Representation Improve Text Categorization?

This paper presents a novel approach for text categorization by fusing “Bag-of-words” (BOW) word features and multilevel semantic features (SF). By extending Online LDA (OLDA) into a multilevel topic model for learning a semantic space with different topic granularities, multilevel semantic features are extracted for representing text content. The effectiveness of our approach is evaluated on both a large-scale Wikipedia corpus and the middle-sized 20newsgroups dataset. The former experiment shows that our approach is able to perform semantic feature extraction on a large-scale dataset. It also demonstrates that topics generated at different topic levels have different semantic scopes, which makes them more appropriate for representing text content. Our classification experiments on 20newsgroups achieved 82.19 % accuracy, which illustrates the effectiveness of fusing BOW and SF features. A further investigation of word and semantic feature fusion shows that Support Vector Machine (SVM) is more sensitive to semantic features than Naive Bayes (NB), K Nearest Neighbor (KNN) and Decision Tree (DT). It is shown that appropriately fusing low-level word features and high-level semantic features can achieve equally good or even better results than the state-of-the-art, with reduced feature dimension and computational complexity.

Cheng Wang, Haojin Yang, Christoph Meinel
Parallel Canopy Clustering on GPUs

Canopy clustering is a preprocessing method for standard clustering algorithms such as k-means and hierarchical agglomerative clustering. Canopy clustering can greatly reduce the computational cost of clustering algorithms. However, canopy clustering itself may also take a vast amount of time for handling massive data if implemented naïvely. To address this problem, we present efficient algorithms and implementations of canopy clustering on GPUs, which have evolved recently as general-purpose many-core processors. We not only accelerate the computation of original canopy clustering, but also propose an algorithm using a grid index. This algorithm partitions the data into cells to reduce redundant computations and, at the same time, to exploit the parallelism of GPUs. Experiments show that the proposed implementations on the GPU are on average 2 times faster than multi-threaded, SIMD implementations on two octa-core CPUs.

Yusuke Kozawa, Fumitaka Hayashi, Toshiyuki Amagasa, Hiroyuki Kitagawa
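
For reference, a serial sketch of standard canopy clustering (the paper parallelizes this on GPUs and adds a grid index; the thresholds and distance function are caller-supplied):

```python
import random

def canopy(points, t1, t2, dist):
    """Standard canopy clustering: points within the loose threshold t1 of a
    randomly chosen center join its canopy; points within the tight
    threshold t2 are removed from further consideration. Requires t1 > t2."""
    assert t1 > t2
    candidates = list(points)
    canopies = []
    while candidates:
        center = candidates.pop(random.randrange(len(candidates)))
        members, remaining = [center], []
        for p in candidates:
            d = dist(center, p)
            if d < t1:
                members.append(p)       # joins this canopy
            if d >= t2:
                remaining.append(p)     # may still seed/join other canopies
        candidates = remaining
        canopies.append(members)
    return canopies
```

Because each pass removes every point within t2 of the chosen center, the candidate set shrinks quickly and the expected cost stays far below all-pairs clustering.
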

Query Processing and Optimization

Frontmatter
Efficient Storage and Query Processing of Large String in Oracle

Variable-size strings are a fundamental data type in RDBMSs and are used in virtually all database components and applications, including XML, blogging, customer service comments, e-commerce product descriptions, etc. Many applications require large strings to store XML documents, JSON documents, customer support history, blog entries, or HTML documents. Many social network applications as well as Web 3.0 applications also require a large string type. A naïve implementation would simply increase the size of the traditional variable-size strings, but this incurs performance problems due to row chaining. Using the Large Object (LOB) type enables users to store large strings without row chaining, but LOBs are difficult to manipulate and many built-in operators for strings are not applicable to them. Oracle 12c provides the capability of storing large strings without the row-chaining problem while eliminating LOBs’ deficiencies. In addition, users can control the data placement and storage format based on their application workload. However, reading a large string from storage for each reference to it would be inefficient for queries that reference the strings frequently. This paper presents an efficient processing strategy for queries involving large strings, while supporting a theoretically unlimited string size. It illustrates how seemingly simple conceptual work involves careful design and extensive engineering to achieve a scalable and efficient implementation. The solution has been implemented in Oracle 12c, and the performance results show its efficiency.

George Eadon, Eugene Inseok Chong, Ananth Raghavan
SAM: A Sorting Approach for Optimizing Multijoin Queries

Finding the optimal join order for a multijoin query is an old, yet very important topic for relational database systems. It has been studied for the last few decades and proven to be NP-hard. The mainstream techniques, first proposed in System R, are based on dynamic programming. These techniques are widely adopted by commercial database systems.

However, it is well known that such approaches suffer from exponential running time in finding the optimal join order for most queries, except simple ones like linear queries. Therefore, a query optimizer must resort to finding a suboptimal join order when the number of tables is large.

This paper proposes SAM, which departs from current practice in two ways: (1) SAM orders the joining attributes before ordering the tables; (2) SAM sorts the tables by comparing selectivities for “table blocks”. This approach reduces the exponential time complexity in the optimization; in particular, it can find, in polynomial time, the optimal ordering for clique queries that take exponential time to optimize by dynamic programming.

Experiments comparing SAM to the query optimizers in MySQL and PostgreSQL, using real data, show that its performance is similar for small queries, but much better for large queries.

Yong Zeng, Amy Nan Lu, Lu Xia, Chris Xing Tian, Y. C. Tay
GPU Acceleration of Set Similarity Joins

We propose a scheme for efficient set similarity joins on Graphics Processing Units (GPUs). Due to the rapid growth and diversification of data, there is an increasing demand for fast execution of set similarity joins in applications that vary from data integration to plagiarism detection. To tackle this problem, our solution takes advantage of the massive parallel processing offered by GPUs. Additionally, we employ MinHash to estimate the similarity between two sets in terms of Jaccard similarity. By exploiting the high parallelism of GPUs and the space efficiency provided by MinHash, we can achieve high performance without renouncing accuracy. Experimental results show that our proposed method is more than two orders of magnitude faster than a serial CPU implementation, and 25 times faster than a parallel CPU implementation, while generating highly precise query results.

Mateus S. H. Cruz, Yusuke Kozawa, Toshiyuki Amagasa, Hiroyuki Kitagawa
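
The MinHash estimate the paper relies on can be sketched directly: the probability that two sets share the same minimum under a random hash equals their Jaccard similarity, so averaging over many hash functions gives an unbiased estimate (the hash family below is an arbitrary illustration):

```python
import random

def minhash_signature(items, hashers):
    # one minimum per hash function
    return [min(h(x) for x in items) for h in hashers]

def estimated_jaccard(sig_a, sig_b):
    # fraction of colliding min-hashes approximates the Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Illustrative universal-hash family (parameters are arbitrary):
P = (1 << 61) - 1
def make_hasher(rng):
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda x: (a * hash(x) + b) % P

rng = random.Random(7)
hashers = [make_hasher(rng) for _ in range(128)]
s1, s2 = {"a", "b", "c", "d"}, {"b", "c", "d", "e"}
sig1, sig2 = minhash_signature(s1, hashers), minhash_signature(s2, hashers)
print(estimated_jaccard(sig1, sig2))   # close to |s1 & s2| / |s1 | s2| = 0.6
```
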

Data Mining II

Frontmatter
Parallel Eclat for Opportunistic Mining of Frequent Itemsets

Mining frequent itemsets is an essential data mining problem. As the big data era arrives, the size of databases is becoming so large that traditional algorithms do not scale well. An approach to the issue is to parallelize the mining algorithm, which however is a challenge that has not been well addressed yet. In this paper, we propose a MapReduce-based algorithm, Peclat, that parallelizes the vertical mining algorithm, Eclat, with three improvements. First, Peclat proposes a hybrid vertical data format to represent the data, which saves both space and time in the mining process. Second, Peclat adopts the pruning technique from the Apriori algorithm to improve the efficiency of breadth-first search. Third, Peclat employs an ordering of itemsets that helps balance the workload. Extensive experiments demonstrate that Peclat outperforms the existing MapReduce-based algorithms significantly.

Junqiang Liu, Yongsheng Wu, Qingfeng Zhou, Benjamin C. M. Fung, Fanghui Chen, Binxiao Yu
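
For context, the serial Eclat skeleton that Peclat parallelizes: the database is stored vertically as item-to-tidset mappings, and support counting is tidset intersection (a generic textbook sketch, not Peclat's hybrid format):

```python
def eclat(prefix, items, minsup, out):
    """`items` is a list of (item, tidset) pairs; the support of an itemset
    is the size of the intersection of its members' tidsets."""
    while items:
        item, tids = items.pop()
        if len(tids) < minsup:
            continue
        itemset = prefix + [item]
        out.append((itemset, len(tids)))
        # extend the itemset with each remaining item, intersecting tidsets
        suffix = [(j, tids & tj) for j, tj in items if len(tids & tj) >= minsup]
        eclat(itemset, suffix, minsup, out)

# toy vertical database: item -> set of transaction ids
vertical = [("a", {1, 2, 3}), ("b", {2, 3, 4}), ("c", {1, 2, 3, 4})]
result = []
eclat([], list(vertical), 2, result)
print(result)   # every itemset with support >= 2, with its support count
```
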
Sequential Data Analytics by Means of Seq-SQL Language

Ubiquitous devices and applications generate data whose natural feature is order. Most commercial software and research prototypes for data analytics allow analyzing set-oriented data, neglecting their order. However, by analyzing both data and their order dependencies, one can discover new business knowledge. Few solutions in this field have been proposed so far, and all of them lack a comprehensive approach to organizing and processing such data in a data-warehouse-like manner. In this paper, we contribute an SQL-like query language for analyzing sequential data in an OLAP-like manner, its prototype implementation, and a performance evaluation.

Bartosz Bebel, Tomasz Cichowicz, Tadeusz Morzy, Filip Rytwiński, Robert Wrembel, Christian Koncilia
Clustering Attributed Multi-graphs with Information Ranking

Attributed multi-graphs are data structures that model real-world networks of objects which have rich properties/attributes and are connected by multiple types of edges. Clustering attributed multi-graphs has several real-world applications, such as recommendation systems and targeted advertising. In this paper, we propose an efficient method for Clustering Attributed Multi-graphs with Information Ranking, namely CAMIR. We introduce an iterative algorithm that ranks the different vertex attributes and edge-types according to how well they can separate vertices into clusters. The key idea is to consider the ‘agreement’ among the attribute- and edge-types, assuming that two vertex properties ‘agree’ if they produce the same clustering result when used individually. Furthermore, according to the calculated ranks, we construct a unified similarity measure by down-weighting noisy vertex attributes or edge-types that may reduce the clustering accuracy. Finally, to generate the final clusters, we follow a spectral clustering approach, suitable for graph partitioning and detecting arbitrarily shaped clusters. In our experiments with synthetic and real-world datasets, we show the superiority of CAMIR over several state-of-the-art clustering methods.

Andreas Papadopoulos, Dimitrios Rafailidis, George Pallis, Marios D. Dikaiakos

Indexing and Decision Support Systems

Frontmatter
Building Space-Efficient Inverted Indexes on Low-Cardinality Dimensions

Many modern applications naturally lead to the implementation of inverted indexes for effectively managing large collections of data items. Creating an inverted index on a low cardinality data domain results in replication of data descriptors, leading to increased storage overhead. For example, the use of RFID or similar sensing devices in supply-chains results in massive tracking datasets that need effective spatial or spatio-temporal indexes on them. As the volume of data grows proportionally larger than the number of spatial locations or time epochs, it is unavoidable that many of the resulting lists share large subsets of common items. In this paper we present techniques that exploit this characteristic of modern big-data applications in order to losslessly compress the resulting inverted indexes by discovering large common item sets and adapting the index so as to store just one copy of them. We apply our method in the supply chain domain using modern big-data tools and show that our techniques in many cases achieve compression ratios that exceed 50 %.

Vasilis Spyropoulos, Yannis Kotidis
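
A toy version of the factoring idea under stated assumptions (the paper discovers large common item sets across many lists; here only the best pairwise intersection is factored, for illustration):

```python
def factor_common(lists):
    """Factor the largest pairwise intersection out of a dict of inverted
    lists, storing it once as a shared chunk that both lists reference."""
    keys = list(lists)
    best, pair = set(), None
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            common = lists[keys[i]] & lists[keys[j]]
            if len(common) > len(best):
                best, pair = common, {keys[i], keys[j]}
    if not pair:
        return lists, {}
    chunks = {"chunk0": frozenset(best)}
    compressed = {k: ((v - best) | {"chunk0"}) if k in pair else v
                  for k, v in lists.items()}
    return compressed, chunks

# two RFID locations sharing most tracked items:
lists = {"loc1": {1, 2, 3, 4, 5}, "loc2": {1, 2, 3, 4, 6}, "loc3": {7, 8}}
print(factor_common(lists))
# loc1 -> {5, 'chunk0'}, loc2 -> {6, 'chunk0'}, chunk0 -> {1, 2, 3, 4}
```
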
A Decision Support System for Hotel Facilities Inventory Management

A major goal of a tourism supply chain is a profitable collaboration between the actors involved. Small hotel facilities tend to order small amounts of each good, so the unit cost is generally unfavorable compared to that of large hotel facilities. To overcome this disadvantage, small facilities can collaborate by placing aggregate orders with a single vendor; the increased quantity ordered can then afford a unit-cost reduction. This paper investigates the effectiveness of a set of novel demand-forecast techniques for supporting this order aggregation. We describe four different algorithms: all of them use the orders’ history, and two of them additionally forecast the numbers of guests. The performed tests use a large amount of anonymized real-world data and show that the algorithms that also use the numbers of guests perform better than those based only on the orders’ history.

Giuseppe Monteleone, Raffaele Di Natale, Piero Conca, Salvatore Michele Biondi, Antonio Rosario Intilisano, Vincenzo Catania, Daniela Panno
TopCom: Index for Shortest Distance Query in Directed Graph

Finding the shortest distance between two vertices in a graph is an important problem due to its numerous applications in diverse domains, including geo-spatial databases, social network analysis, and information retrieval. Classical algorithms (such as Dijkstra's) solve this problem in polynomial time, but they cannot provide real-time responses for a large number of bursty queries on a large graph. So, indexing-based solutions that pre-process the graph for efficiently answering (exactly or approximately) a large number of distance queries in real time are becoming increasingly popular. Existing solutions have varying performance in terms of index size, index building time, query time, and accuracy. In this work, we propose TopCom, a novel indexing-based solution for exactly answering distance queries in a directed acyclic graph (DAG). Our experiments with two existing state-of-the-art methods (IS-Label and TreeMap) show the superiority of TopCom over these two methods in terms of scalability and query time.

Vachik S. Dave, Mohammad Al Hasan
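
As a baseline for what such an index avoids recomputing: shortest distance in a DAG can be found in one pass over a topological order, and TopCom's contribution is an index so that queries need not re-run such a traversal. The sketch below is the textbook baseline, not TopCom itself:

```python
from collections import defaultdict, deque

def dag_distance(n, edges, s, t):
    """Single-source shortest distance in a DAG with n vertices: relax each
    weighted edge (u, v, w) exactly once, in topological order of u."""
    adj, indeg = defaultdict(list), [0] * n
    for u, v, w in edges:
        adj[u].append((v, w))
        indeg[v] += 1
    order, q = [], deque(i for i in range(n) if indeg[i] == 0)
    while q:                         # Kahn's algorithm for the topo order
        u = q.popleft()
        order.append(u)
        for v, _ in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)
    INF = float("inf")
    dist = [INF] * n
    dist[s] = 0
    for u in order:
        if dist[u] == INF:
            continue                 # u is unreachable from s
        for v, w in adj[u]:
            dist[v] = min(dist[v], dist[u] + w)
    return dist[t]                   # INF if t is unreachable
```
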
A Universal Distributed Indexing Scheme for Data Centers with Tree-Like Topologies

The indices in distributed storage systems manage the stored data and support diverse queries efficiently. A secondary index, i.e., an index built on attributes other than the primary key, facilitates a variety of queries for different purposes. In this paper, we propose U$$^2$$-Tree, a universal distributed secondary indexing scheme built on cloud storage systems with tree-like topologies. U$$^2$$-Tree is composed of two layers, the global index and the local index. We build the local index according to the local data features, and then assign the potential indexing range of the global index for each host. After that, we use several techniques to publish the metadata about the local index to the global index host. The global index is then constructed based on the collected intervals. We take advantage of the topological properties of tree-like topologies, and introduce and compare detailed optimization techniques for the construction of the two-layer indexing scheme. Furthermore, we discuss index updating, index tuning, and the fault tolerance of U$$^2$$-Tree. Finally, we conduct numerical experiments to evaluate the performance of U$$^2$$-Tree. The universal indexing scheme provides a general approach to secondary indexing on data centers with tree-like topologies, and many of its techniques and conclusions can be applied to other DCN topologies.

Yuang Liu, Xiaofeng Gao, Guihai Chen

Data Mining III

Frontmatter
Improving Diversity Performance of Association Rule Based Recommender Systems

In recommender systems (RSs), achieving high accuracy alone does not improve user satisfaction. In the literature, efforts have been made to improve the variety/diversity of recommendations for higher user satisfaction; in these efforts, accuracy is sacrificed to improve the variety of recommendations. In this paper, we propose an approach to improve the diversity as well as the accuracy of association-rule-based RSs. We propose a ranking mechanism, called diverse rank, to rank association rules based on the diversity of the items in the pattern. The recommendations are made based on association rules with high confidence and diversity. The experimental results on the real-world MovieLens data set show that the proposed approach improves the performance of association-rule-based RSs with better diversity without compromising accuracy.

M. Kumara Swamy, P. Krishna Reddy
A Prime Number Based Approach for Closed Frequent Itemset Mining in Big Data

Mining big datasets poses a number of challenges that are not easily addressed by traditional mining methods, since both memory and computational requirements are hard to satisfy. One solution for dealing with such requirements is to take advantage of parallel frameworks, such as MapReduce, that allow powerful computing and storage units to be built on top of ordinary machines. In this paper, we address the issue of mining closed frequent itemsets (CFI) from big datasets in such environments. We introduce a new parallel algorithm, called CloPN, for CFI mining. One noteworthy feature of CloPN is that it uses a prime-number-based approach to transform the data into numerical form, and then mines closed frequent itemsets using only multiplication and division operations. We carried out exhaustive experiments over big real-world datasets to assess the performance of CloPN. The obtained results highlight that our algorithm is very efficient in CFI mining from large real-world datasets with up to 53 million articles.

Mehdi Zitouni, Reza Akbarinia, Sadok Ben Yahia, Florent Masseglia
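
The prime-number trick can be shown in a few lines: with items mapped to distinct primes and transactions to products, itemset containment reduces to a divisibility test, so support counting needs only multiplication and modulo (a toy sketch, not CloPN's parallel implementation):

```python
def first_primes(n):
    """Tiny prime generator for the illustration."""
    primes, cand = [], 2
    while len(primes) < n:
        if all(cand % p for p in primes):
            primes.append(cand)
        cand += 1
    return primes

def encode(transactions):
    """Map each item to a distinct prime; a transaction becomes the product
    of its items' primes, so X is contained in T iff code(X) divides code(T)."""
    items = sorted({i for t in transactions for i in t})
    code = dict(zip(items, first_primes(len(items))))
    def enc(itemset):
        v = 1
        for i in itemset:
            v *= code[i]
        return v
    return code, [enc(t) for t in transactions]

code, db = encode([{"a", "b"}, {"a", "b", "c"}, {"b", "c"}])
x = code["a"] * code["b"]               # itemset {a, b}
support = sum(1 for t in db if t % x == 0)
print(support)                          # {a, b} occurs in 2 transactions
```
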
Multilingual Documents Clustering Based on Closed Concepts Mining

The scarcity of bilingual and multilingual parallel corpora has prompted many researchers to accentuate the need for new methods to enhance the quality of comparable corpora. In this paper, we highlight the interest and usefulness of Formal Concept Analysis in multilingual document clustering to improve corpora comparability. We propose a statistical approach for clustering multilingual documents, based on multilingual closed concepts mining, that partitions the documents belonging to one or more collections, written in more than one language, into a set of classes. Experimental evaluation was conducted on two collections and showed a significant improvement in the comparability of the generated classes.

Mohamed Chebel, Chiraz Latiri, Eric Gaussier

Modeling, Extraction, Social Networks

Frontmatter
Analyzing the Strength of Co-authorship Ties with Neighborhood Overlap

Evaluating researchers’ scientific productivity usually relies on bibliometry only, which may not always be fair. Here, we take a step forward in analyzing such data by exploring the strength of co-authorship ties in social networks. Specifically, we build co-authorship social networks by extracting the datasets of three research areas (sociology, medicine and computer science) from a real digital library and analyze how topological properties relate to the strength of ties. Our results show that different topological properties explain variations in the strength of co-authorship ties, depending on the research area. Also, we show that neighborhood overlap can be applied to scientific productivity evaluation and analysis beyond bibliometry.

Michele A. Brandão, Mirella M. Moro
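
For reference, the neighborhood overlap of an edge (u, v) is commonly defined as follows (the paper may use a variant of this standard formulation):

$$\mathit{overlap}(u,v) = \frac{|N(u) \cap N(v)|}{|(N(u) \cup N(v)) \setminus \{u,v\}|}$$

Values near 0 indicate weak, bridge-like ties, while values near 1 indicate strongly embedded co-authorship ties.
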
Event Extraction from Unstructured Text Data

We extend a bootstrapping method that was initially developed for extracting relations from webpages to the problem of extracting content from large collections of short unstructured text. Such data appear as field notes in enterprise applications and as messages in social media services. The method iteratively learns sentence patterns that match a set of representative event mentions and then extracts different mentions using the learnt patterns. At every step, the semantic similarity between the text and set of patterns is used to determine if the pattern was matched. Semantic similarity is calculated using the WordNet lexical database. Local structure features such as bigrams are extracted where possible from the data to improve the accuracy of pattern matching. We rank and filter the learnt patterns to balance the precision and recall of the approach with respect to extracted events. We demonstrate this approach on two different datasets. One is a collection of field notes from an enterprise dataset. The other is a collection of “tweets” collected from the Twitter social network. We evaluate the accuracy of the extracted events when method parameters are varied.

Chao Shang, Anand Panangadan, Viktor K. Prasanna
A Cluster-Based Epidemic Model for Retweeting Trend Prediction on Micro-blog

Tweets spreading on social micro-blogs bear some similarity to epidemic spread. Based on the findings of a user study on tweets’ short-term retweeting characteristics, we extend the classic Susceptible-Infected-Susceptible (SIS) epidemic model for retweeting-trend prediction, featuring multiple retweeting peaks, retweeting lifetime, and total retweeting amount. We cluster micro-blog users with similar retweeting influence together, and train the model using the least-squares method on historic retweeting data to obtain the different groups’ retweeting rates. We demonstrate its effectiveness on a real micro-blog platform.

Zhuonan Feng, Yiping Li, Li Jin, Ling Feng
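
For reference, the classic SIS dynamics being extended are (with $$S$$ the susceptible users, $$I$$ the users actively retweeting, $$\beta$$ the per-contact retweeting rate and $$\gamma$$ the rate at which users stop retweeting; the paper fits such rates per user cluster by least squares):

$$\frac{dS}{dt} = -\beta S I + \gamma I, \qquad \frac{dI}{dt} = \beta S I - \gamma I$$
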
Backmatter
Metadata
Title
Database and Expert Systems Applications
Editors
Qiming Chen
Abdelkader Hameurlain
Farouk Toumani
Roland Wagner
Hendrik Decker
Copyright Year
2015
Electronic ISBN
978-3-319-22849-5
Print ISBN
978-3-319-22848-8
DOI
https://doi.org/10.1007/978-3-319-22849-5
