
2018 | Book

Advances in Databases and Information Systems

22nd European Conference, ADBIS 2018, Budapest, Hungary, September 2–5, 2018, Proceedings

Edited by: Prof. András Benczúr, Prof. Dr. Bernhard Thalheim, Tomáš Horváth

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the proceedings of the 22nd European Conference on Advances in Databases and Information Systems, ADBIS 2018, held in Budapest, Hungary, in September 2018.

The 17 regular papers presented together with two invited papers were carefully reviewed and selected from numerous submissions. The papers are organized in topical sections on information extraction and integration; data mining and knowledge discovery; indexing, query processing and optimization; data quality and data cleansing; distributed data platforms, including cloud data systems, key-value stores, and big data systems; streaming data analysis; and web, XML and semi-structured databases.

Table of Contents

Frontmatter

Invited Papers

Frontmatter
Database-Centric Scientific Computing
(In Memoriam Jim Gray)
Abstract
Working with Jim Gray, we set out more than 20 years ago to design and build the archive for the Sloan Digital Sky Survey (SDSS), the SkyServer. The SDSS project collected a huge data set over a large fraction of the Northern Sky and turned it into an open resource for the world’s astronomy community. Over the years the project has changed astronomy. Now the project is faced with the problem of how to ensure that the data will be preserved and kept alive for active use for another 15 to 20 years. At the time we started there were very few examples to learn from, and we had to invent much of the system ourselves. The paper discusses the lessons learned and future directions, and recalls some memorable moments of our collaboration.
Alexander S. Szalay
Spatio-Temporal Data Mining of Major European River and Mountain Names Reveals Their Near Eastern and African Origins
Abstract
This paper presents a spatio-temporal data mining study of the origin of the names of the 218 longest European rivers. The study shows that 35.2% of these river names originate in the Near East and Southern Caucasus. The study also investigates the origin of European mountain names, showing that at least 26 mountain names originate from Africa.
Peter Z. Revesz

Information Extraction and Integration

Frontmatter
Query Rewriting for Heterogeneous Data Lakes
Abstract
The increasing popularity of NoSQL systems has led to the model of polyglot persistence, in which several data management systems with different data models are used. Data lakes realize the polyglot persistence model by collecting data from various sources, by storing the data in its original structure, and by providing the datasets for querying and analysis. Thus, one of the key tasks of data lakes is to provide a unified querying interface, which is able to rewrite queries expressed in a general data model into a union of queries for data sources spanning heterogeneous data stores. To address this challenge, we propose a novel framework for query rewriting that combines logical methods for data integration based on declarative mappings with a scalable big data query processing system (i.e., Apache Spark) to efficiently execute the rewritten queries and to reconcile the query results into an integrated dataset. Because of the diversity of NoSQL systems, our approach is based on a flexible and extensible architecture that currently supports the major data structures, such as relational data, semi-structured data (e.g., JSON, XML), and graphs. We show the applicability of our query rewriting engine with six real-world datasets and demonstrate its scalability using an artificial data integration scenario with multiple storage systems.
Rihan Hai, Christoph Quix, Chen Zhou
RawVis: Visual Exploration over Raw Data
Abstract
Data exploration and visual analytics systems are of great importance in Open Science scenarios, where less tech-savvy researchers wish to access and visually explore big raw data files (e.g., JSON, CSV) generated by scientific experiments using commodity hardware, without being overwhelmed by the tedious processes of data loading, indexing and query optimization. In this paper, we present a framework for efficient query processing on raw data files in interactive visual exploration scenarios. The framework, named RawVis, is built on top of a lightweight in-memory tile-based index, VALINOR, which is constructed on the fly given the first user query over a raw file and adapted based on the user interaction. We evaluate the performance of our prototype implementation against three alternatives and show that our method outperforms them in terms of response time, disk accesses and memory consumption.
Nikos Bikakis, Stavros Maroulis, George Papastefanatos, Panos Vassiliadis
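To make the tile-based idea concrete, here is a minimal sketch of building an in-memory tile index over a raw CSV file on first access; the column names, tile size, and flat-dictionary structure are illustrative assumptions, and VALINOR itself is adaptive and considerably richer than this.

```python
import csv
from collections import defaultdict

def build_tile_index(path, x_col, y_col, tile_size=0.1):
    """Illustrative sketch only: map each 2-D tile to the offsets of the
    raw-file rows that fall into it, so later exploratory queries read
    only the tiles intersecting the user's visible area.
    x_col / y_col / tile_size are hypothetical parameters."""
    index = defaultdict(list)
    with open(path, newline="") as f:
        for offset, row in enumerate(csv.DictReader(f)):
            key = (int(float(row[x_col]) // tile_size),
                   int(float(row[y_col]) // tile_size))
            index[key].append(offset)
    return index
```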

Data Mining and Knowledge Discovery

Frontmatter
Extended Margin and Soft Balanced Strategies in Active Learning
Abstract
Nowadays, active learning is gaining increasing interest in the computer vision community, especially for images. The most commonly used query strategy framework is uncertainty sampling, usually in a pool-based sampling scenario. In this paper we propose two query strategies for image classification under the uncertainty sampling framework, both of them improvements of existing techniques. The first strategy, called Extended Margin, incorporates all possible class labels to calculate the informativeness values of unlabeled instances. The second strategy, the so-called Soft Balanced approach, improves the recently published BAL method: we suggest a new final informativeness score combining an uncertainty measure with a novel penalty metric. We used the least margin criterion for the former, and the latter was calculated from the categorical penalty scores using soft assignment. We conducted experiments on 60 different test image sets, each a randomly selected subset of the Caltech101 image collection. The experiments were performed in an extended active learning environment, and the results show that Extended Margin outperforms the least margin approach and the Soft Balanced method outperforms all other competitor methods.
Dávid Papp, Gábor Szűcs
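As a point of reference, the least margin criterion that the paper builds on can be sketched in a few lines; the probability matrix below is hypothetical, and the paper's Extended Margin generalizes this idea by taking all class labels into account rather than just the top two.

```python
import numpy as np

def least_margin(probs: np.ndarray) -> np.ndarray:
    """Least margin criterion: difference between the two largest class
    probabilities per instance; a smaller margin means higher uncertainty."""
    top2 = np.partition(probs, -2, axis=1)[:, -2:]  # two largest per row
    return top2[:, 1] - top2[:, 0]

# Hypothetical posteriors for four unlabeled images over three classes.
probs = np.array([[0.50, 0.45, 0.05],
                  [0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25],
                  [0.34, 0.33, 0.33]])
query_idx = np.argmin(least_margin(probs))  # most uncertain instance -> 3
```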
Location-Awareness in Time Series Compression
Abstract
We present our initial findings regarding the impact that time series compression may have on similarity queries, in settings in which the elements of the dataset are accompanied by additional contexts. Broadly, the main objective of any data compression approach is to provide a more compact (i.e., smaller size) representation of a given dataset. However, as has been observed in the large body of work on compression of spatial data, applying a particular algorithm “blindly” may yield outcomes that defy intuitive expectations, e.g., distorting certain topological relationships that exist in the “raw” data [7]. In this study, we quantify this distortion by defining a measure of similarity distortion based on Kendall’s τ. We evaluate this measure, and the correspondingly achieved compression ratio, for the five most commonly used time series compression algorithms and the three most common time series similarity measures. We report some of our observations here, along with a discussion of possible broader impacts and the challenges that we plan to address in the future.
Xu Teng, Andreas Züfle, Goce Trajcevski, Diego Klabjan
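One plausible reading of such a Kendall's-τ-based distortion measure (an illustration under our own assumptions, not necessarily the paper's exact definition) is to compare how a query ranks the dataset before and after compression; here `compress` could be, e.g., a piecewise aggregate approximation and `distance` the Euclidean distance, both placeholders.

```python
import numpy as np
from scipy.stats import kendalltau

def similarity_distortion(query, dataset, compress, distance):
    """Rank all series by distance to `query`, raw vs. compressed; since
    Kendall's tau is rank-based, it can be computed on the distances
    directly. Returns 0 when the two rankings agree perfectly."""
    d_raw  = np.array([distance(query, x) for x in dataset])
    d_comp = np.array([distance(compress(query), compress(x)) for x in dataset])
    tau, _ = kendalltau(d_raw, d_comp)
    return 1.0 - tau
```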

Indexing, Query Processing and Optimization

Frontmatter
Efficient SPARQL Evaluation on Stratified RDF Data with Meta-data
Abstract
The Resource Description Framework (RDF) is a simple but frequently used W3C standard that uses triples to define relationships between resources. In this paper we investigate the evaluation of queries in the query language SPARQL on RDF data with meta-data. We first show that if the data are stratified, i.e., a particular partial order can be defined on the meta-data labels, then a nesting procedure can be applied, which induces a rewriting of the query. Based on a specification by an Abstract State Machine, we show that the result of the rewritten query equals the one that would have resulted from the evaluation of the original query. We further investigate the reduction of complexity achieved by using data and query nesting.
Flavio Ferrarotti, Senén González, Klaus-Dieter Schewe
SIMD Vectorized Hashing for Grouped Aggregation
Abstract
Grouped aggregation is a commonly used analytical function. Its common implementation using hashing techniques suffers from a reduced throughput rate due to collisions of the inserted keys. During a collision, the underlying technique searches for an alternative location to insert the key. Searching for an alternative location increases the processing time for an individual key, thereby degrading the overall throughput. In this work, we use Single Instruction Multiple Data (SIMD) vectorization to search multiple slots at once, followed by direct aggregation of results. We present experimental results for our vectorized grouped aggregation with various open-addressing hashing techniques over several dataset distributions, along with our inferences on them. Among our findings, we observe different impacts of vectorization on these techniques: linear probing and two-choice hashing improve their performance with vectorization, whereas cuckoo and hopscotch hashing show a negative impact. Overall, we provide the basic structure of a dedicated SIMD-accelerated grouped aggregation framework that can be adapted to different hashing techniques.
Bala Gurumurthy, David Broneske, Marcus Pinnecke, Gabriel Campero, Gunter Saake
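The paper's implementation presumably relies on CPU SIMD intrinsics; as a language-level illustration of the core idea, the NumPy sketch below probes several hash-table slots per step with one vectorized comparison (linear probing shown; the lane width and empty-slot marker are assumptions).

```python
import numpy as np

LANES = 8        # slots compared per step, e.g. AVX2 with 32-bit keys
EMPTY = -1       # assumed marker for a free slot

def probe(table: np.ndarray, key: int, start: int) -> int:
    """Linear probing that inspects LANES slots per iteration via one
    vectorized compare, mimicking a SIMD search. Returns the slot
    holding `key`, or the first empty slot for insertion."""
    n = len(table)
    for base in range(start, start + n, LANES):
        idx = (base + np.arange(LANES)) % n       # wrap around the table
        block = table[idx]
        hit = np.flatnonzero(block == key)        # SIMD-style compare mask
        if hit.size:
            return idx[hit[0]]
        free = np.flatnonzero(block == EMPTY)
        if free.size:                             # key absent: insert here
            return idx[free[0]]
    return -1                                     # table full
```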
Selecting Sketches for Similarity Search
Abstract
Hamming embedding techniques, which produce bit string sketches, have recently been applied successfully to speed up similarity search. Sketches are usually compared by the Hamming distance and applied to filter out non-relevant objects during query evaluation. As several sketching techniques exist and each can produce sketches of different lengths, it is hard to select a proper configuration for a particular dataset. We assume that the (dis)similarity of objects is expressed by an arbitrary metric function, and we propose a way to efficiently estimate the quality of sketches using just a small sample set of data. Our approach is based on a probabilistic analysis of sketches which describes how well separated objects are after projection to the Hamming space.
Vladimir Mic, David Novak, Lucia Vadicamo, Pavel Zezula
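The filtering step described here is simple to sketch: compare bit-string sketches by XOR-and-popcount and pass only close candidates on to the expensive metric distance (a minimal sketch; the radius parameter and integer sketch encoding are assumptions).

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance of two equal-length sketches stored as ints
    (int.bit_count() requires Python 3.10+)."""
    return (a ^ b).bit_count()

def filter_by_sketch(query_sketch: int, sketches: list[int], radius: int):
    """Keep only objects whose sketch lies within `radius` of the query's;
    the costly metric function is then evaluated on the survivors only."""
    return [i for i, s in enumerate(sketches)
            if hamming(query_sketch, s) <= radius]
```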
On the Support of the Similarity-Aware Division Operator in a Commercial RDBMS
Abstract
The division operator from relational algebra allows a simple and intuitive representation of queries with the concept of “for all”, and thus it is required in many real applications. However, the relational division is unable to support the needs of modern applications that manipulate complex data, such as images, audio, long texts, genetic sequences, etc. These data are better compared by similarity, whereas relational algebra always compares data by equality or inequality. Recent works focus on extending relational operators to support similarity comparisons and on their inclusion in relational database management systems. This work incorporates and studies the behavior of several similarity-aware division algorithms in a commercial RDBMS. We compared the two state-of-the-art algorithms across several SQL statements and identified when to use each of them in order to improve query execution time. We then propose an extension of the SQL syntax and the query analyzer to support this new operator.
Guilherme Q. Vasconcelos, Daniel S. Kaster, Robson L. F. Cordeiro

Data Quality and Data Cleansing

Frontmatter
Data Quality in a Big Data Context
Abstract
In each of the phases of a Big Data analysis process, data quality (DQ) plays a key role. Given the particular characteristics of the data at hand, the traditional DQ methods used for relational databases, based on quality dimensions and metrics, must be adapted and extended in order to capture the new characteristics that Big Data introduces. This paper dives into this problem, redefining the DQ dimensions and metrics for a Big Data scenario where data may arrive, for example, as unstructured documents in real time. This general scenario is instantiated to study the concrete case of Twitter feeds. Further, the paper describes the implementation of a system that acquires tweets in real time and computes the quality of each tweet, applying the quality metrics that are defined formally in the paper. The implementation includes a web user interface that allows filtering the tweets, for example by keywords, and visualizing the quality of a data stream in many different ways. Experiments are performed and their results discussed.
Franco Arolfo, Alejandro Vaisman
Integrating Approximate String Matching with Phonetic String Similarity
Abstract
Well-defined dictionaries of tagged entities are used in many tasks to identify entities when the scope is limited and there is no need to use machine learning. One common solution is to encode the input dictionary into a trie to find matches in an input text. However, the size of the dictionary and the presence of spelling errors in the input tokens have a negative influence on such solutions. We present an approach that transforms the dictionary and each input token into a compact, well-known phonetic representation. The resulting dictionary is encoded in a trie that is about 72% smaller than a non-phonetic trie. We perform inexact matching over this representation to filter a set of initial results. Lastly, we apply a second similarity measure to select the best result to annotate a given entity. The experiments show that the approach achieves good F1 results. The solution was developed as an entity recognition plug-in for GATE, a well-known information extraction framework.
Junior Ferri, Hegler Tissot, Marcos Didonet Del Fabro
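The abstract does not name the phonetic encoding used; as an illustration of the general recipe, the sketch below uses a simplified Soundex (H/W treated like vowels) as a stand-in and shows how distinct spellings collapse onto one trie path, which is what shrinks the dictionary.

```python
def soundex(word: str) -> str:
    """Simplified Soundex, a stand-in for the paper's phonetic encoding:
    keep the first letter, map consonants to digit classes, drop vowels
    and adjacent duplicate codes, pad to four characters."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    word = word.upper()
    out, prev = word[0], codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]

def trie_insert(root: dict, key: str, entry: str) -> None:
    """Insert a phonetic key into a nested-dict trie; terminals keep the
    original dictionary entries for the later exact re-ranking step."""
    node = root
    for ch in key:
        node = node.setdefault(ch, {})
    node.setdefault("$", []).append(entry)

trie = {}
for name in ["Meyer", "Maier", "Mayr"]:     # hypothetical dictionary entries
    trie_insert(trie, soundex(name), name)  # all three share the path M-6-0-0
```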

Distributed Data Platforms, Including Cloud Data Systems, Key-Value Stores, and Big Data Systems

Frontmatter
Cost-Based Sharing and Recycling of (Intermediate) Results in Dataflow Programs
Abstract
In data analytics, researchers often work on the same datasets, investigating different aspects, and moreover develop their programs in an incremental manner. This opens opportunities to share and recycle results from previously executed jobs if they contain identical operations, e.g., restructuring, filtering and other kinds of data preparation.

In this paper, we present an approach to accelerate processing of such dataflow programs by materializing and recycling (intermediate) results in Apache Spark. We have implemented this idea in our Pig Latin compiler for Spark called Piglet, which transparently supports both merging of multiple jobs and rewriting jobs to reuse intermediate results. We discuss the opportunities for recycling, and present a profiling-based cost model as well as a decision model to identify potentially beneficial materialization points. Finally, we report the results of our experimental evaluation, showing the validity of the cost model and the benefit of recycling.
Stefan Hagedorn, Kai-Uwe Sattler
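The decision at each candidate materialization point boils down to a trade-off that can be stated in one inequality (a generic illustration, not Piglet's actual cost model; all parameters are assumptions).

```python
def should_materialize(cost_compute: float, cost_write: float,
                       cost_read: float, expected_reuses: int) -> bool:
    """Materialize an intermediate result when the one-time write plus
    all future reads are cheaper than recomputing it for every reuse."""
    return (cost_write + expected_reuses * cost_read
            < expected_reuses * cost_compute)
```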
ATUN-HL: Auto Tuning of Hybrid Layouts Using Workload and Data Characteristics
Abstract
Ad-hoc analysis implies processing data in near real-time. Thus, raw data (i.e., neither normalized nor transformed) is typically dumped into a distributed engine, where it is generally stored in a hybrid layout. Hybrid layouts divide data into horizontal partitions, and inside each partition data are stored vertically. They keep statistics for each horizontal partition and also support encoding (i.e., dictionaries) and compression to reduce the size of the data. Their built-in support for many ad-hoc operations (e.g., selection, projection, aggregation) makes hybrid layouts the best choice for most such operations.

The horizontal partition and dictionary sizes of hybrid layouts are configurable and directly impact the performance of analytical queries. Hence, their default configuration cannot be expected to be optimal for all scenarios. In this paper, we present ATUN-HL (Auto TUNing Hybrid Layouts), which, based on a cost model and given the workload and the characteristics of the data, finds the best values for these parameters. We prototyped ATUN-HL for Apache Parquet, an open-source implementation of hybrid layouts in the Hadoop Distributed File System, to show its effectiveness. Our experimental evaluation shows that ATUN-HL provides on average 85% of the potential performance improvement, and a 1.2x average speedup over the default configuration.
Rana Faisal Munir, Alberto Abelló, Oscar Romero, Maik Thiele, Wolfgang Lehner
Set Similarity Joins with Complex Expressions on Distributed Platforms
Abstract
A set similarity join finds all similar pairs in a collection of sets. This operation is essential for many important tasks in Big Data analytics, including string data integration and cleaning. The vast majority of set similarity join algorithms proposed so far consider string data represented by a single set over which a simple similarity predicate is defined. However, real data is typically multi-attribute and thus better represented by multiple sets. Such a representation requires complex expressions to capture a given notion of similarity. Moreover, similarity join processing under this new formulation is clearly more expensive, which calls for distributed algorithms to deal with large datasets. In this paper, we present a distributed algorithm for set similarity joins with complex similarity expressions. Our approach supports complex Boolean expressions over multiple predicates. We propose a simple but effective data partitioning strategy to reduce both communication and computation costs. We have implemented our algorithm in Spark, a popular distributed data processing engine. Experimental results show that the proposed approach is efficient and scalable.
Diego Junior do Carmo Oliveira, Felipe Ferreira Borges, Leonardo Andrade Ribeiro, Alfredo Cuzzocrea
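To make “complex expressions over multiple sets” concrete, a record with two set-valued attributes might be matched by a Boolean conjunction of per-attribute Jaccard predicates, as in the minimal sketch below (the attribute names and thresholds are hypothetical, and the nested loop only illustrates the predicate semantics, not the paper's distributed algorithm).

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def similarity_join(R, S, t_title=0.8, t_authors=0.5):
    """Naive O(|R||S|) join with a complex predicate over two set-valued
    attributes; the paper's partitioning strategy avoids this full scan."""
    for i, r in enumerate(R):
        for j, s in enumerate(S):
            # Boolean conjunction of per-attribute similarity predicates.
            if (jaccard(r["title"], s["title"]) >= t_title and
                    jaccard(r["authors"], s["authors"]) >= t_authors):
                yield (i, j)
```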

Streaming Data Analysis

Frontmatter
Deterministic Model for Distributed Speculative Stream Processing
Abstract
Users of modern distributed stream processing systems have to choose between non-deterministic computations and high latency due to the need for excessive buffering. We introduce a speculative model based on a MapReduce-complete set of operations that allows us to achieve both determinism and low latency. Experiments show that our prototype can outperform existing solutions thanks to the low overhead of optimistic synchronization.
Igor E. Kuralenok, Artem Trofimov, Nikita Marshalkin, Boris Novikov
Large-Scale Real-Time News Recommendation Based on Semantic Data Analysis and Users’ Implicit and Explicit Behaviors
Abstract
Online news portals constantly produce a huge amount of content about different events and topics. In such data stream scenarios, delivering relevant recommendations that best suit each user’s interests is a challenging task. Indeed, tight time constraints and highly dynamic conditions in these environments make traditional batch recommendation approaches ineffective. In this paper, we present a scalable news recommendation system that takes into account data semantics, trending topics, users’ behaviors and the usage context in order to (1) model news articles, (2) infer users’ preferences and (3) provide real-time suggestions. Our proposal is based on the semantic analysis of news articles’ content in order to extract relevant keywords and referenced named entities. This information is then used to model users’ interests by analyzing their attitudes while interacting with the available content. Moreover, our proposal accounts for the temporal variance of a news article’s utility by considering its freshness, popularity and attractiveness. To prove its quality, scalability and efficiency in real-time data streaming environments, the system was evaluated during the CLEF-NEWSREEL challenge, connecting recommender systems to an active large-scale news delivery platform. Experimental results show that our system delivers high-quality and reliable performance in such dynamic environments.
Hemza Ficel, Mohamed Ramzi Haddad, Hajer Baazaoui Zghal

Web, XML and Semi-structured Databases

Frontmatter
MatBase Constraint Sets Coherence and Minimality Enforcement Algorithms
Abstract
MatBase is a prototype data and knowledge base management system based on the Relational, Entity-Relationship, and (Elementary) Mathematical Data Models. The latter distinguishes itself especially by its rich panoply of 70 constraint types. These provide database and software application designers with the tools necessary for capturing and enforcing all business rules from any sub-universe of discourse, thus guaranteeing the plausibility of database instances. When dealing with such a wealth of constraint types, both incoherencies and redundancies are possible. As only coherent constraint sets are acceptable and minimal ones are desirable, this paper proposes fast table-driven algorithms for assisting the enforcement of coherence and guaranteeing the minimality of such constraint sets; the algorithms are implemented in the current MatBase versions. The paper also discusses their complexity and utility, both in the study of the semi-naïve algebra of sets, functions, and relations and of first-order predicate calculus with equality, as well as, especially, in data modeling, database constraint theory, advanced database management systems, database design, and database software application development.
Christian Mancas
Integration of Unsound Data in P2P Systems
Abstract
This paper is situated among the works on semantic peer data management systems. It proposes a different perspective, in which a peer of a P2P system may consider its own data to be more trustworthy than data provided by external peers (a sound peer) or less trustworthy (an unsound peer). The paper generalizes previous work of the same authors and proposes a new semantics capturing this simple idea.
Luciano Caroprese, Ester Zumpano
Backmatter
Metadata
Title
Advances in Databases and Information Systems
Edited by
Prof. András Benczúr
Prof. Dr. Bernhard Thalheim
Tomáš Horváth
Copyright year
2018
Electronic ISBN
978-3-319-98398-1
Print ISBN
978-3-319-98397-4
DOI
https://doi.org/10.1007/978-3-319-98398-1