
Advances in Databases and Information Systems

21st European Conference, ADBIS 2017, Nicosia, Cyprus, September 24-27, 2017, Proceedings

  • 2017
  • Book

About this book

This book constitutes the proceedings of the 21st European Conference on Advances in Databases and Information Systems, ADBIS 2017, held in Nicosia, Cyprus, in September 2017.
The 26 regular papers presented together with one keynote paper and one keynote abstract were carefully reviewed and selected from numerous submissions. The papers are organized in topical sections such as conceptual modeling and human factors; subsequence matching and streaming data; OLAP; graph databases; spatial data management; parallel and distributed data processing; query optimization, recovery, and databases on modern hardware; semantic data processing; and additional database and information systems topics.

Table of Contents

Frontmatter

ADBIS 2017 - Keynote Papers

Frontmatter
Toward Model-Based Big Data-as-a-Service: The TOREADOR Approach
Abstract
The full potential of Big Data Analytics (BDA) can be unleashed only by overcoming hurdles like the high architectural complexity and lack of transparency of Big Data toolkits, as well as the high cost and lack of legal clearance of data collection, access and processing procedures. We first discuss the notion of Big Data Analytics-as-a-Service (BDAaaS) to help potential users of BDA in overcoming such hurdles. We then present TOREADOR, a first approach to BDAaaS.
Ernesto Damiani, Claudio Ardagna, Paolo Ceravolo, Nello Scarabottolo

Conceptual Modeling and Human Factors

Frontmatter
General and Specific Model Notions
Abstract
Models are a universal and widely used instrument in Computer Science and Computer Engineering, and there is a large variety of notions of models. A model functions as an instrument in a utilisation scenario. It is well-formed, adequate and dependable, and it represents or acts as a deputy for its origins. This conception of the model is a very general one. Taking the notion of a stereotype as a starting point, we show that specific or particular model notions are specialisations of this general notion.
Bernhard Thalheim
“Is It a Fleet or a Collection of Ships?”: Ontological Anti-patterns in the Modeling of Part-Whole Relations
Abstract
Over the years, there has been a growing interest in employing theories from philosophical ontology, cognitive science and linguistics to devise theoretical, methodological and computational tools for information systems engineering in general, and for conceptual modeling in particular. In this paper, we discuss one particular kind of such tools, namely, ontological anti-patterns. Ontological anti-patterns are error-prone modeling structures that can create a deviation between the possible and the intended interpretations of a model. In this paper, we present two empirically elicited ontological anti-patterns related to the modeling of part-whole relations. In particular, these anti-patterns identify possible mistakes in the modeling of collectives (complex entities that have a uniform role-based structure) and functional complexes (complex entities composed of functional parts). Besides identifying these anti-patterns, the paper presents a series of rectification plans that can be used to eliminate their occurrence in models. Finally, we present a model-based computational tool that supports the automated detection, analysis and elimination of these anti-patterns.
Tiago Prince Sales, Giancarlo Guizzardi
Context-Aware Decision Information Packages: An Approach to Human-Centric Smart Factories
Abstract
Industry 4.0 enables the integration of new trends, such as data-intensive cyber-physical systems, the Internet of Things, or mobile applications, into production environments. Although it concentrates on highly data-intensive, automated engineering and manufacturing processes, the human actor remains important for decision making in the product lifecycle process. To support correct and efficient decision making, human actors have to be provided with relevant data depending on the current context. This data needs to be retrieved from distributed sources such as bill-of-material systems, product data management systems and manufacturing execution systems, which hold the product model and the factory model. In this paper, we address this issue by introducing the concept of decision information packages, which make it possible to compose relevant engineering data for a specific context from distributed data sources. To determine relevant data, we specify a context-aware engineering data model and corresponding operators. To realize our approach, we provide an architecture and a prototypical implementation based on the requirements of a real-world scenario.
Eva Hoos, Pascal Hirmer, Bernhard Mitschang

Subsequence Matching and Streaming Data

Frontmatter
Fast Subsequence Matching in Motion Capture Data
Abstract
Motion capture data digitally represent human movements by sequences of body configurations in time. Subsequence matching in such spatio-temporal data is difficult, as query-relevant motions can vary in length and occur arbitrarily within a very long motion. To deal with these problems, we propose a new subsequence matching approach which (1) partitions both the short query and the long data motion into fixed-size segments that overlap only partly, (2) uses an effective similarity measure to efficiently retrieve the data segments most similar to the query segments, and (3) localizes the most query-relevant subsequences within extended and merged retrieved segments in a four-step postprocessing phase. The whole retrieval process is effective and fast in comparison with related work. A real-life 68-minute data motion can be searched in about 1 s with an average precision of 87.98% for 5-NN queries.
Jan Sedmidubsky, Pavel Zezula, Jan Svec
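A minimal sketch of step (1) from the abstract above: partitioning a motion (query or data) into fixed-size, partly overlapping segments. The segment size and shift below are hypothetical example values, not the paper's settings.

    def partition_into_segments(frames, segment_size=128, shift=96):
        """Return (start_index, segment) pairs; consecutive segments share
        segment_size - shift frames."""
        last_start = max(len(frames) - segment_size, 0)
        starts = list(range(0, last_start + 1, shift))
        if starts[-1] + segment_size < len(frames):
            starts.append(len(frames) - segment_size)  # cover the tail as well
        return [(s, frames[s:s + segment_size]) for s in starts]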
Interactive Time Series Subsequence Matching
Abstract
We develop a highly efficient access method, called Delta-Top-Index, to answer top-k subsequence matching queries over a time series data set. Compared to a naïve implementation, our index has a storage cost that is up to two orders of magnitude smaller, while providing answers within microseconds. We demonstrate the efficiency and effectiveness of our technique in an experimental evaluation with real-world data.
Danila Piatov, Sven Helmer, Johann Gamper
Generating Fixed-Size Training Sets for Large and Streaming Datasets
Abstract
The k Nearest Neighbor classifier is popular and versatile but requires a relatively small training set in order to perform adequately, a prerequisite that cannot be satisfied with the large volumes of training data nowadays available from streaming environments. Conventional Data Reduction Techniques that select or generate training prototypes are also inappropriate in such environments. Dynamic RHC (dRHC) is a prototype generation algorithm that can update its condensing set when new training data arrives. However, after repetitive updates, the size of the condensing set may become unpredictably large. This paper proposes dRHC2, a new variation of dRHC, which remedies this drawback. dRHC2 keeps the size of the condensing set at a convenient level that remains manageable by the classifier, by ranking the prototypes and removing the least important ones. dRHC2 is tested on several datasets, and the experimental results reveal that it is more efficient and noise-tolerant than dRHC while being comparable to dRHC in terms of accuracy.
Stefanos Ougiaroglou, Georgios Arampatzis, Dimitris A. Dervos, Georgios Evangelidis
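A minimal sketch, under assumptions, of the pruning idea behind dRHC2: when the condensing set exceeds a fixed budget, prototypes are ranked and the least important ones are dropped. The ranking weight used here (the number of training items a prototype represents) is an illustrative assumption, not necessarily the authors' exact criterion.

    def cap_condensing_set(prototypes, max_size):
        """prototypes: list of (vector, class_label, weight) tuples."""
        if len(prototypes) <= max_size:
            return prototypes
        ranked = sorted(prototypes, key=lambda p: p[2], reverse=True)
        return ranked[:max_size]  # keep only the most representative prototypes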

OLAP

Frontmatter
Detecting User Focus in OLAP Analyses
Abstract
In this paper, we propose an approach to automatically detect focused portions of data cube explorations by using different features of OLAP queries. While such a concept of focused interaction is relevant to many domains besides OLAP explorations, like web search or interactive database exploration, there is currently no precise, commonly agreed formal definition. This concept of a focus phase is drawn from Exploratory Search, a paradigm that theorizes search as a complex interaction between a user and a system. The interaction consists of two different phases: an exploratory phase where the user progressively defines her information need, and a focused phase where she investigates precise facts in detail and learns from this investigation. Following this model, our work is, to the best of our knowledge, the first to propose a formal feature-based description of a focused query in the context of interactive data exploration. Our experiments show that we manage to identify focused queries in real navigations, and that our model is sufficiently robust and general to be applied to different OLAP navigation datasets.
Mahfoud Djedaini, Nicolas Labroche, Patrick Marcel, Verónika Peralta
Sparse Prefix Sums
Abstract
The prefix sum approach is a powerful technique to answer range-sum queries over multi-dimensional arrays in constant time by requiring only a few look-ups in an array of precomputed prefix sums. In this paper, we propose the sparse prefix sum approach that is based on relative prefix sums and exploits sparsity in the data to vastly reduce the storage costs for the prefix sums. The proposed approach has desirable theoretical properties and works well in practice. It is the first approach achieving constant query time with sub-linear update costs and storage costs for range-sum queries over sparse low-dimensional arrays. Experiments on real-world data sets show that the approach reduces storage costs by an order of magnitude with only a small overhead in query time, thus preserving microsecond-fast query answering.
Michael Shekelyan, Anton Dignös, Johann Gamper
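For orientation, a sketch of the classic dense two-dimensional prefix-sum baseline that the sparse approach improves upon: after precomputation, any range sum is answered in constant time with four look-ups. This is not the sparse structure proposed in the paper.

    def build_prefix_sums(grid):
        """Precompute p[i][j] = sum of grid[0..i-1][0..j-1]."""
        rows, cols = len(grid), len(grid[0])
        p = [[0] * (cols + 1) for _ in range(rows + 1)]
        for i in range(rows):
            for j in range(cols):
                p[i + 1][j + 1] = (grid[i][j] + p[i][j + 1]
                                   + p[i + 1][j] - p[i][j])
        return p

    def range_sum(p, r1, c1, r2, c2):
        """Sum of grid[r1..r2][c1..c2] in O(1) via inclusion-exclusion."""
        return (p[r2 + 1][c2 + 1] - p[r1][c2 + 1]
                - p[r2 + 1][c1] + p[r1][c1])

    grid = [[0, 3, 0], [2, 0, 0], [0, 0, 5]]
    p = build_prefix_sums(grid)
    print(range_sum(p, 0, 0, 1, 1))  # prints 5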
Targeted Feedback Collection Applied to Multi-Criteria Source Selection
Abstract
A multi-criteria source selection (MCSS) scenario identifies, from a set of candidate data sources, the subset that best meets a user's needs. These needs are expressed using several criteria, which are used to evaluate the candidate data sources. An MCSS problem can be solved using multi-dimensional optimisation techniques that trade off the different objectives. Sometimes we may have uncertain knowledge regarding how well the candidate data sources meet the criteria. To overcome this uncertainty, we may rely on end users or crowds to annotate the data items produced by the sources in relation to the selection criteria. In this paper, we introduce an approach called Targeted Feedback Collection (TFC), which aims to identify those data items on which feedback should be collected, thereby providing evidence on how well the sources satisfy the required criteria. TFC targets feedback by considering the confidence intervals around the estimated criteria values. The TFC strategy has been evaluated, with promising results, against other approaches to feedback collection, including active learning, using real-world data sets.
Julio César Cortés Ríos, Norman W. Paton, Alvaro A. A. Fernandes, Edward Abel, John A. Keane
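A hedged sketch of targeting feedback by confidence-interval width: the next annotation is requested for the source whose criterion estimate is least certain. The normal-approximation interval and the dictionary layout below are illustrative assumptions, not the paper's exact strategy.

    import math

    def ci_width(positives, total, z=1.96):
        """Width of a simple normal-approximation confidence interval."""
        if total == 0:
            return float("inf")  # nothing annotated yet: maximal uncertainty
        p = positives / total
        return 2 * z * math.sqrt(p * (1 - p) / total)

    def next_source_to_annotate(feedback):
        """feedback: {source: (positives, total)} -> source with the widest interval."""
        return max(feedback, key=lambda s: ci_width(*feedback[s]))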

Graph Databases

Frontmatter
Cost Model for Pregel on GraphX
Abstract
The graph partitioning strategy plays a vital role in the overall execution of an algorithm in a distributed graph processing system. Choosing the best strategy is very challenging, as no single strategy is always the best fit for all kinds of graphs or algorithms. In this paper, we help users choose a suitable partitioning strategy for algorithms based on the Pregel model by providing a cost model for the Pregel implementation in Spark-GraphX. The cost model captures the relationship between four major parameters: (1) the input graph, (2) the cluster configuration, (3) the algorithm properties, and (4) the partitioning strategy. We validate the accuracy of the cost model on 17 different combinations of input graph, algorithm, and partitioning strategy. As such, the cost model can serve as a basis for Pregel optimizers that are yet to be developed.
Rohit Kumar, Alberto Abelló, Toon Calders
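A hypothetical, highly simplified cost-function sketch in the spirit of the paper's model: the total cost of a Pregel run is a per-superstep sum of a computation term (active vertices spread over the workers) and a communication term (messages crossing partition boundaries). The parameter names and weights are assumptions for illustration only, not the paper's notation.

    def pregel_cost(active_vertices, cross_partition_msgs, num_workers,
                    compute_weight=1.0, msg_weight=2.0):
        """active_vertices[i], cross_partition_msgs[i]: counts in superstep i."""
        total = 0.0
        for active, msgs in zip(active_vertices, cross_partition_msgs):
            total += compute_weight * active / num_workers + msg_weight * msgs
        return total

    # Example: comparing two partitioning strategies by their estimated cost.
    # print(pregel_cost([100, 80, 20], [40, 25, 5], num_workers=4))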
Historical Traversals in Native Graph Databases
Abstract
Since most graph data, such as data from social, citation and computer networks, evolve over time, it is useful to be able to query their history. In this paper, we focus on supporting traversals of such graphs using a native graph database. We assume that we are given the history of an evolving graph as a sequence of graph snapshots representing the state of the graph at different time instances. We introduce models for storing such snapshots in the graph database and propose algorithms for supporting various types of historical reachability and shortest path queries. Finally, we experimentally evaluate and compare the various models and algorithms using both real and synthetic datasets.
Konstantinos Semertzidis, Evaggelia Pitoura
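One possible storage model, sketched under assumptions: each edge records the snapshots in which it exists, and a historical reachability query runs a per-snapshot BFS over the queried interval. This illustrates the kind of query supported; it is not the paper's concrete model or algorithm.

    from collections import defaultdict, deque

    class HistoricalGraph:
        def __init__(self):
            self.adj = defaultdict(list)  # u -> [(v, set_of_snapshot_ids)]

        def add_edge(self, u, v, snapshots):
            self.adj[u].append((v, set(snapshots)))

        def reachable_sometime(self, src, dst, interval):
            """True if dst is reachable from src in at least one snapshot of interval."""
            for t in interval:
                seen, queue = {src}, deque([src])
                while queue:
                    u = queue.popleft()
                    if u == dst:
                        return True
                    for v, snaps in self.adj[u]:
                        if t in snaps and v not in seen:
                            seen.add(v)
                            queue.append(v)
            return False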
Formalising openCypher Graph Queries in Relational Algebra
Abstract
Graph database systems are increasingly adopted for storing and processing heterogeneous, network-like datasets. However, due to the novelty of such systems, no standard data model or query language has yet emerged. Consequently, migrating datasets or applications even between related technologies often requires a large amount of manual work or ad-hoc solutions, exposing users to the risk of vendor lock-in. To avoid this threat, vendors are working on supporting existing standard languages (e.g. SQL) or on standardising new languages.
In this paper, we present a formal specification for openCypher, a high-level declarative graph query language with an ongoing standardisation effort. We introduce relational graph algebra, which extends the relational operators with graph-specific operators, and we define a mapping from core openCypher constructs to this algebra. We also propose an algorithm that allows systematic compilation of openCypher queries.
József Marton, Gábor Szárnyas, Dániel Varró

Spatial Data Management

Frontmatter
SliceNBound: Solving Closest Pairs and Distance Join Queries in Apache Spark
Abstract
The (K) Closest Pair(s) Query, KCPQ, consists of finding the (K) closest pair(s) of objects between two spatial datasets. Recently, several systems that enhance Apache Spark with spatial awareness have been presented, providing a variety of queries for spatial computation, but not the KCPQ. Since queries are of different natures and one processing technique does not fit all cases, we need specialized algorithms for specific queries that exploit the power provided by parallel systems such as Apache Spark. This paper addresses the problem of answering the KCPQ in Apache Spark by presenting such a specialized, fast algorithm that can easily be incorporated into any Spark-based system, whether spatially oriented or general-purpose. Furthermore, it presents a variant of this algorithm that solves the Distance Join Query. Experiments and comparisons with other solutions indicate that our method is fast and efficient.
George Mavrommatis, Panagiotis Moutafis, Michael Vassilakopoulos, Francisco García-García, Antonio Corral
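A minimal local KCPQ kernel, assuming points are 2D tuples: a bounded max-heap keeps the K best pairs while candidate pairs are scanned. In a Spark setting such a kernel would run per pair of partitions and the partial results would be merged; the slicing and bounding logic of the paper is omitted here.

    import heapq

    def kcpq(points_p, points_q, k):
        heap = []  # max-heap via negated squared distances: (-d2, p, q)
        for p in points_p:
            for q in points_q:
                d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
                if len(heap) < k:
                    heapq.heappush(heap, (-d2, p, q))
                elif -heap[0][0] > d2:
                    heapq.heapreplace(heap, (-d2, p, q))
        return sorted(((-nd2) ** 0.5, p, q) for nd2, p, q in heap)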
A Comparison of Distributed Spatial Data Management Systems for Processing Distance Join Queries
Abstract
Due to the ubiquitous use of spatial data applications and the large amounts of spatial data that these applications generate, the processing of large-scale distance joins in distributed systems is becoming increasingly popular. Two of the most studied distance join queries are the K Closest Pairs Query (KCPQ) and the ε Distance Join Query (εDJQ). The KCPQ finds the K closest pairs of points from two datasets, and the εDJQ finds all possible pairs of points from two datasets that are within a distance threshold ε of each other. Distributed cluster-based computing systems can be classified into Hadoop-based and Spark-based systems. Based on this classification, in this paper we compare two of the most current and leading distributed spatial data management systems, namely SpatialHadoop and LocationSpark, by evaluating the performance of existing and newly proposed parallel and distributed distance join query algorithms in different situations with big real-world datasets. As a general conclusion, while SpatialHadoop is the more mature and robust system, LocationSpark is the winner with respect to total execution time.
Francisco García-García, Antonio Corral, Luis Iribarne, George Mavrommatis, Michael Vassilakopoulos
A Generic and Efficient Framework for Spatial Indexing on Flash-Based Solid State Drives
Abstract
Speeding up spatial query processing on flash-based Solid State Drives (SSDs) has become a core problem in spatial database applications and has been pursued with the help of flash-aware spatial indices. Although some flash-aware spatial indices exist, they do not exploit all the benefits of SSDs, leading to a loss of efficiency. In this paper, we propose a new generic and efficient Framework for spatial INDexing on SSDs, called eFIND. It takes into account all the intrinsic characteristics of SSDs by employing (i) a write buffer to avoid random writes; (ii) a read buffer to decrease the overhead of random reads; (iii) a temporal control to avoid interleaved reads and writes; (iv) a flushing policy based on the characteristics of the indexed spatial objects; and (v) a log-structured approach to provide data durability. Performance tests showed that eFIND is very efficient: compared to existing indices, it improved the construction of spatial indices by 22% to 68% and spatial query processing by 3% to 50%.
Anderson Chaves Carniel, Ricardo Rodrigues Ciferri, Cristina Dutra de Aguiar Ciferri
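A sketch of component (i), the write buffer, under simplifying assumptions: index modifications are buffered in RAM and flushed to the SSD in batches, avoiding small random writes. The flushing policy shown (flush the node with the most buffered modifications) is a stand-in, not eFIND's actual policy, and flush_fn is a hypothetical callback.

    class WriteBuffer:
        def __init__(self, capacity):
            self.capacity = capacity
            self.pending = {}  # node_id -> list of buffered modifications

        def add(self, node_id, modification, flush_fn):
            self.pending.setdefault(node_id, []).append(modification)
            if sum(len(mods) for mods in self.pending.values()) >= self.capacity:
                # flush the node with the most buffered modifications in one batch
                victim = max(self.pending, key=lambda n: len(self.pending[n]))
                flush_fn(victim, self.pending.pop(victim))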

Parallel and Distributed Data Processing

Frontmatter
Incremental Frequent Itemsets Mining with MapReduce
Abstract
Frequent itemsets mining is a common task in data mining. Since the sizes of today's databases go far beyond the capabilities of a single machine, recent studies show how to adapt classical frequent itemsets mining algorithms to parallel frameworks such as MapReduce. Even then, a slight database update requires re-running the MapReduce mining algorithm from scratch on the whole data set, which can be far from optimal. Thus, a variation of these algorithms that supports incremental database updates is desired.
This paper presents a general algorithm for incremental frequent itemsets mining and shows how to adapt it to the parallel paradigm. It also provides optimizations, unique to a constrained model of MapReduce, that make the algorithm effective.
Kirill Kandalov, Ehud Gudes
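A hedged sketch of the incremental idea: support counts from the already-mined database are kept, only the newly arrived transactions are counted with a MapReduce-style map/reduce pair, and the counts are merged. Candidate generation and pruning are simplified away, and the itemset size is capped for brevity.

    from collections import Counter
    from itertools import combinations

    def map_transactions(transactions, max_len=2):
        """Map phase: emit (itemset, 1) for every small itemset in each transaction."""
        for t in transactions:
            for size in range(1, max_len + 1):
                for itemset in combinations(sorted(t), size):
                    yield itemset, 1

    def reduce_counts(pairs):
        """Reduce phase: sum the emitted counts per itemset."""
        counts = Counter()
        for itemset, c in pairs:
            counts[itemset] += c
        return counts

    def incremental_update(old_counts, delta_transactions, min_support, total_size):
        """Merge counts of the delta with the stored counts and re-derive frequent itemsets."""
        new_counts = old_counts + reduce_counts(map_transactions(delta_transactions))
        frequent = {i: c for i, c in new_counts.items() if c / total_size >= min_support}
        return new_counts, frequent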
Towards High Similarity Search Throughput by Dynamic Query Reordering and Parallel Processing
Abstract
The current era of digital data explosion calls for the employment of content-based similarity search techniques, since traditional searchable metadata like annotations are not always available. In our work, we focus on a scenario where similarity search is used in the context of stream processing, which is one of the suitable approaches for dealing with huge amounts of data. Our goal is to maximize the throughput of processed queries while accepting a slight delay. We extend our previously published technique that dynamically reorders the incoming queries in order to use our caching mechanism more effectively. The extension lies in the adoption of a parallel computing environment, which allows us to process multiple queries simultaneously.
Filip Nalepa, Michal Batko, Pavel Zezula
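An illustrative sketch of the reordering idea, under assumptions: instead of strict FIFO, the next query processed is the one with the highest affinity to the objects currently held in the cache, so cached computations can be reused. The affinity measure below is a deliberately simple stand-in, not the paper's mechanism.

    def pick_next_query(pending, cached_objects, distance):
        """Pop and return the pending query closest to any cached object."""
        def affinity(query):
            if not cached_objects:
                return float("inf")
            return min(distance(query, c) for c in cached_objects)
        best = min(pending, key=affinity)
        pending.remove(best)
        return best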
Comparative Evaluation of Distributed Clustering Schemes for Multi-source Entity Resolution
Abstract
Entity resolution identifies semantically equivalent entities, e.g., those describing the same product or customer. It is especially challenging for big data applications where large volumes of data from many sources have to be matched and integrated. Entity resolution for multiple data sources is best addressed by clustering schemes that group all matching entities within clusters. While there are many possible clustering schemes for entity resolution, their relative suitability and scalability are still unclear. We therefore implemented and comparatively evaluated distributed versions of six clustering schemes based on Apache Flink within a new entity resolution framework called Famer. Our evaluation on different real-life and synthetically generated datasets considers both match quality and scalability for different numbers of machines and data sizes.
Alieh Saeedi, Eric Peukert, Erhard Rahm
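One of the simplest clustering schemes typically considered for multi-source entity resolution is connected components over the similarity graph: every matching pair links the clusters of its two entities. Below is a local union-find sketch of that scheme; in Famer such schemes run distributed on Apache Flink.

    def connected_components(entities, match_pairs):
        parent = {e: e for e in entities}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        for a, b in match_pairs:
            parent[find(a)] = find(b)          # union the two clusters

        clusters = {}
        for e in entities:
            clusters.setdefault(find(e), []).append(e)
        return list(clusters.values())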

Query Optimization, Recovery, and Databases on Modern Hardware

Frontmatter
Cost-Function Complexity Matters: When Does Parallel Dynamic Programming Pay Off for Join-Order Optimization
Abstract
The execution time of queries can vary by several orders of magnitude depending on the join order. Hence, efficient query execution can be ensured by determining optimal join orders. Dynamic programming determines optimal join orders efficiently. Unfortunately, its runtime depends on the characteristics of the query, limiting its applicability to simple optimization problems. To extend the applicability, different parallelization strategies have been proposed. Although existing parallelization strategies showed benefits for complex cost functions, the effect of cost-function complexity has not been evaluated.
Therefore, in this paper, we compare different sequential and parallel dynamic programming variants with respect to different query characteristics and cost-function complexities. We show that parallelizing dynamic programming most often pays off only for complex cost functions. For simple cost functions, we show that the sequential variants are most often superior to their parallel counterparts.
Andreas Meister, Gunter Saake
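To illustrate why cost-function complexity matters, here is a textbook-style dynamic programming enumeration over relation subsets with a pluggable cost function: the cost function is invoked an exponential number of times, so its complexity dominates optimization time. This is a generic sketch, not one of the sequential or parallel variants compared in the paper.

    import math

    def dp_join_order(rel_sizes, cost_fn):
        """Return (cost, plan) of the cheapest join order over all relations."""
        n = len(rel_sizes)
        best = {1 << i: (0.0, str(i)) for i in range(n)}  # single-relation plans
        for subset in range(1, 1 << n):
            if subset in best:
                continue
            split = (subset - 1) & subset
            while split:                       # enumerate proper splits of the subset
                rest = subset ^ split
                cost = (best[split][0] + best[rest][0]
                        + cost_fn(split, rest, rel_sizes))
                if subset not in best or cost < best[subset][0]:
                    best[subset] = (cost, "(%s ⋈ %s)" % (best[split][1], best[rest][1]))
                split = (split - 1) & subset
        return best[(1 << n) - 1]

    def cardinality(mask, rel_sizes):
        return math.prod(s for i, s in enumerate(rel_sizes) if mask >> i & 1)

    # A deliberately simple cost function; a complex one would make each of the
    # exponentially many calls above far more expensive.
    def simple_cost(left, right, rel_sizes):
        return cardinality(left, rel_sizes) * cardinality(right, rel_sizes)

    # print(dp_join_order([1000, 10, 100], simple_cost))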
Instant Restore After a Media Failure
Abstract
Media failures usually leave database systems unavailable for several hours until recovery is complete, especially in applications with large devices and high transaction volume. Previous work introduced a technique called single-pass restore, which increases restore bandwidth and thus substantially decreases time to repair. Instant restore goes further as it permits read/write access to any data on a device undergoing restore—even data not yet restored—by restoring individual data segments on demand. Thus, the restore process is guided primarily by the needs of applications, and the observed mean time to repair is effectively reduced from several hours to a few seconds.
This paper presents an implementation and evaluation of instant restore. The technique is incrementally implemented on a system starting with the traditional ARIES design for logging and recovery. Experiments show that the transaction latency perceived after a media failure can be cut down to less than a second. The net effect is that a few “nines” of availability are added to the system using simple and low-overhead software techniques.
Caetano Sauer, Goetz Graefe, Theo Härder
Rethinking DRAM Caching for LSMs in an NVRAM Environment
Abstract
The rise of NVRAM technologies promises to change the way we think about system architectures. In order to fully exploit their advantages, systems specially tailored for NVRAM devices need to be developed. Not only does this impose great challenges, but developing full system architectures from scratch is also undesirable in many scenarios due to prohibitive development costs. Instead, we analyze in this paper the behavior of an existing log-structured persistent key-value store, namely LevelDB, when run on top of an emulated NVRAM device. We investigate initial opportunities for improvement when adapting a system tailored for HDDs/SSDs to an NVRAM environment. Furthermore, we analyze the behavior of the DRAM caching components of LevelDB and whether more suitable caching policies are required.
Lucas Lersch, Ismail Oukid, Ivan Schreter, Wolfgang Lehner

Semantic Data Processing

Frontmatter
SPARQL Query Containment with ShEx Constraints
Abstract
ShEx (Shape Expressions) is a language for expressing constraints on RDF graphs. We consider the problem of SPARQL query containment in the presence of ShEx constraints. We first propose a sound and complete procedure for the problem of containment with ShEx, considering several SPARQL fragments. In particular, our procedure handles OPTIONAL query patterns, which turn out to be an important fragment to study in combination with schemas. We then show the complexity bounds of our problem with respect to the fragments considered. To the best of our knowledge, this is the first work addressing SPARQL query containment in the presence of ShEx constraints.
Abdullah Abbas, Pierre Genevès, Cécile Roisin, Nabil Layaïda
Updating RDF/S Databases Under Constraints
Abstract
We address the issue of database updating in the presence of constraints, using the first-order logic formalism. Based on a chasing technique, we propose a deterministic update strategy. While generalizing key and foreign-key constraints, our approach satisfies the consistency and minimal-change requirements with polynomial time complexity. Our custom version of the chase allows improvements in updating non-null RDF/S databases with constraints.
Mirian Halfeld-Ferrari, Dominique Laurent

Additional Database and Information Systems Topics

Frontmatter
Migrating Web Archives from HTML4 to HTML5: A Block-Based Approach and Its Evaluation
Abstract
Web archives (and the Web itself) are likely to suffer from format obsolescence. In a few years or decades, future Web browsers will no longer be able to properly render Web pages written in the HTML4 format. We therefore propose a migration tool from HTML4 to HTML5. This is challenging, because it requires generating HTML5 semantic elements that do not exist in HTML4 pages. To solve this issue, we propose to use a Web page segmenter: the blocks generated by a segmenter are good candidates for becoming semantic elements, as both reflect the content structure of the page. We use an evaluation framework for Web page segmentation that helps define and compute relevant metrics to measure the quality of the migration process. We ran experiments on a sample of 40 pages and compared the migrated pages we produce to a ground truth. The automatic labeling of blocks is quite similar to the ground truth, though its quality depends on the type of page being migrated. When comparing the rendering of the original page and the rendering of its migrated version, we note some differences, mainly due to the fact that rendering engines do not (yet) properly render the content of semantic elements.
Andrés Sanoja, Stéphane Gançarski
A Tool for Design-Time Usability Evaluation of Web User Interfaces
Abstract
The diversity of smartphones and tablet computers has become an intrinsic part of modern life. Following usability guidelines while designing a web user interface (UI) is an essential requirement for every web application. Even a minor change in a UI can lead to usability problems; for example, changing the background or foreground colour of buttons could cause usability problems, especially for people with disabilities. Empirical evaluation methods such as questionnaires and Card Sorting are effective in finding such problems. Nevertheless, these methods cannot be used widely when time, money and evaluators are scarce. The purpose of our work is to deliver a tool for design-time automatic evaluation of UI conformance to category-specific usability guidelines. The main contribution of this solution is enabling immediate, cost-efficient and automatic web UI evaluation against available and established standards. The approach is being integrated into the Estonian eGovernment authority in order to automate the usability evaluation of web applications.
Jevgeni Marenkov, Tarmo Robal, Ahto Kalja
Genotypic Data in Relational Databases: Efficient Storage and Rapid Retrieval
Abstract
As technologies to produce genotypic data have become less expensive, the widths and depths of such data have sharply increased. Relational databases have performed poorly in this domain. Data storage and retrieval is now mostly conducted by highly coupled and specialized software packages and file formats, but relational databases offer advantages if the domain challenges can be overcome. We revisit their feasibility as a tool for efficiently storing and querying extremely large genotypic data sets. We describe a technique for managing genotypic data in the PostgreSQL relational database, compare it to common existing techniques for storing and querying genotypic data, and demonstrate that it can greatly reduce both query times and storage requirements.
Ryan N. Lichtenwalter, Katerina Zorina-Lichtenwalter, Luda Diatchenko
Backmatter
Title
Advances in Databases and Information Systems
Edited by
Mārīte Kirikova
Kjetil Nørvåg
George A. Papadopoulos
Copyright Year
2017
Electronic ISBN
978-3-319-66917-5
Print ISBN
978-3-319-66916-8
DOI
https://doi.org/10.1007/978-3-319-66917-5
