
2015 | Book

Data Science

30th British International Conference on Databases, BICOD 2015, Edinburgh, UK, July 6-8, 2015, Proceedings


About this book

This book constitutes the refereed conference proceedings of the 30th British International Conference on Databases, BICOD 2015 - formerly known as BNCOD (British National Conference on Databases) - held in Edinburgh, UK, in July 2015.
The 19 revised full papers, presented together with three invited keynotes and three invited lectures, were carefully reviewed and selected from 37 submissions. The special focus of the conference was "Data Science", and so the papers cover a wide range of topics related to databases and data-centric computation.

Table of Contents

Frontmatter

Invited Lectures

Frontmatter
Streaming Methods in Data Analysis
Abstract
A fundamental challenge in processing the massive quantities of information generated by modern applications is extracting suitable representations of the data that can be stored, manipulated and interrogated on a single machine. A promising approach is the design and analysis of compact summaries: data structures which capture key features of the data, and which can be created effectively over distributed, streaming data. Popular summary structures include count-distinct algorithms, which compactly approximate item-set cardinalities, and sketches, which allow vector norms and products to be estimated. These are very attractive, since they can be computed in parallel and combined to yield a single, compact summary of the data. This talk introduces the concepts and gives examples of compact summaries.
Graham Cormode
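The mergeability that makes such summaries attractive can be illustrated with a Count-Min sketch: sketches built over separate partitions of the data combine cell-wise into one summary of the whole. This is a minimal sketch in Python for illustration only, not code from the talk; the width/depth defaults and the hashing scheme are assumptions.

```python
import hashlib

class CountMinSketch:
    """Compact frequency summary: sketches built over separate data
    partitions can be merged cell-wise into a single summary."""

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # one hash bucket per row, made row-independent via a salt
        for row in range(self.depth):
            digest = hashlib.blake2b(item.encode(), digest_size=8,
                                     salt=row.to_bytes(8, "big")).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # each row over-counts due to collisions; the minimum over
        # rows is the tightest available upper bound on the true count
        return min(self.table[row][col] for row, col in self._buckets(item))

    def merge(self, other):
        # cell-wise addition: summarise partitions in parallel, combine once
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]
```

Two workers can each summarise their own stream and ship only the small tables; merging never loses counts, it can only over-estimate.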

Data Integration

Frontmatter
A Framework for Scalable Correlation of Spatio-temporal Event Data
Abstract
Spatio-temporal event data arise not only from sensor readings, but also in information retrieval and text analysis. However, events extracted from a text corpus may be imprecise in both dimensions. In this paper we focus on the task of event correlation, i.e., finding events that are similar in terms of space and time. We present a framework for Apache Spark that provides correlation operators which can be configured to deal with such imprecise event data.
Stefan Hagedorn, Kai-Uwe Sattler, Michael Gertz
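To make the correlation task concrete (this is an illustrative sketch, not the authors' Spark operators): two events can be declared correlated when they fall within configurable spatial and temporal thresholds. The event representation and the threshold values below are assumptions.

```python
import math
from datetime import datetime

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def correlated(e1, e2, max_km=50.0, max_days=7.0):
    """Events are (lat, lon, datetime) tuples; correlation here means
    similar in space AND time, with both tolerances configurable."""
    close_in_space = haversine_km(e1[0], e1[1], e2[0], e2[1]) <= max_km
    close_in_time = abs((e1[2] - e2[2]).total_seconds()) <= max_days * 86400
    return close_in_space and close_in_time
```

Widening `max_km` and `max_days` is one simple way to accommodate the positional and temporal imprecision of events extracted from text.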
Towards More Data-Aware Application Integration
Abstract
Although most business application data is stored in relational databases, the programming languages and wire formats in integration middleware systems are not table-centric. Because of costly format conversions and data shipments, and the promise of faster computation, the trend is to "push down" integration operations closer to the storage representation. We address the alternative case of defining declarative, table-centric integration semantics within standard integration systems. For that, we replace the current operator implementations of the well-known Enterprise Integration Patterns by equivalent "in-memory" table processing, and show a practical realization in a conventional integration system for a non-reliable, "data-intensive" messaging example. The results of the runtime analysis show that table-centric processing is already promising for standard, "single-record" message routing and transformations, and can potentially exceed the message throughput for "multi-record" table messages.
Daniel Ritter
Applying NoSQL Databases for Integrating Web Educational Stores - An Ontology-Based Approach
Abstract
Educational content available on the web plays an important role in the teaching and learning process. Learners search for different types of learning objects, such as videos, pictures, and blog articles, and use them to understand concepts they are studying in books and articles. The available search platforms can be frustrating to use: either they are not designed for educational purposes, or they are provided as a service by a library or repository for searching a limited dataset of educational content. This paper presents a novel system for automatic harvesting and connecting of medical educational objects based on biomedical ontologies. The challenge in this work is to transform disjoint, heterogeneous web database entries into one coherent linked dataset. First, harvesting APIs were developed for collecting content from various web sources such as YouTube, blogging platforms, and the PubMed library. Then, the system maps the entries into one data model and annotates the content using biomedical ontologies to enable its linkage. The resulting dataset is organized in a proposed NoSQL RDF triple store which consists of 2720 entries of articles, videos, and blogs. We tested the system using different ontologies for enriching its content, such as MeSH and SNOMED CT, and compared the results obtained. Using SNOMED CT doubled the number of linkages built between the dataset entries. Experiments on querying the dataset were conducted, and the results are promising compared with simple text-based search.
Reem Qadan Al Fayez, Mike Joy
Implementing Peer-to-Peer Semantic Integration of Linked Data
Abstract
The World Wide Web has expanded from a network of hyper-linked documents to a more complex structure where both documents and data are easily published, consumed and reused. Ideally, users should be able to access this information as a single, global data space. However, Linked Data on the Web is highly heterogeneous: different datasets may describe overlapping domains, using different approaches to data modelling and naming. A single global ontological conceptualisation is impracticable, and instead a more extensible approach is needed for semantic integration of heterogeneous Linked Data sets into a global data space.
Mirko M. Dimartino, Andrea Calì, Alexandra Poulovassilis, Peter T. Wood

Graph Data

Frontmatter
Virtual Network Mapping: A Graph Pattern Matching Approach
Abstract
Virtual network mapping (\(\mathsf {VNM}\)) is to build a network on demand by deploying virtual machines in a substrate network, subject to constraints on capacity, bandwidth and latency. It is critical to data centers for coping with dynamic cloud workloads. This paper shows that \(\mathsf {VNM}\) can be approached by graph pattern matching, a well-studied database topic. (1) We propose to model a virtual network request as a graph pattern carrying various constraints, and treat a substrate network as a graph in which nodes and edges bear attributes specifying their capacity. (2) We show that a variety of mapping requirements can be expressed in this model, such as virtual machine placement, network embedding and priority mapping. (3) In this model, we formulate \(\mathsf {VNM}\) and its optimization problem with a mapping cost function. We establish complexity bounds of these problems for various mapping constraints, ranging from PTIME to NP-complete. For intractable optimization problems, we further show that these problems are approximation-hard, i.e., NPO-complete in general and APX-hard even for special cases.
Yang Cao, Wenfei Fan, Shuai Ma
A Fast Approach for Detecting Overlapping Communities in Social Networks Based on Game Theory
Abstract
Community detection, a fundamental task in social network analysis, aims to identify groups of nodes in a network such that nodes within a group are much more connected to each other than to the rest of the network. Cooperative and non-cooperative game theory have previously been used separately for detecting communities. In this paper, we develop a new approach that utilizes both. The individuals in a social network are modelled as playing a cooperative game to achieve and improve group utilities, while at the same time playing a non-cooperative game to improve their individual utilities. By combining the cooperative and non-cooperative games, the utilities of groups and individuals can be taken into account simultaneously; thus the communities detected are more rational and the computational cost is decreased. The experimental results on synthetic and real networks show that our algorithm can quickly detect overlapping communities.
Lihua Zhou, Peizhong Yang, Kevin Lü, Lizhen Wang, Hongmei Chen
Consistent RDF Updates with Correct Dense Deltas
Abstract
RDF is widely used in the Semantic Web for representing ontology data. Many real world RDF collections are large and contain complex graph relationships that represent knowledge in a particular domain. Such large RDF collections evolve in consequence of their representation of the changing world. Although this data may be distributed over the Internet, it needs to be managed and updated in the face of such evolutionary changes. In view of the size of typical collections, it is important to derive efficient ways of propagating updates to distributed data stores. The contribution of this paper is a detailed analysis of the performance of RDF change detection techniques. In addition the work describes a new approach to maintaining the consistency of RDF by using knowledge embedded in the structure to generate efficient update transactions. The evaluation of this approach indicates that it reduces the overall update size at the cost of increasing the processing time needed to generate the transactions.
Sana Al Azwari, John N. Wilson
Query-Oriented Summarization of RDF Graphs
Abstract
The Resource Description Framework (RDF) is the W3C’s graph data model for Semantic Web applications. We study the problem of RDF graph summarization: given an input RDF graph \(\mathtt {G}\), find an RDF graph \(\mathtt {S}_\mathtt {G}\) which summarizes \(\mathtt {G}\) as accurately as possible, while being possibly orders of magnitude smaller than the original graph. Our approach is query-oriented, i.e., querying a summary of a graph should reflect whether the query has some answers against this graph. The summaries are aimed as a help for query formulation and optimization. We introduce two summaries: a baseline which is compact and simple and satisfies certain accuracy and representativeness properties, but may oversimplify the RDF graph, and a refined one which trades some of these properties for more accuracy in representing the structure.
Šejla Čebirić, François Goasdoué, Ioana Manolescu

Data Exploration

Frontmatter
ReX: Extrapolating Relational Data in a Representative Way
Abstract
Generating synthetic data is useful in multiple application areas (e.g., database testing, software testing). Nevertheless, existing synthetic data generators generally lack the mechanisms needed to produce realistic data, unless a complex set of inputs is given by the user, such as the characteristics of the desired data. An automated and efficient technique is needed for generating realistic data. In this paper, we propose ReX, a novel extrapolation system targeting relational databases that aims to produce a representative extrapolated database given an original one and a natural scaling rate. Furthermore, we evaluate our system in comparison with an existing realistic scaling method, UpSizeR, by measuring the representativeness of the extrapolated database with respect to the original one, the accuracy for approximate query answering, the database size, and their performance. Results show that our solution significantly outperforms the compared method in all considered dimensions.
Teodora Sandra Buda, Thomas Cerqueus, John Murphy, Morten Kristiansen
Evaluation Measures for Event Detection Techniques on Twitter Data Streams
Abstract
Twitter’s popularity as a source of up-to-date news and information is constantly increasing. In response to this trend, numerous event detection techniques have been proposed to cope with the rate and volume of social media data streams. Although most of these works conduct some evaluation of the proposed technique, a comparative study is often omitted. In this paper, we present a series of measures that we designed to support the quantitative and qualitative comparison of event detection techniques. In order to demonstrate the effectiveness of these measures, we apply them to state-of-the-art event detection techniques as well as baseline approaches using real-world Twitter streaming data.
Andreas Weiler, Michael Grossniklaus, Marc H. Scholl
A Framework for Selecting Deep Learning Hyper-parameters
Abstract
Recent research has found that deep learning architectures show significant improvements over traditional shallow algorithms when mining high dimensional datasets. When the choice of algorithm employed, hyper-parameter setting, number of hidden layers and nodes within a layer are combined, the identification of an optimal configuration can be a lengthy process. Our work provides a framework for building deep learning architectures via a stepwise approach, together with an evaluation methodology to quickly identify poorly performing architectural configurations. Using a dataset with high dimensionality, we illustrate how different architectures perform and how one algorithm configuration can provide input for fine-tuning more complex models.
Jim O’ Donoghue, Mark Roantree
Using Virtual Meeting Structure to Support Summarisation
Abstract
Archiving meeting transcripts in databases is not always efficient. Users need to be able to catch up with past meetings quickly, so it is unproductive to read the full meeting transcript from scratch. A summarisation of the meeting transcript is preferable, but the lack of meeting structure may lead to missing information. Therefore, we have introduced a virtual meeting system characterised by features that give the meeting session structure, together with a summarisation system that applies a TextRank approach to the structured meeting transcripts. The agenda with timed items guides the conversation; thus item delineation and titles can be considered the key characteristics of a valuable summary. Results show that combining an extractive summarisation technique with meeting structure leads to a relevant summary.
Antonios G. Nanos, Anne E. James, Rahat Iqbal, Yih-ling Hedley
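The TextRank step mentioned above ranks sentences by running PageRank over a sentence-similarity graph and keeps the highest-scoring ones. The following is a minimal pure-Python sketch of that idea, not the authors' system; the similarity measure and damping factor are the standard TextRank defaults, and all function names are assumptions.

```python
import math

def similarity(s1, s2):
    """Word-overlap similarity, normalised by sentence lengths."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / (math.log(len(w1) + 1) + math.log(len(w2) + 1))

def textrank_summary(sentences, k=2, d=0.85, iterations=50):
    """Scores sentences by power-iteration PageRank over the
    similarity graph; returns the top k in their original order."""
    n = len(sentences)
    sim = [[similarity(a, b) if i != j else 0.0
            for j, b in enumerate(sentences)]
           for i, a in enumerate(sentences)]
    totals = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - d) / n + d * sum(
                      sim[j][i] / totals[j] * scores[j]
                      for j in range(n) if totals[j] > 0)
                  for i in range(n)]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

In the paper's setting, restricting the graph to sentences within one agenda item (rather than the whole transcript) is what injects the meeting structure.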

NoSQL and Distributed Processing

Frontmatter
NotaQL Is Not a Query Language! It’s for Data Transformation on Wide-Column Stores
Abstract
It is simple to query a relational database because all columns of the tables are known and the language SQL is easily applicable. In NoSQL, there usually is no fixed schema and no query language. In this article, we present NotaQL, a data-transformation language for wide-column stores. NotaQL is easy to use and powerful. Many MapReduce algorithms like filtering, grouping, aggregation and even breadth-first search, PageRank and other graph and text algorithms can be expressed in two or three short lines of code.
Johannes Schildgen, Stefan Deßloch
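NotaQL's own syntax is not reproduced here; purely to illustrate the class of transformations it targets (mapping input cells to output cells, with grouping and aggregation over schemaless rows), here is a Python sketch over wide-column-style data. Every name and the example table are hypothetical.

```python
def transform(table, out_row, out_cols):
    """Map each input row to an output row key and a set of output
    cells, summing cells that collide (grouping + aggregation)."""
    result = {}
    for row_key, cells in table.items():
        target = out_row(row_key, cells)
        bucket = result.setdefault(target, {})
        for col, value in out_cols(row_key, cells).items():
            bucket[col] = bucket.get(col, 0) + value
    return result

# Group salaries by city; rows need not share the same columns.
people = {
    "alice": {"city": "Leeds", "salary": 30},
    "bob":   {"city": "York",  "salary": 20},
    "carol": {"city": "Leeds", "salary": 25},
}
by_city = transform(people,
                    out_row=lambda _key, cells: cells["city"],
                    out_cols=lambda _key, cells: {"total": cells["salary"]})
```

The two lambdas play the role of NotaQL's short row/cell mapping expressions: the transformation itself stays generic while the per-use logic fits on one line.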
NoSQL Approach to Large Scale Analysis of Persisted Streams
Abstract
A potential problem for persisting large volumes of streaming logs with conventional relational databases is that loading data logs produced at high rates is not fast enough, due to the strong consistency model and the high cost of indexing. As a possible alternative, state-of-the-art NoSQL data stores, which sacrifice transactional consistency to achieve higher performance and scalability, can be utilized. In this paper, we describe the challenges in large-scale persisting and analysis of numerical streaming logs. We propose to develop a benchmark comparing relational databases with state-of-the-art NoSQL data stores for persisting and analyzing numerical logs. The benchmark will investigate to what degree a state-of-the-art NoSQL data store can achieve high-performance persisting and large-scale analysis of data logs. The benchmark will serve as a basis for investigating query processing and indexing of large-scale numerical logs.
Khalid Mahmood, Thanh Truong, Tore Risch
Horizontal Fragmentation and Replication for Multiple Relaxation Attributes
Abstract
The data replication problem (DRP) describes the task of distributing copies of data records (that is, database fragments) among a set of servers in a distributed database system. For the application of flexible query answering, several fragments can be overlapping (in terms of tuples in a database table). In this paper, we provide a formulation of the DRP for horizontal fragmentations with overlapping fragments; subsequently we devise a recovery procedure based on these fragmentations.
Lena Wiese

Scalability

Frontmatter
Scalable Queries Over Log Database Collections
Abstract
Various business application scenarios need to analyse the working status of products, e.g. to discover abnormal machine behaviours from logged sensor readings. The geographic locations of machines are often widely distributed, and the logged sensor readings are stored locally in autonomous relational databases, here called log databases, where they can be analysed through queries. A global meta-database is required to describe machines, sensors, measurements, etc. Queries to the log databases can be expressed in terms of these meta-data. FLOQ (Fused LOg database Query processor) enables queries searching collections of distributed log databases combined through a common meta-database. To speed up queries combining meta-data with distributed logged sensor readings, sub-queries to the log databases should be run in parallel. We propose two new strategies using standard database APIs to join meta-data with data retrieved from distributed autonomous log databases. The performance of the strategies is empirically compared with a state-of-the-art previous strategy for joining autonomous databases. A cost model is used to predict the efficiency of each strategy and guide the experiments. We show that the proposed strategies substantially improve query performance as the size of the selected meta-data or the number of log databases increases.
Minpeng Zhu, Khalid Mahmood, Tore Risch
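The parallel sub-query pattern described above can be sketched with standard APIs only: in-memory SQLite databases stand in for the autonomous log databases, a thread pool issues the sub-queries concurrently, and the selected readings are joined with the meta-data afterwards. This is not FLOQ's actual API; the schema, the meta-data table, and all names are assumptions.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical meta-database: machine id -> location.
META = {"m1": "Edinburgh", "m2": "Glasgow"}

def query_log_db(machine, readings, threshold):
    """One worker per log database; each worker owns its own
    connection, standing in for one autonomous remote database."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE log (machine TEXT, value REAL)")
    con.executemany("INSERT INTO log VALUES (?, ?)",
                    [(machine, v) for v in readings])
    rows = con.execute(
        "SELECT machine, value FROM log WHERE value > ?",
        (threshold,)).fetchall()
    con.close()
    return rows

def fused_query(sources, threshold):
    """Run all sub-queries in parallel, then join results with META."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(query_log_db, m, r, threshold)
                   for m, r in sources.items()]
        return [(m, META[m], v)
                for f in futures for m, v in f.result()]
```

The join with the meta-data happens after the parallel fan-out; the two strategies the paper proposes differ precisely in where and how that join is shipped, which this sketch does not model.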
ECST – Extended Context-Free Straight-Line Tree Grammars
Abstract
Grammar-based compressors such as CluX [1], BPLEX [2], and TreeRePAIR [3] transform an XML tree X into a context-free straight-line linear tree (CSLT) grammar G, and yield strong compression ratios compared to other classes of XML-specific compressors. However, CSLT grammars have the disadvantage that simulating on G update operations, like inserting, deleting, or re-labeling a node V of X, requires isolating the path from X's root to V from all the paths represented by G. Usually, this leads to increased redundancy within G, as grammar rules are copied and modified, but the original and the modified grammar rules often differ only slightly. In this paper, we propose extended context-free straight-line tree (ECST) grammars, which allow reducing the redundancy created by path isolation. Furthermore, we show how to query and how to update ECST-compressed grammars.
Stefan Böttcher, Rita Hartel, Thomas Jacobs, Markus Jeromin
Configuring Spatial Grids for Efficient Main Memory Joins
Abstract
The performance of spatial joins is becoming increasingly important in many applications, particularly in the scientific domain. Several approaches have been proposed for joining spatial datasets on disk, but few in main memory. Recent results show that in main memory, grids are more efficient than the traditional tree-based methods primarily developed for disk. The question of how to configure the grid, however, has so far not been addressed.
In this paper we study how to configure a spatial grid for joining spatial data in main memory. We discuss the trade-offs involved, develop an analytical model predicting the performance of a configuration and finally validate the model with experiments.
Farhan Tauheed, Thomas Heinis, Anastasia Ailamaki
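The configuration question above can be made concrete with a minimal uniform-grid distance join: points are bucketed into square cells, and each probe point only inspects the cells its search radius can reach. The cell size is exactly the knob whose trade-off the paper models (smaller cells mean fewer distance tests but more cell lookups). This sketch and its defaults are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import defaultdict

def build_grid(points, cell):
    """Bucket 2-D points into square cells of side `cell`."""
    grid = defaultdict(list)
    for p in points:
        grid[(int(p[0] // cell), int(p[1] // cell))].append(p)
    return grid

def grid_join(a, b, eps, cell=None):
    """Return all pairs (p, q), p in a, q in b, with dist(p, q) <= eps."""
    cell = cell or eps          # cell size is the main tuning knob
    grid = build_grid(a, cell)
    reach = int(math.ceil(eps / cell))
    out = []
    for q in b:
        cx, cy = int(q[0] // cell), int(q[1] // cell)
        # only cells within `reach` of q's cell can hold matches
        for dx in range(-reach, reach + 1):
            for dy in range(-reach, reach + 1):
                for p in grid.get((cx + dx, cy + dy), []):
                    if math.dist(p, q) <= eps:
                        out.append((p, q))
    return out
```

Passing different `cell` values into `grid_join` is the experiment an analytical model like the paper's would predict: too coarse and each probe scans many candidates, too fine and the neighbourhood loop dominates.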
Transactional and Incremental Type Inference from Data Updates
Abstract
A distinctive property of relational database systems is the ability to perform data updates and queries in atomic blocks called transactions, with the well-known ACID properties. To date, the ability of reasoning systems to maintain the ACID properties, even over data held within a relational database, has been largely ignored. This paper studies an approach to reasoning over data from OWL 2 ontologies held in a relational database, where the ACID properties of transactions are maintained. Taking an incremental approach to maintaining materialised views of the results of reasoning, the approach is demonstrated to support query and reasoning performance comparable to or better than other OWL reasoning systems, while adding the important benefit of supporting transactions.
Yu Liu, Peter McBrien
Backmatter
Metadata
Title
Data Science
Edited by
Sebastian Maneth
Copyright Year
2015
Electronic ISBN
978-3-319-20424-6
Print ISBN
978-3-319-20423-9
DOI
https://doi.org/10.1007/978-3-319-20424-6