
2018 | Book

Data Management Technologies and Applications

6th International Conference, DATA 2017, Madrid, Spain, July 24–26, 2017, Revised Selected Papers


About this book

This book constitutes the thoroughly refereed proceedings of the 6th International Conference on Data Management Technologies and Applications, DATA 2017, held in Madrid, Spain, in July 2017. The 13 revised full papers were carefully reviewed and selected from 66 submissions. The papers deal with the following topics: databases, big data, data mining, data management, data security, and other aspects of information systems and technology involving advanced applications of data.

Table of Contents

Frontmatter

Business Analytics

Frontmatter
An Overview of Transfer Learning Focused on Asymmetric Heterogeneous Approaches
Abstract
In practice we often encounter classification tasks. In order to solve these tasks, we need a sufficient amount of quality data to construct an accurate classification model. However, in some cases collecting quality data poses a demanding challenge in terms of time and finances. For example, in the medical area we often face a lack of data about patients. Transfer learning introduces the idea that a possible solution is to combine data from different domains, represented by different feature spaces, that relate to the same task. We can also transfer knowledge from a different but related task that has already been learned. This overview focuses on the current progress in the novel area of asymmetric heterogeneous transfer learning. We discuss approaches and methods for solving these types of transfer learning tasks. Furthermore, we mention the most used metrics and the possibility of using metric or similarity learning.
Magda Friedjungová, Marcel Jiřina
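
A recurring building block in asymmetric heterogeneous transfer learning is a mapping that projects source-domain features into the target feature space so that a target-side model can be reused. The sketch below is a generic illustration of that idea under the assumption that a few paired "bridge" instances exist in both feature spaces; it is not one of the specific methods covered by the survey.

```python
import numpy as np

# Illustrative sketch only: learn a linear projection that maps source-domain
# features into the target feature space using a few paired "bridge" instances,
# so that a classifier trained on target data can score mapped source data.

rng = np.random.default_rng(0)
Xs_bridge = rng.normal(size=(30, 8))              # bridge instances, source space (8-dim)
Xt_bridge = Xs_bridge @ rng.normal(size=(8, 5))   # same instances, target space (5-dim)

# Least-squares projection P with Xs_bridge @ P ~= Xt_bridge
P, *_ = np.linalg.lstsq(Xs_bridge, Xt_bridge, rcond=None)

Xs_new = rng.normal(size=(100, 8))   # additional source-domain data
Xs_mapped = Xs_new @ P               # now lives in the 5-dim target space
print(Xs_mapped.shape)               # (100, 5)
```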
A Mathematical Model for Customer Lifetime Value Based Offer Management
Abstract
Customers with prepaid lines carry a higher attrition risk than postpaid customers, since prepaid customers do not sign long-term obligatory contracts and may churn at any time. For this reason, mobile operators have to offer engaging benefits to keep prepaid subscribers with the company. Since all such offers incur additional cost, mobile operators face an optimization problem when selecting the most suitable offers for customers at risk. In this study, an offer management framework targeting prepaid customers of a telecommunication company is developed. The proposed framework chooses the most suitable offer for each customer through a mathematical model that utilizes customer lifetime value and churn risk. Lifetime values are estimated using logistic regression and Pareto/NBD models, and several variants of these models are used to predict churn risks based on a large number of customer-specific features.
Ahmet Şahin, Zehra Can, Erinc Albey
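
As a rough illustration of how churn risk and lifetime value can feed an offer decision, the hedged sketch below combines a logistic-regression churn score with a Pareto/NBD activity estimate (via the `lifetimes` package) and picks the offer with the best expected payoff. All data, offers, and coefficients are invented; the paper's actual mathematical model is more elaborate.

```python
import numpy as np
from lifetimes import ParetoNBDFitter
from sklearn.linear_model import LogisticRegression

# Hedged sketch, not the paper's model: score churn risk, estimate expected
# future activity, then choose the offer maximizing expected retained value
# minus offer cost. All values below are synthetic.

rng = np.random.default_rng(7)
n = 300
freq = rng.poisson(3, n)                                   # top-ups observed so far
T = np.full(n, 90.0)                                       # observation window (days)
rec = np.where(freq > 0, rng.uniform(1, 90, n), 0.0)       # recency of last top-up
avg_topup = rng.uniform(3, 15, n)                          # average top-up amount
churned = (rng.random(n) < 0.5 / (1 + freq)).astype(int)   # synthetic churn label

X = np.column_stack([freq, rec, avg_topup])
p_churn = LogisticRegression(max_iter=1000).fit(X, churned).predict_proba(X)[:, 1]

pnbd = ParetoNBDFitter(penalizer_coef=0.01)
pnbd.fit(freq, rec, T)
expected_topups = pnbd.conditional_expected_number_of_purchases_up_to_time(90, freq, rec, T)
clv = expected_topups * avg_topup                          # crude 90-day value proxy

offers = {"bonus_data": (2.0, 0.15), "free_minutes": (1.0, 0.08)}   # (cost, churn cut)
def best_offer(i):
    return max(offers, key=lambda o: clv[i] * min(p_churn[i], offers[o][1]) - offers[o][0])

print(best_offer(0), best_offer(1))
```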
Construction of Semantic Data Models
Abstract
The production of scientific publications has increased by 8–9% each year during the previous six decades [1]. In order to conduct state-of-the-art research, scientists and scholars have to dig relevant information out of a large volume of documents. Additional challenges in analyzing scientific documents include the variability of publishing standards, formats, and domains. Novel methods are needed to analyze publications and find concrete information in them rapidly. In this work, we present a conceptual design to systematically build semantic data models using relevant elements, including context, metadata, and tables, that appear in publications from any domain. To enrich the models, as well as to provide semantic interoperability among documents, we use general-purpose ontologies and a vocabulary to organize their information. The resulting models allow us to synthesize, explore, and exploit information promptly.
Martha O. Perez-Arriaga, Trilce Estrada, Soraya Abad-Mota
Mining and Linguistically Interpreting Summaries from Surveyed Data Related to Financial Literacy and Behaviour
Abstract
Financial decisions are important decisions in everyday life, as they can affect the financial well-being of individuals. These decisions are influenced by many factors, including the level of financial literacy, emotions, heuristics, and biases. This paper is devoted to mining and interpreting information about the effect of financial literacy on individuals' behaviour (angst, fear, nervousness, loss of control, anchoring, and risk taking) from questionnaire survey data by applying linguistic summaries. Fuzzy sets and fuzzy logic allow us to mathematically formalize linguistic terms such as most of, high literacy, low angst and the like, and to interpret mined knowledge as short quantified sentences of natural language. This approach is suitable for managing semantic uncertainty in data and in concepts. The results have shown that for the majority of respondents with a low level of financial literacy, angst and other threats represent serious issues, as expected. On the other hand, about half of the respondents with a high level of literacy do not consider these threats significant. This effect is emphasized by experimenting with the socio-demographic characteristics of respondents. This research has also observed problems in applying linguistic summaries to questionnaire data and suggests some recommendations.
Miroslav Hudec, Zuzana Brokešová
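
The linguistic summaries mentioned in the abstract are quantified sentences whose truth degree can be computed in a few lines. The sketch below evaluates "most respondents with low literacy have high angst" in the classic Zadeh/Yager style; the membership functions and survey values are assumptions for illustration, not the ones used in the paper.

```python
import numpy as np

# Truth of the summary "most respondents with LOW literacy have HIGH angst".
# Membership functions and data are invented for illustration.

literacy = np.array([2, 3, 1, 8, 9, 2, 4, 7, 1, 3])   # hypothetical survey scores (0-10)
angst    = np.array([8, 7, 9, 2, 3, 6, 8, 4, 9, 7])

def low(x):   return np.clip((5 - x) / 5, 0, 1)        # membership in "low"
def high(x):  return np.clip((x - 5) / 5, 0, 1)        # membership in "high"
def most(p):  return np.clip((p - 0.3) / 0.5, 0, 1)    # fuzzy quantifier "most"

mu_f = low(literacy)                  # restriction: low financial literacy
mu_g = high(angst)                    # summarizer: high angst
proportion = np.minimum(mu_f, mu_g).sum() / mu_f.sum()
print("truth of the summary:", most(proportion))
```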

Data Management and Quality

Frontmatter
Advanced Data Integration with Signifiers: Case Studies for Rail Automation
Abstract
In Rail Automation, planning future projects requires the integration of business-critical data from heterogeneous, often noisy data sources. Current integration approaches often neglect uncertainties and inconsistencies in the integration process and thus cannot guarantee the necessary data quality. To tackle these issues, we propose a semi-automated process for data import in which the user resolves ambiguous data classifications. The task of finding the correct data warehouse entry for a source value in a proprietary, often semi-structured format is supported by the notion of a signifier, which is a natural extension of composite primary keys. In three different case studies we show that this approach (i) facilitates high-quality data integration while minimizing user interaction, (ii) leverages approximate name matching of railway station and entity names, and (iii) helps extract features from contextual data for cross-checks, and thus supports the planning phases of railway projects.
Alexander Wurl, Andreas Falkner, Alois Haselböck, Alexandra Mazak
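
The approximate name matching mentioned in point (ii) can be illustrated with a small ranking step that defers ambiguous cases to the user, in line with the semi-automated import process. The station names, similarity measure, and ambiguity threshold below are assumptions; the paper's signifier concept is richer than this sketch.

```python
from difflib import SequenceMatcher

# Illustrative sketch: rank candidate warehouse entries for a noisy source
# value and ask the user to confirm only when the best match is ambiguous.

warehouse = ["Wien Hauptbahnhof", "Wiener Neustadt Hbf", "St. Poelten Hbf"]

def best_matches(source_value, candidates, top=2):
    scored = [(SequenceMatcher(None, source_value.lower(), c.lower()).ratio(), c)
              for c in candidates]
    return sorted(scored, reverse=True)[:top]

matches = best_matches("Hbf. Wr. Neustadt", warehouse)
(best_score, best), (second_score, _) = matches
if best_score - second_score < 0.1:          # ambiguous: defer to the user
    print("please confirm:", matches)
else:
    print("auto-matched to:", best)
```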
Narrative Annotation of Content for Cultural Legacy Preservation
Abstract
An important, yet underestimated, aspect of cultural heritage preservation is the analysis of personal narratives told by citizens. In this paper, we present a server architecture that facilitates multimedia content storage and sharing, along with the management of associated narrative information. By exposing a RESTful interface, the proposed solution enables the collection of textual narratives in raw form, as well as the extraction of related narrative knowledge. We apply it to a corpus related to the time of the European construction in Luxembourg. We disclose details about our conceptual model and implementation, as well as experiments demonstrating the value of our approach.
Pierrick Bruneau
Determining Appropriate Large Object Stores with a Multi-criteria Approach
Abstract
The area of storage solutions is becoming more and more heterogeneous. Even for relational databases there are several offerings, which differ from vendor to vendor and are available for different deployments such as on-premises or in the Cloud, as Platform-as-a-Service (PaaS) or as a dedicated Virtual Machine at the Infrastructure-as-a-Service (IaaS) level. Beyond traditional relational databases, the NoSQL idea has gained a lot of traction, and various services and products are available from several providers. Each storage solution has virtues of its own for certain aspects, even within the same product category. For example, some systems are offered as cloud services and pursue a pay-as-you-go principle without upfront investments or license costs. Others can be installed on premises, thus achieving higher privacy and security. Some store data redundantly to achieve high reliability at higher cost. This paper suggests a multi-criteria approach for finding appropriate storage for large objects. Large objects might be, for instance, images of virtual machines, high-resolution analysis images, or consumer videos. Multi-criteria means that individual storage requirements can be attached to objects and containers, with the overall goal of relieving applications from the burden of finding corresponding appropriate storage systems. For efficient storage and retrieval, a metadata-based approach is presented in which metadata is associated with storage objects and containers. The heterogeneity of the involved systems and their interfaces is handled by a federation approach that allows several storages to be used transparently in parallel. Altogether, applications benefit from the specific advantages of particular storage solutions for specific problems. In particular, the paper presents the required extensions for an object storage developed by the VISION Cloud project.
Uwe Hohenstein, Spyridon V. Gogouvitis, Michael C. Jaeger
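
A minimal sketch of the multi-criteria idea: per-object (or per-container) requirement weights are matched against scored properties of candidate stores, and the best-scoring store is chosen. The criteria names and scores below are made up and are not the VISION Cloud extensions described in the paper.

```python
# Illustrative multi-criteria match between an object's storage requirements
# and candidate stores; all criteria and scores are invented.

stores = {
    "cloud_blob":   {"cost": 0.9, "privacy": 0.4, "reliability": 0.8, "latency": 0.6},
    "on_prem_nas":  {"cost": 0.5, "privacy": 0.9, "reliability": 0.6, "latency": 0.9},
    "archive_tier": {"cost": 1.0, "privacy": 0.6, "reliability": 0.9, "latency": 0.2},
}

def pick_store(requirements):
    """requirements: per-criterion weights attached to an object or container."""
    def score(store):
        return sum(weight * stores[store][criterion]
                   for criterion, weight in requirements.items())
    return max(stores, key=score)

# A VM image: reliability matters most, latency least.
print(pick_store({"cost": 0.2, "privacy": 0.3, "reliability": 0.4, "latency": 0.1}))
```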
Player Performance Evaluation in Team-Based First-Person Shooter eSport
Abstract
Electronic sports, or pro gaming, have become very popular in this millennium, and the increased value of this new industry is attracting investors with various interests. One of these interests is game betting, which requires player and team rating, game result prediction, and fraud detection techniques. This paper discusses several aspects of the analysis of game recordings of Counter-Strike: Global Offensive, including decoding the game recordings, matching different sources of player data, quantifying player performance, and evaluating the economic aspects of the game.
David Bednárek, Martin Kruliš, Jakub Yaghob, Filip Zavoral
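
As a loose illustration of quantifying player performance from decoded recordings, the sketch below aggregates a simple per-round impact score per player. The weights and round statistics are assumptions, not the metric proposed in the paper.

```python
# Illustrative per-round impact score over hypothetical stats parsed from a demo file.

rounds = [
    {"player": "A", "kills": 2, "damage": 180, "survived": True,  "planted": False},
    {"player": "A", "kills": 0, "damage": 40,  "survived": False, "planted": False},
    {"player": "B", "kills": 1, "damage": 95,  "survived": True,  "planted": True},
]

def impact(r):
    # Invented weights purely for illustration.
    return 0.7 * r["kills"] + 0.002 * r["damage"] + 0.3 * r["survived"] + 0.4 * r["planted"]

by_player = {}
for r in rounds:
    by_player.setdefault(r["player"], []).append(impact(r))

for player, scores in by_player.items():
    print(player, round(sum(scores) / len(scores), 3))
```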
Utilization Measures in a Learning Management System
Abstract
Learning Management Systems (LMSs) are becoming more and more popular and incorporate many different functionalities. For this reason, an evaluation of the quantitative utilization of all the parts of an LMS is essential. In this research we propose indicators and techniques which make it possible to understand in detail how a functionality is accessed by the users. These analytic tools are useful in particular for the administrators of the LMS, who are in charge of allocating resources according to the workload and importance of the functionalities. We tested the proposed indicators on data about the messaging functionality obtained from the LMS of Università degli Studi di Milano-Bicocca (Milan, Italy). Although the students' messages can potentially be a source of big data, in the present case the utilization turned out to be limited. With this analysis it has been possible to notice a similarity between the utilization of the message system and the empirical Zipf law. We also describe the structure of a dashboard which gives access to the indicators and is a step towards the definition of a global tool for students, teachers and administrators.
Floriana Meluso, Paolo Avogadro, Silvia Calegari, Matteo Dominoni
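
The comparison with Zipf's law can be illustrated by fitting a power law to the rank-frequency curve of a utilization indicator. The message counts below are synthetic, not the Milano-Bicocca data.

```python
import numpy as np

# Fit a power law to a rank-frequency curve in log-log space; an exponent
# near -1 corresponds to the classic Zipf distribution.

messages_per_user = np.array([1200, 640, 410, 300, 150, 90, 55, 30, 12, 5])  # hypothetical
ranks = np.arange(1, len(messages_per_user) + 1)

slope, intercept = np.polyfit(np.log(ranks), np.log(messages_per_user), 1)
print(f"fitted exponent: {slope:.2f} (Zipf's law corresponds to roughly -1)")
```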
Experiences in the Development of a Data Management System for Genomics
Abstract
GMQL is a high-level query language for genomics, which operates on datasets described through GDM, a unifying data model for processed data formats. They are ingredients for the integration of processed genomic datasets, i.e. of signals produced by the genome after sequencing and long data extraction pipelines. While most of the processing load of today’s genomic platforms is due to data extraction pipelines, we anticipate soon a shift of attention towards processed datasets, as such data are being collected by large consortia and are becoming increasingly available.
In our view, biology and personalized medicine will increasingly rely on data extraction and analysis methods for inferring new knowledge from existing heterogeneous repositories of processed datasets, typically augmented with the results of experimental data targeting individuals or small populations. While today’s big data are raw reads of the sequencing machines, tomorrow’s big data will also include billions or trillions of genomic regions, each featuring specific values depending on the processing conditions.
In keeping with this view, GMQL is a high-level, declarative language inspired by big data management, and its execution engines include classic cloud-based systems, from Pig to Flink to SciDB to Spark. In this paper, we discuss how the GMQL execution environment has been developed, going through a major version change that marked a complete system redesign; we also discuss our experiences in comparatively evaluating the four platforms.
Stefano Ceri, Arif Canakoglu, Abdulrahman Kaitoua, Marco Masseroli, Pietro Pinoli
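
GMQL expresses region-based operations declaratively; a typical one maps experiment regions onto reference regions and counts overlaps. The sketch below shows that overlap-counting idea in plain Python on invented coordinates, purely to illustrate the kind of computation the language and its engines distribute.

```python
# Count, for every reference region, how many experiment regions overlap it.
# Regions are (chromosome, start, stop); coordinates are made up.

reference = [("chr1", 100, 200), ("chr1", 500, 800), ("chr2", 50, 120)]
experiment = [("chr1", 150, 160), ("chr1", 190, 600), ("chr2", 10, 40)]

def overlaps(a, b):
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

for ref in reference:
    count = sum(overlaps(ref, region) for region in experiment)
    print(ref, "->", count, "overlapping experiment regions")
```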

Databases and Data Security

Frontmatter
Server-Side Database Credentials: A Security Enhancing Approach for Database Access
Abstract
Database applications are very pervasive tools that enable businesses to make the most of the data they collect and generate. Furthermore, they can be used to provide services on top of such data that access, process, modify and explore it. The work this paper extends argued that when client applications accessing a database directly run in public or semi-public locations that are not highly secured (such as a reception desk), the database credentials they use could be stolen by a malicious user. To prevent such an occurrence, solutions such as virtual private networks (VPNs) can be used to secure access to the database. However, among other problems, VPNs can be bypassed by accessing the database from within the business network in an internal attack. A methodology called Secure Proxied Database Connectivity (SPDC) is presented which pushes the database credentials out of the client applications and divides the information required to access them between a proxy and an authentication server, while supporting existing tools and protocols that provide access to databases, such as JDBC. This paper further details the approach in terms of attack scenarios, implementation, and discussion.
Diogo Domingues Regateiro, Óscar Mortágua Pereira, Rui L. Aguiar
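
The core idea of pushing credentials to the server side can be illustrated with a toy proxy: the client presents only a session token, and the proxy resolves it to credentials the client never sees before opening the database connection. This is a hedged sketch of the general principle, not the SPDC protocol; sqlite3 stands in for the JDBC-style access the paper targets, and the token and credential tables are invented.

```python
import sqlite3

# Minimal sketch: credentials live only on the proxy side; the client holds a token.

SESSION_TOKENS = {"token-123": "reception_app"}               # issued by an auth server
DB_CREDENTIALS = {"reception_app": {"database": ":memory:"}}  # never sent to clients

def open_connection_for(token):
    app = SESSION_TOKENS.get(token)
    if app is None:
        raise PermissionError("unknown or expired session token")
    creds = DB_CREDENTIALS[app]          # resolved server-side only
    return sqlite3.connect(creds["database"])

conn = open_connection_for("token-123")
conn.execute("CREATE TABLE visitors (name TEXT)")   # queries are relayed by the proxy
```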
ChronoGraph: A Versioned TinkerPop Graph Database
Abstract
Database content versioning is an established concept in modern SQL databases and became part of the SQL standard in 2011. It is used in business applications to support features such as traceability of changes, auditing, historical data analysis and trend analysis. However, versioning capabilities have barely been considered outside the relational context so far; in particular, in the emerging graph technologies these aspects are being neglected by database vendors. This paper presents ChronoGraph (this work was partially funded by the research project “txtureSA”, FWF project P 29022), the first full-featured TinkerPop-compliant graph database that provides support for transparent system-time content versioning and analysis. The paper offers two key contributions: we present the concepts and architecture of ChronoGraph as a new addition to the state of the art in graph databases, and we provide our implementation as an open-source project. In order to demonstrate the feasibility of our proposed solution, we compare it with existing, non-versioned graph databases in a controlled experiment.
Martin Haeusler, Thomas Trojer, Johannes Kessler, Matthias Farwick, Emmanuel Nowakowski, Ruth Breu
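
System-time versioning of this kind boils down to keeping, per element, a history of (commit time, value) pairs and resolving reads "as of" a timestamp. The sketch below shows that lookup pattern; it assumes commits arrive in timestamp order and is not ChronoGraph's actual storage layout.

```python
from bisect import bisect_right

# Illustrative "as of" lookup over per-element version histories.

history = {}   # element id -> list of (commit_time, value), appended in commit order

def put(elem, t, value):
    history.setdefault(elem, []).append((t, value))

def get_as_of(elem, t):
    versions = history.get(elem, [])
    i = bisect_right(versions, (t, chr(0x10FFFF)))   # last commit at or before t
    return versions[i - 1][1] if i else None

put("vertex:42/name", 10, "Alice")
put("vertex:42/name", 25, "Alice Smith")
print(get_as_of("vertex:42/name", 20))   # -> "Alice" (graph state at time 20)
print(get_as_of("vertex:42/name", 30))   # -> "Alice Smith"
```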
The Case for Personalized Anonymization of Database Query Results
Abstract
The benefit of performing big data computations over individuals' microdata is manifold, in the medical, energy or transportation fields to cite only a few, and this interest is growing with the emergence of smart disclosure initiatives around the world. However, these computations often expose microdata to privacy leakages, which explains the reluctance of individuals to participate in studies despite the privacy guarantees promised by statistical institutes.
In this paper, we consolidate our previous results to show how personalized privacy guarantees can be pushed into the processing of database queries. By doing so, individuals can disclose different amounts of information (i.e. data at different levels of accuracy) depending on their own perception of the risk, and we discuss the different possible semantics of such models.
Moreover, we propose a decentralized computing infrastructure based on secure hardware that enforces these personalized privacy guarantees all along the query execution process. A complete performance analysis and implementation of our solution show the effectiveness of the approach in tackling generic large-scale database queries.
Axel Michel, Benjamin Nguyen, Philippe Pucheral
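
A minimal sketch of the personalized-accuracy idea: each individual chooses how coarsely their value is generalized before it enters an aggregate query. The attribute, widths, and rounding rule below are assumptions for illustration and ignore the secure-hardware query execution described in the paper.

```python
# Each record carries the generalization width chosen by its owner; values are
# coarsened to that width before aggregation. All values are invented.

records = [   # (salary, chosen generalization width)
    (2150, 100),    # privacy-relaxed: rounded to the nearest 100
    (3890, 1000),   # privacy-conscious: rounded to the nearest 1000
    (2760, 500),
]

def generalize(value, width):
    lower = (value // width) * width
    return lower + width / 2          # midpoint of the disclosed interval

disclosed = [generalize(v, w) for v, w in records]
print("average over generalized values:", sum(disclosed) / len(disclosed))
```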
Backmatter
Metadata
Title
Data Management Technologies and Applications
Editors
Joaquim Filipe
Jorge Bernardino
Christoph Quix
Copyright Year
2018
Electronic ISBN
978-3-319-94809-6
Print ISBN
978-3-319-94808-9
DOI
https://doi.org/10.1007/978-3-319-94809-6
