2019 | Book

Beyond Databases, Architectures and Structures. Paving the Road to Smart Data Processing and Analysis

15th International Conference, BDAS 2019, Ustroń, Poland, May 28–31, 2019, Proceedings

Edited by: Prof. Dr. Stanisław Kozielski, Dariusz Mrozek, Paweł Kasprowski, Bożena Małysiak-Mrozek, Daniel Kostrzewa

Publisher: Springer International Publishing

Book Series: Communications in Computer and Information Science

About this book

This book constitutes the refereed proceedings of the 15th International Conference entitled Beyond Databases, Architectures and Structures, BDAS 2019, held in Ustroń, Poland, in May 2019.

It consists of 26 carefully reviewed papers selected from 69 submissions. The papers are organized in topical sections, namely big data and cloud computing; architectures, structures and algorithms for efficient data processing and analysis; artificial intelligence, data mining and knowledge discovery; image analysis and multimedia mining; bioinformatics and biomedical data analysis; industrial applications; networks and security.

Table of Contents

Frontmatter
Correction to: Serialization for Property Graphs
Dominik Tomaszuk, Renzo Angles, Łukasz Szeremeta, Karol Litman, Diego Cisterna

Big Data and Cloud Computing

Frontmatter
Nova: Diffused Database Processing Using Clouds of Components [Vision Paper]
Abstract
Nova proposes a departure from today’s complex monolithic database management systems (DBMSs) offered as a cloud service. It advocates a serverless alternative consisting of a cloud of simple components that communicate using high-speed networks. Nova will monitor the workload of an application continuously, configuring the DBMS to use the implementation of a component most suitable for processing the workload. In response to load fluctuations, it will adjust the knobs of a component to scale it to meet the performance requirements of the application. The vision of Nova is compelling because it adjusts resource usage, preventing both over-provisioning of resources that sit idle and over-utilization that yields low performance, thereby optimizing the total cost of ownership. In addition to introducing Nova, this vision paper presents key research challenges that must be addressed to realize Nova. We explore two challenges in detail.
Shahram Ghandeharizadeh, Haoyu Huang, Hieu Nguyen
Big Data in Power Generation
Abstract
A coal-fired power plant regularly produces enormous amounts of data from its sensors, control, and monitoring systems. The volume of this data will keep increasing due to widely available smart meters, Wi-Fi devices, and rapidly developing IT systems. Big data technology makes it possible to use such types and volumes of data and could be an adequate solution in areas that have not yet been touched by information technology. This paper describes the possibility of using big data technology to improve internal processes, using a coal-fired power plant as an example. The review of new technologies is made from an internal point of view, drawing on the professional experience of the authors. We take a closer look at the power generation process and try to find areas in which to develop insights, hopefully enabling us to create more value for the industry.
Marek Moleda, Dariusz Mrozek
Using GPU to Accelerate Correlation on Seismic Signal
Abstract
In analyzing the quality of a seismic signal, the fundamental mathematical operation is the convolution of the signal with a basic (reference) signal. Analyses carried out in the field need solutions that can be executed on a single machine, while the size of the data processed from land seismic surveys is on the order of tens of terabytes. In this article, efficient computation of the convolution on GPU cores is proposed. We show that this approach is faster than even parallel programming on the CPU, and how large a performance gain was achieved using a graphics card that is several times less expensive than the CPU used.
Dominika Pawłowska, Piotr Wiśniewski
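The abstract describes computing the convolution of a long seismic trace with a reference signal on the GPU. The snippet below is a hedged sketch of one way to do this with CuPy and an FFT-based convolution; it is an assumption for illustration, not the authors' implementation, and the signal sizes are made up.

```python
# Illustrative sketch (not the authors' code): FFT-based convolution of a long
# seismic trace with a reference ("basic") signal on the GPU using CuPy.
import cupy as cp

def gpu_convolve(trace, basic):
    """Linear convolution of two 1-D signals computed on the GPU via FFT."""
    n = trace.size + basic.size - 1          # length of the full convolution
    nfft = 1 << (n - 1).bit_length()         # next power of two for the FFT
    spec = cp.fft.rfft(trace, nfft) * cp.fft.rfft(basic, nfft)
    return cp.fft.irfft(spec, nfft)[:n]

# Example: a 10-million-sample trace convolved with a 2001-sample wavelet.
trace = cp.random.standard_normal(10_000_000, dtype=cp.float32)
basic = cp.random.standard_normal(2_001, dtype=cp.float32)
result = gpu_convolve(trace, basic)
cp.cuda.Stream.null.synchronize()            # wait for the GPU before timing
```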
Detection of Dangers in Human Health with IoT Devices in the Cloud and on the Edge
Abstract
Smart bands are wearable devices that are frequently used in monitoring people’s activity, fitness, and health state. They can also be used for early detection of possibly dangerous health-related problems. The increasing number of wearable devices frequently transmitting data to scalable monitoring centers located in the Cloud may raise the Big Data challenge and cause network congestion. In this paper, we focus on the storage space consumed while monitoring people with smart IoT devices and performing classification of their health state and detection of possibly dangerous situations with the use of machine learning models in the Cloud and on the Edge. We also test two different repositories for storing sensor data in the Cloud monitoring center: a relational Azure SQL Database and the Cosmos DB document store.
Mateusz Gołosz, Dariusz Mrozek

Architectures, Structures and Algorithms for Efficient Data Processing and Analysis

Frontmatter
Serialization for Property Graphs
Abstract
Graph serialization is very important for the development of graph-oriented applications. In particular, serialization methods are fundamental in graph data management to support database exchange, benchmarking of systems, and data visualization. This paper presents YARS-PG, a data format for serializing property graphs. YARS-PG was designed to be simple, extensible and platform independent, and to support all the features provided by the current database systems based on the property graph data model.
Dominik Tomaszuk, Renzo Angles, Łukasz Szeremeta, Karol Litman, Diego Cisterna
Evaluation of Key-Value Stores for Distributed Locking Purposes
Abstract
This paper presents an evaluation of key-value stores and corresponding algorithms with regard to the implementation of distributed locking mechanisms. The research focuses on the comparison of four key-value stores: etcd, Consul, Zookeeper, and Redis. For each selected store, the underlying implementation of locking mechanisms was described and evaluated with regard to satisfying the safety, deadlock-freedom, and fault-tolerance properties. For the purposes of performance testing, a small application supporting all of the key-value stores was developed. The application uses all of the selected solutions to perform computation while ensuring that a particular resource is locked during that operation. The aim of the conducted experiments was to evaluate the selected solutions based on performance and the properties that they hold, in the context of using them as a base for building a distributed locking system.
Piotr Grzesik, Dariusz Mrozek
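For the Redis entry among the evaluated stores, locking is commonly implemented with SET NX/EX plus a compare-and-delete release. The sketch below shows that well-known pattern with the redis-py client; key names, timeouts, and the host address are illustrative only and do not reflect the paper's test application.

```python
# Minimal sketch of a Redis-based lock (SET NX EX + compare-and-delete release).
import uuid
import redis

r = redis.Redis(host="localhost", port=6379)

def acquire_lock(name, ttl_seconds=10):
    """Try to take the lock; return a token on success, None otherwise."""
    token = str(uuid.uuid4())
    # SET key value NX EX ttl -- succeeds only if the key does not exist yet.
    if r.set(f"lock:{name}", token, nx=True, ex=ttl_seconds):
        return token
    return None

def release_lock(name, token):
    """Release the lock only if we still own it (atomic via a Lua script)."""
    script = """
    if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('del', KEYS[1])
    else
        return 0
    end"""
    return r.eval(script, 1, f"lock:{name}", token)

token = acquire_lock("shared-resource")
if token:
    try:
        pass  # critical section: work on the protected resource
    finally:
        release_lock("shared-resource", token)
```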
On Repairing Referential Integrity Constraints in Relational Databases
Abstract
Integrity constraints (ICs) are semantic conditions that a database should satisfy in order to be in a consistent state. Typically, ICs are declared with the database schema and enforced by the database management system (DBMS). However, in practice, ICs may not be specified to the DBMS along with the schema; this is considered bad database design and may lead to many problems such as inconsistency and anomalies. In this paper, we present a method to identify and repair missing referential integrity constraints (foreign keys). Our method comprises three steps of verification of candidate foreign keys: data-based, model-based, and brute-force.
Raji Ghawi
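The data-based verification step described above can be illustrated by checking whether a candidate foreign key has any orphan values. The sketch below is a hedged example with hypothetical table and column names, using SQLite only for illustration; it is not the paper's method, just the kind of consistency check the abstract refers to.

```python
# Hedged sketch of a data-based check for a candidate foreign key
# (child.col -> parent.key): the candidate is plausible only if no orphans exist.
import sqlite3

def orphan_count(conn, child_table, child_col, parent_table, parent_key):
    """Count child rows whose value has no match in the parent table."""
    query = f"""
        SELECT COUNT(*)
        FROM {child_table} c
        LEFT JOIN {parent_table} p ON c.{child_col} = p.{parent_key}
        WHERE c.{child_col} IS NOT NULL AND p.{parent_key} IS NULL
    """
    return conn.execute(query).fetchone()[0]

conn = sqlite3.connect("example.db")
# Hypothetical candidate: orders.customer_id REFERENCES customers.id
if orphan_count(conn, "orders", "customer_id", "customers", "id") == 0:
    print("data-based check passed: candidate foreign key is consistent")
```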
Interactive Decomposition of Relational Database Schemes Using Recommendations
Abstract
Schema decomposition is a well-known method for logical database design. Decomposition mainly aims at redundancy reduction and elimination of anomalies. A good decomposition should preserve dependencies and maintain recoverability of information. We propose a semi-automatic method for decomposing a relational schema in an interactive way. A database designer can build the subschemes step by step, guided by quantitative measures of decomposition “goodness”. At each step, a ranked set of recommendations is provided to guide the designer toward the next possible actions that lead to a better design.
Raji Ghawi

Artificial Intelligence, Data Mining and Knowledge Discovery

Frontmatter
Comparison Study on Convolution Neural Networks (CNNs) vs. Human Visual System (HVS)
Abstract
Computer vision image recognition has undergone remarkable evolution due to available large-scale datasets (e.g., ImageNet, UECFood) and the evolution of deep Convolutional Neural Networks (CNNs). A CNN’s learning method is data-driven, requiring sufficiently large training data containing organized hierarchical image features, such as annotations, labels, and distinct regions of interest (ROI). However, acquiring such datasets with comprehensive annotations is still a challenge in many domains. Currently, there are three main techniques to employ a CNN: training the network from scratch, using an off-the-shelf pre-trained network, and performing unsupervised pre-training with supervised adjustments. Deep learning networks for image classification, regression, and feature learning include Inception-v3, ResNet-50, ResNet-101, GoogLeNet, AlexNet, VGG-16, and VGG-19.
In this paper we exploit the use of three CNNs to solve detection problems. First, the different CNN architectures are evaluated. The studied CNN models contain from 5 thousand to 160 million parameters, which can vary depending on the number of layers. Second, the studied CNNs are evaluated based on dataset sizes and spatial image context. Results comparing performance, training time, and accuracy are analyzed. Third, based on human knowledge and human visual system (HVS) classification, the accuracy of the CNNs is studied and compared. Based on the obtained results it is possible to conclude that the HVS is more accurate when the dataset has a wide variety of images. However, if the dataset is focused only on niche images, the CNNs show better results than the HVS.
Manuel Caldeira, Pedro Martins, José Cecílio, Pedro Furtado
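The abstract lists three ways to employ a CNN; the snippet below is an illustrative sketch of the second option, loading an off-the-shelf pretrained network (ResNet-50, one of the architectures named above) via torchvision. The input image path is hypothetical and the preprocessing values are the standard ImageNet ones, not necessarily the paper's setup.

```python
# Sketch: using an off-the-shelf pretrained ResNet-50 for image classification.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg")           # hypothetical input image
batch = preprocess(image).unsqueeze(0)      # add the batch dimension
with torch.no_grad():
    probabilities = torch.softmax(model(batch), dim=1)
print(probabilities.argmax(dim=1))          # predicted ImageNet class index
```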
Multi-criteria Decision Analysis in the Railway Risk Management Process
Abstract
The article presents a way to use a Multi-criteria Decision Analysis (MCDA) method, the Analytic Hierarchy Process (AHP), for the assessment and selection of security and safety measures in the railway industry. The situation in the industry regarding security information exchange and the risk management process is presented briefly, as well as a proposal to support this process with elements of multi-criteria analysis.
Jacek Bagiński, Barbara Flisiuk, Wojciech Górka, Dariusz Rogowski, Tomasz Stęclik
NFL – Free Library for Fuzzy and Neuro-Fuzzy Systems
Abstract
The paper presents the «Neuro-Fuzzy Library» (NFL) – a free library for fuzzy and neuro-fuzzy systems. The library, written in C++, is available from the GitHub repository. The library implements data modifiers (for complete and incomplete data), clustering algorithms, fuzzy systems (descriptors, t-norms, premises, consequences, rules, and implications), and neuro-fuzzy systems (precomposed MA, TSK, ANNBFIS, and subspace ANNBFIS for both classification and regression tasks). The paper is accompanied by numerical examples.
Krzysztof Siminski
Detection of Common English Grammar Usage Errors
Abstract
Our research aims to provide writers with automated tools to detect grammatical usage errors and thus improve their writing. Correct English usage is often lacking in scientific and industry papers. [16] has compiled 130 common English usage errors. We address the automated detection of these errors, and their variations, that writers often make. Grammar checkers, e.g., [9] and [11], also implement error detection. Other researchers have employed machine learning and neural networks to detect errors. We parse only the part-of-speech (POS) tags, using different levels of generality of POS syntax and word-sense semantics. Our results provide accurate error detection and are feasible for a wide range of errors. Our algorithm precisely specifies the ability to increase or decrease the generality in order to prevent a large number of false positives. We derive this observation as a result of using the Brown corpus, which consists of 55,889 untagged sentences, covering most genres of English usage, both fiction and non-fiction. This corpus was much larger than any corpus employed by related researchers. We implemented 80 of Swan’s 130 most common rules and detected 35 true positives distributed among 15 of Swan’s rules. Such a low true-positive rate, 35/55889, had been expected. No false positives were detected. We employed a separate, smaller test suite of both true positive and true negative examples. Our system, as expected, correctly detected errors in all the true positive examples and ignored all the true negative ones. The Language-Tool system had a detection rate of 28/130 = 22%; Grammarly had a detection rate of 60/130 = 46%. Our results show significant improvement in the detection of common English usage errors.
Luke Immes, Haim Levkowitz
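The method works on part-of-speech tags over the untagged Brown corpus sentences. The sketch below is a minimal illustration of that setup with NLTK; the toy "double comparative" rule is purely illustrative and is not one of Swan's rules or the authors' detectors.

```python
# A minimal sketch (not the authors' system) of the POS-level view the paper
# works with: tag Brown corpus sentences and look for a simple usage pattern.
import nltk
from nltk.corpus import brown

nltk.download("brown")
nltk.download("averaged_perceptron_tagger")

# Tag the raw (untagged) Brown sentences with a generic POS tagger.
for sentence in brown.sents()[:100]:
    tagged = nltk.pos_tag(sentence)
    # Toy rule: flag "more <comparative adjective>" style double comparatives.
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if w1.lower() == "more" and t2 == "JJR":
            print("possible double comparative:", w1, w2)
```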
Link Prediction Based on Time Series of Similarity Coefficients and Structural Function
Abstract
A social network is a structure whose nodes represent people or other entities embedded in a social context while its edges symbolize interaction, collaboration or exertion of influence between these fore-mentioned entities [3]. From a wide class of problems related to social networks, the ones related to link dynamics seems particularly interesting. A noteworthy link prediction technique, based on analyzing the history of the network (i.e. its previous states), was presented by Prudêncio and da Silva Soares in [5]. In this paper, we attempt to improve the quality of edges’ formation prognosis in social networks by proposing a modified version of aforementioned method. For that purpose we shall compute values of certain similarity coefficients and use them as an input to a supervised classification mechanism (called structural function). We stipulate that this function changes over time, thus making it possible to derive time series for all of its parameters and obtain their next values using a forecasting model. We might then predict new links’ occurrences using the forecasted values of similarity metrics and supervised classification method with the predicted parameters. This paper contains also the comparison of ROC charts for both legacy solution and the novel method.
Piotr Stąpor, Ryszard Antkiewicz, Mariusz Chmielewski
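The method feeds time series of similarity coefficients, computed over successive network states, into the structural function. The sketch below is a hedged illustration of producing one such series (the Jaccard coefficient) across graph snapshots with networkx; the snapshot data is made up and the downstream classifier and forecasting model are omitted.

```python
# Hedged sketch: Jaccard similarity of a candidate node pair across snapshots.
import networkx as nx

def jaccard_series(snapshots, pair):
    """Jaccard coefficient of one node pair across a list of graph snapshots."""
    series = []
    for g in snapshots:
        _, _, score = next(iter(nx.jaccard_coefficient(g, [pair])))
        series.append(score)
    return series

# Three toy snapshots of an evolving network (hypothetical data).
g1 = nx.Graph([(1, 2), (2, 3), (1, 4)])
g2 = nx.Graph([(1, 2), (2, 3), (1, 4), (3, 4)])
g3 = nx.Graph([(1, 2), (2, 3), (1, 4), (3, 4), (2, 4)])

# This time series would feed the forecasting model / structural function.
print(jaccard_series([g1, g2, g3], (1, 3)))
```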
The Analysis of Relations Between Users on Social Networks Based on the Polish Political Tweets
Abstract
The article presents a method of acquiring selected topical data from Twitter conversations and storing it in a relational database schema with the use of a scheduled cyclical process. This kind of storage creates the opportunity to analyze the dependencies between users, hashtags, mentions, etc. based on SQL or its procedural extension. Additionally, it is possible to construct data views, which facilitates the creation of front-end applications leading to the efficient generation of cross-sections of data based on different features: time of creation, strength of relations, etc. Taking all of this into consideration, an application was developed that allows for the graphical representation of relations with the use of two algorithms: Fruchterman-Reingold and a radial one. This program supports visual analysis and additionally creates the opportunity to manipulate parts of a graph or separate nodes and obtain descriptions of their features. Some conclusions about the relations between people conversing about the Polish political scene are presented.
Adam Pelikant

Image Analysis and Multimedia Mining

Frontmatter
Poincaré Metric in Algorithms for Data Mining Tools
Abstract
Today we cannot imagine life without computers. The massive use of information and communication technologies has produced large amounts of data that are difficult to interpret and use. With data mining tools and machine learning methods, large data sets can be processed and used for prediction and classification. This paper employs the well-known k-nearest-neighbour classification algorithm and modifies it to use the Poincaré distance instead of the traditional Euclidean distance. The reason is that different industries (economy, health, military, ...) increasingly use and store databases of various images or photographs. When recognizing the similarity between two photographs, it is important that the algorithm recognizes certain patterns. Recognition is based on metrics. For this purpose an algorithm based on the Poincaré metric is tested on a data set of photos, and a comparison is made with an algorithm based on the Euclidean metric.
Alenka Trpin, Biljana Mileva Boshkoska, Pavle Boškoski
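Swapping the Euclidean metric for the Poincaré (hyperbolic ball) distance in k-NN can be sketched as below. This is a hedged illustration, not the paper's pipeline: it assumes feature vectors have already been mapped into the open unit ball, and uses random data instead of the photo dataset.

```python
# Illustrative sketch: k-NN classification with the Poincare ball distance.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def poincare_distance(u, v):
    """Poincare ball distance: arcosh(1 + 2|u-v|^2 / ((1-|u|^2)(1-|v|^2)))."""
    diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * diff / denom)

# Hypothetical image feature vectors already scaled into the open unit ball.
X_train = np.random.uniform(-0.3, 0.3, size=(200, 8))
y_train = np.random.randint(0, 3, size=200)
X_test = np.random.uniform(-0.3, 0.3, size=(20, 8))

knn = KNeighborsClassifier(n_neighbors=5, algorithm="brute",
                           metric=poincare_distance)
knn.fit(X_train, y_train)
print(knn.predict(X_test))
```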
Super-Resolution Reconstruction Using Deep Learning: Should We Go Deeper?
Abstract
Super-resolution reconstruction (SRR) is aimed at increasing image spatial resolution from multiple images presenting the same scene or from a single image based on the learned relation between low and high resolution. The emergence of deep learning allowed for improving single-image SRR significantly in the last few years, and a variety of deep convolutional neural networks of different depth and complexity were proposed for this purpose. However, although there are usually some comparisons reported in the papers introducing new deep models for SRR, such experimental studies are somewhat limited. First, the networks are often trained using different training data, and/or prepared in a different way. Second, the validation is performed for artificially-degraded images, which does not correspond to real-world conditions. In this paper, we report the results of our extensive experimental study to compare several state-of-the-art SRR techniques which exploit deep neural networks. We train all the networks using the same training setup and validate them using several datasets of different nature, including real-life scenarios. This allows us to draw interesting conclusions that may be helpful for selecting the most appropriate deep architecture for a given SRR scenario, as well as for creating new SRR solutions.
Daniel Kostrzewa, Szymon Piechaczek, Krzysztof Hrynczenko, Paweł Benecki, Jakub Nalepa, Michal Kawulok
Application of Fixed Skipped Steps Discrete Wavelet Transform in JP3D Lossless Compression of Volumetric Medical Images
Abstract
In this paper, we report preliminary results of applying step skipping to the discrete wavelet transform (DWT) in lossless compression of volumetric medical images. In particular, we generalize the two-dimensional (2D) fixed variants of skipped steps DWT (SS-DWT), which earlier were found effective for certain 2D images, to a three-dimensional (3D) case and employ them in the JP3D (JPEG 2000 standard extension for 3D data) compressor. For a set of medical volumetric images of the CT, MRI, and US modalities, we find that, by adaptively selecting 3D fixed variants of SS-DWT, we may improve the JP3D bitrates to an extent competitive with much more complex modifications of DWT and JPEG 2000.
Roman Starosolski

Bioinformatics and Biomedical Data Analysis

Frontmatter
A Novel Approach for Fast Protein Structure Comparison and Heuristic Structure Database Searching Based on Residue EigenRank Scores
Abstract
With the rapid growth of public protein structure databases, computational techniques for storing as well as comparing proteins in an efficient manner are still in demand. Proteins play a major role in virtually all processes in life, and comparing their three-dimensional structures is essential to understanding the functional and evolutionary relationships between them.
In this study, a novel approach to compute three-dimensional protein structure alignments by means of so-called EigenRank score profiles is proposed. These scores are obtained by utilizing the LeaderRank algorithm—a vertex centrality indexing scheme originally introduced to infer the opinion leading role of individual actors in social networks. The obtained EigenRank representation of a given structure is not just highly-specific, but can also be used to compute profile alignments from which three-dimensional structure alignments can be rapidly deduced. This technique thus could provide a tool to rapidly scan entire databases containing thousands of structures.
Florian Heinke, Lars Hempel, Dirk Labudde
The Role of Feature Selection in Text Mining in the Process of Discovering Missing Clinical Annotations – Case Study
Abstract
The vocabulary used by doctors to describe the results of medical procedures changes alongside new standards. Text data, which is immediately understandable to a medical professional, is difficult to use in mass-scale analysis. Extraction of data relevant to a given case, e.g. the Bethesda class, means taking on the challenge of normalizing the freeform text and all the grammatical forms associated with it. This is particularly difficult in the Polish language, where words change their form significantly according to their function in the sentence. We found common black-box methods for text mining inaccurate for this purpose. Here we describe a word-frequency-based method for annotating text data for Bethesda class extraction. We compare it with an algorithm based on the C4.5 decision tree. We show how important the choice of the method and the range of features is in order to avoid conflicting classifications. The proposed algorithms make it possible to avoid the limitations of rule-based approaches.
Aleksander Płaczek, Alicja Płuciennik, Mirosław Pach, Michał Jarząb, Dariusz Mrozek
Fuzzy Join as a Preparation Step for the Analysis of Training Data
Abstract
Analysis of training data has become an inseparable part of sports preparation, not only for professional athletes but also for sports enthusiasts and amateurs. Nowadays, smart wearables and IoT devices allow monitoring of various parameters of our physiology and activity. The intensity and effectiveness of the activity and the values of some physiology parameters may depend on the weather conditions on particular days. Therefore, for efficient analysis of training data, it is important to align training data with weather sensor data. In this paper, we show how this process can be performed with the use of the fuzzy join technique, which allows combining data points shifted in time.
Anna Wachowicz, Dariusz Mrozek
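The alignment problem the fuzzy join addresses, combining training and weather measurements taken at slightly different times, can be illustrated with a tolerance-based nearest-match join in pandas. This is a rough stand-in under assumed data, not the authors' fuzzy join, which additionally weights matches by temporal distance.

```python
# Sketch: aligning time-shifted training and weather readings (assumed data).
import pandas as pd

training = pd.DataFrame({
    "time": pd.to_datetime(["2019-05-28 10:00:05", "2019-05-28 10:01:02"]),
    "heart_rate": [128, 131],
})
weather = pd.DataFrame({
    "time": pd.to_datetime(["2019-05-28 10:00:00", "2019-05-28 10:01:00"]),
    "temperature_c": [18.2, 18.4],
})

# Nearest-match join within a 30-second tolerance window.
aligned = pd.merge_asof(
    training.sort_values("time"),
    weather.sort_values("time"),
    on="time",
    direction="nearest",
    tolerance=pd.Timedelta("30s"),
)
print(aligned)
```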

Industrial Applications

Frontmatter
On the Interdependence of Technical Indicators and Trading Rules Based on FOREX EUR/USD Quotations
Abstract
The general aim of this paper is to investigate the interdependence within a wide set (2657) of technical analysis indicators and trading rules based on daily FOREX quotations from 01.01.2004 to 20.09.2018. For the purpose of this paper, we have limited our study to EUR/USD quotations only. The most frequently used methods for FOREX behavior modeling are regression, neural networks, ARIMA, GARCH and exponential smoothing (cf. [1, 3, 6]). They are used to predict or validate inputs. Interdependence of the inputs may cause the following problems:
  • The error term is obviously not normally distributed (for regression it is heteroscedastic). Therefore we lose the main tool for the model validation.
  • R-squared becomes a useless measure.
  • The obtained model is problematic for forecasting purposes. We would normally like to forecast the probability that a certain set of independent variables creates a certain output - the FOREX observations.
  • Regression and neural network methods use the inverses of some matrices. When we use two identical variables as input, the matrix is singular. When we use some dependent variables, the matrix is “almost” singular. It leads to model instability (assuming that computations are possible at all).
  • It is meaningless to evaluate which of the two identical inputs is more significant.
Therefore the independence of inputs is a crucial problem for FOREX market investigation. This may be done directly, as shown in this paper, or by means of PCA techniques, where the inputs are mapped onto a small set of independent variables. Unfortunately, in the second case, the economic meaning is lost.
The obtained results may be treated as the base for building FOREX market models, which is one of our future goals.
Bartłomiej Kotyra, Andrzej Krajka
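One way to see the numerical instability caused by interdependent inputs, noted in the list above, is through the condition number of the input matrix. The sketch below uses synthetic indicator columns, not actual FOREX data, and is only meant to illustrate why near-duplicate inputs make regression or network training unstable.

```python
# Hedged illustration: near-duplicate indicator columns make the design matrix
# almost singular, visible in its condition number (synthetic data).
import numpy as np

n = 1000
indicator_a = np.random.standard_normal(n)
indicator_b = indicator_a + 1e-6 * np.random.standard_normal(n)  # near duplicate
indicator_c = np.random.standard_normal(n)                        # independent

X_dependent = np.column_stack([indicator_a, indicator_b, indicator_c])
X_independent = np.column_stack([indicator_a, indicator_c])

# A huge condition number signals numerically unstable model fitting.
print(np.linalg.cond(X_dependent))    # very large
print(np.linalg.cond(X_independent))  # moderate
```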
The Comparison of Processing Efficiency of Spatial Data for PostGIS and MongoDB Databases
Abstract
This paper presents the issue of geographic data storage in NoSQL databases. The authors present a performance investigation of the non-relational database MongoDB with its built-in spatial functions in relation to the PostgreSQL database with the PostGIS spatial extension. As part of the tests, the authors designed queries simulating common problems in the processing of point data. In addition, the main advantages and disadvantages of NoSQL databases are presented in the context of the ability to manipulate spatial data.
Dominik Bartoszewski, Adam Piorkowski, Michal Lupa
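A typical point-data operation of the kind compared in the paper, a radius search, can be expressed with MongoDB's built-in geospatial support. The sketch below uses pymongo with hypothetical collection and field names, and notes the roughly corresponding PostGIS predicate in a comment; it is not the authors' benchmark code.

```python
# Sketch: radius search over point data with MongoDB's geospatial support.
from pymongo import MongoClient, GEOSPHERE

client = MongoClient("mongodb://localhost:27017")
points = client["gis"]["points"]
points.create_index([("location", GEOSPHERE)])    # 2dsphere index

# Find points within 500 m of a given coordinate (longitude, latitude).
cursor = points.find({
    "location": {
        "$near": {
            "$geometry": {"type": "Point", "coordinates": [19.03, 50.26]},
            "$maxDistance": 500,
        }
    }
})
for doc in cursor:
    print(doc["_id"])

# Roughly equivalent PostGIS predicate:
#   ST_DWithin(location, ST_MakePoint(19.03, 50.26)::geography, 500)
```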
Fuzzy Modelling of the Methane Hazard Rate
Abstract
Methane hazard assessment is an important aspect of coal mining, influencing the safety of miners and work efficiency. Calculation of the methane hazard rate requires specialized equipment, which might not be fully available at the monitored place and consequently might require manual measurements by miners. The lack of measurements can also be caused by a device failure, thus hampering continuous evaluation of the methane rate. In this paper we address this problem by constructing a fuzzy system able to calculate the methane hazard rate in a continuous manner and deal with data incompleteness. We examine the effectiveness of different fuzzy clustering algorithms in our system and compare the proposed system to other state-of-the-art methods. The extensive experiments show that the proposed method achieves superior accuracy when compared to the other methods.
Dariusz Felka, Marcin Małachowski, Łukasz Wróbel, Jarosław Brodny

Networks and Security

Frontmatter
Application of Audio over Ethernet Transmission Protocol for Synchronization of Seismic Phenomena Measurement Data in order to Increase Phenomena Localization Accuracy and Enable Programmable Noise Cancellation
Abstract
Seismic phenomena, in particular underground tremors, are extremely dangerous and require activities related to their localization and prediction. The registration of high-energy phenomena is not a technological challenge at present, while recognizing their precursors as low-energy phenomena, and in particular the associated precise localization and isolation from the seismic (acoustic) background, is relatively problematic. The article presents the concept and work related to eliminating the desynchronization of measurements in the seismoacoustic band and ensuring measurement-phase consistency in order to increase the accuracy of phenomena localization and to make it possible to use standard programming tools to eliminate noise (the seismoacoustic background).
Krzysztof Oset, Dariusz Babecki, Sławomir Chmielarz, Barbara Flisiuk, Wojciech Korski
A Novel SQLite-Based Bare PC Email Server
Abstract
We describe a SQLite-based mail server that runs on a bare PC with no operating system. The mail server application is integrated with a server-based adaptation of the popular SQLite client database engine. The SQLite database is used for storing mail messages, and mail clients can send/receive email and share files using any Web browser as in a conventional system. The unique features of the bare PC SQLite-based email server include (1) no OS vulnerabilities; (2) the inability for attackers to run any other software including scripts; (3) no support for dynamic linking and execution of external code; (4) a small code footprint making it easy to analyze the code for security flaws; and (5) performance benefits due to eliminating OS overhead. We describe system design and implementation, and give details of the bare machine mail server application. This work serves as a foundation to build future bare machine servers with integrated databases that can support Internet-based collaboration in high-security environments.
Hamdan Alabsi, Ramesh Karne, Alex Wijesinha, Rasha Almajed, Bharat Rawal, Faris Almansour
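The core idea of keeping mail messages in an embedded SQLite database can be illustrated with a minimal sketch. The schema below is assumed for illustration only; it is not the authors' bare-machine implementation, which runs without an operating system.

```python
# Minimal sketch (assumed schema): storing and querying mail messages in SQLite.
import sqlite3

conn = sqlite3.connect("mail.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS messages (
        id INTEGER PRIMARY KEY,
        sender TEXT NOT NULL,
        recipient TEXT NOT NULL,
        subject TEXT,
        body TEXT,
        received_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.execute(
    "INSERT INTO messages (sender, recipient, subject, body) VALUES (?, ?, ?, ?)",
    ("alice@example.com", "bob@example.com", "Hello", "Test message"),
)
conn.commit()
for row in conn.execute(
    "SELECT id, sender, subject FROM messages WHERE recipient = ?",
    ("bob@example.com",),
):
    print(row)
```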
Building Security Evaluation Lab - Requirements Analysis
Abstract
Physical protection of a laboratory must be built according to many stringent standards, criteria, and guidelines, which are often general and difficult to apply in practice. This is unfortunate, because many organizations could save a lot of time, money, and effort if they had a way of selecting the right security measures at the beginning of the process. Introducing a simple evaluation method for safeguards into requirements analysis can dramatically facilitate the design phase of the lab’s physical security. As a result, more institutions would decide to cope with the problem of fulfilling security requirements by choosing concrete solutions within the assumed budget.
Dariusz Rogowski, Rafał Kurianowicz, Jacek Bagiński, Roman Pietrzak, Barbara Flisiuk
Backmatter
Metadata
Title
Beyond Databases, Architectures and Structures. Paving the Road to Smart Data Processing and Analysis
Edited by
Prof. Dr. Stanisław Kozielski
Dariusz Mrozek
Paweł Kasprowski
Bożena Małysiak-Mrozek
Daniel Kostrzewa
Copyright Year
2019
Electronic ISBN
978-3-030-19093-4
Print ISBN
978-3-030-19092-7
DOI
https://doi.org/10.1007/978-3-030-19093-4