
2018 | Book

Beyond Databases, Architectures and Structures. Facing the Challenges of Data Proliferation and Growing Variety

14th International Conference, BDAS 2018, Held at the 24th IFIP World Computer Congress, WCC 2018, Poznan, Poland, September 18-20, 2018, Proceedings

Editors: Stanisław Kozielski, Dariusz Mrozek, Paweł Kasprowski, Bożena Małysiak-Mrozek, Daniel Kostrzewa

Publisher: Springer International Publishing

Book Series: Communications in Computer and Information Science

About this book

This book constitutes the refereed proceedings of the 14th International Conference entitled Beyond Databases, Architectures and Structures, BDAS 2018, held in Poznań, Poland, in September 2018, during the IFIP World Computer Congress.

It consists of 38 carefully reviewed papers selected from 102 submissions. The papers are organized in topical sections, namely: big data and cloud computing; architectures, structures and algorithms for efficient data processing; artificial intelligence, data mining and knowledge discovery; text mining, natural language processing, ontologies and semantic web; image analysis and multimedia mining; and data mining applications.

Table of Contents

Frontmatter

Big Data and Cloud Computing

Frontmatter
Exploring Spark-SQL-Based Entity Resolution Using the Persistence Capability

Entity Resolution (ER) is the task of identifying records that refer to the same real-world entities. A naive way to solve ER tasks is to calculate similarities over the Cartesian product of all records, which is called pair-wise ER and leads to quadratic time complexity. Faced with exploding data volumes, pair-wise ER is challenged to achieve high efficiency and scalability. To tackle this challenge, parallel computing has been proposed to speed up the ER process. Due to the difficulty of distributed programming, big data processing frameworks are often used as tools to ease the realization of parallel ER, supporting data partitioning, workload balancing, and fault tolerance. However, the efficiency and scalability of parallel ER are also influenced by the adopted framework. In the area of parallel ER, the adoption of Apache Spark, a general framework supporting in-memory computation, has still not been widely studied. Furthermore, although Apache Spark provides both low-level (RDD-based) and high-level (Dataset-based) APIs, to date only RDD-based APIs have been adopted in parallel ER research. In this paper, we have implemented a Spark-SQL-based ER process and explored its persistence capability to assess the performance benefits. We have evaluated its speedup and compared its efficiency to Spark-RDD-based ER. We observed that different persistence options have a large impact on the efficiency of Spark-SQL-based ER, requiring careful consideration when choosing one. By adopting the best persistence option, the efficiency of our Spark-SQL-based ER implementation is improved by up to 3 times on different datasets over a baseline without any persistence option or with misconfigured persistence.

Xiao Chen, Roman Zoun, Eike Schallehn, Sravani Mantha, Kirity Rapuru, Gunter Saake
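
As an illustration of the setting studied above (not the authors' code), the following PySpark sketch performs a naive pair-wise comparison over a DataFrame and applies one of Spark's persistence options; the input path, column names and similarity threshold are assumptions.

```python
# Minimal PySpark sketch of pair-wise ER over a DataFrame with an explicit
# persistence option (hypothetical input file and column names).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, levenshtein
from pyspark import StorageLevel

spark = SparkSession.builder.appName("spark-sql-er-sketch").getOrCreate()

records = spark.read.parquet("records.parquet")          # columns: id, name
records = records.persist(StorageLevel.MEMORY_AND_DISK)  # persistence option under study

# Cartesian self-join (pair-wise ER), keeping each unordered pair once.
pairs = (records.alias("a")
         .crossJoin(records.alias("b"))
         .where(col("a.id") < col("b.id"))
         .withColumn("dist", levenshtein(col("a.name"), col("b.name"))))

matches = pairs.where(col("dist") <= 2)   # toy similarity threshold
print(matches.count())
```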
The Use of Distributed Data Storage and Processing Systems in Bioinformatic Data Analysis

Cancer and cancer mortality may seem to be a sign of the present times, which leads hundreds of scientists to address the problem of finding significant premises of cancer occurrence. In this paper, a set of data mining tasks is defined that joins observed gene mutations with the observation of a specific cancer type. Due to the high computational complexity of this kind of data, a Hadoop ecosystem cluster was developed to perform the required calculations. The results may be considered satisfactory in the domains of distributed data storage and processing as well as the interpretation of gene mutation occurrence.

Michał Bochenek, Kamil Folkert, Roman Jaksik, Michał Krzesiak, Marcin Michalak, Marek Sikora, Tomasz Stȩclik, Łukasz Wróbel
Efficient 3D Protein Structure Alignment on Large Hadoop Clusters in Microsoft Azure Cloud

Exploration of 3D protein structures provides broad potential for possible applications of its results in medical diagnostics, drug design, and treatment of patients. 3D protein structure similarity searching is one of the important exploration processes performed in structural bioinformatics. However, the process is time-consuming and requires increased computational resources when performed against large repositories. In this paper, we show that 3D protein structure similarity searching can be significantly accelerated by using modern processing techniques and computer architectures. Results of our experiments prove that by distributing computations on large Hadoop/HBase (HDInsight) clusters and scaling them out and up in the Microsoft Azure public cloud, we can reduce the execution times of similarity search processes from hundreds of hours to minutes. We also show that the utilization of public clouds to perform scientific computations is very beneficial and can be successfully applied when scaling time-consuming computations over a mass of biological data.

Bożena Małysiak-Mrozek, Paweł Daniłowicz, Dariusz Mrozek
EYE: Big Data System Supporting Preventive and Predictive Maintenance of Robotic Production Lines

This paper presents EYE, a data storage and analysis system. EYE is a platform for gathering and processing data coming from production lines. It was developed on the basis of Big Data technology and allows not only processing of streaming data but also performing batch analyses. The results of data processing are presented in the form of reports and dashboards. The work contains a case study presenting an implementation of the system on a production line used for the production of telemetric devices.

Jarosław Kurpanik, Joanna Henzel, Marek Sikora, Łukasz Wróbel, Marek Drewniak

Architectures, Structures and Algorithms for Efficient Data Processing

Frontmatter
SINGLE vs. MapReduce vs. Relational: Predicting Query Execution Time

Over the past decades, several new concepts emerged to organize and query data over large Data Warehouse (DW) systems with the same primary objective, that is, to optimize processing speed. More recently, with the rise of the Big Data concept, storage cost lowered significantly and performance (random accesses) increased, particularly with modern SSD disks. This paper introduces and tests a storage alternative that goes against current data normalization premises, where storage space is no longer a concern. By de-normalizing the entire data schema (transparently to the user), we propose a new system, called SINGLE, in which query execution time is entirely predictable, independently of query complexity. The proposed data model also allows easy partitioning and distributed processing to enable execution parallelism, boosting performance, as happens in MapReduce. The TPC-H benchmark is used to evaluate storage space and query performance. Results show predictable performance when comparing with approaches based on a normalized relational schema and with MapReduce-oriented processing.

Maryam Abbasi, Pedro Martins, José Cecílio, João Costa, Pedro Furtado
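
The core de-normalization idea can be sketched in a few lines of pandas (an illustration only, with made-up TPC-H-like tables, not the SINGLE implementation): the schema is joined once into a single wide table, so that every subsequent query reduces to a predictable scan.

```python
# Minimal pandas sketch of the de-normalization idea: join the normalized
# tables once, up front, so later queries are single scans over one wide table.
import pandas as pd

customer = pd.DataFrame({"custkey": [1, 2], "nation": ["PL", "DE"]})
orders   = pd.DataFrame({"orderkey": [10, 11], "custkey": [1, 2]})
lineitem = pd.DataFrame({"orderkey": [10, 10, 11], "price": [5.0, 7.5, 3.0]})

# One-off de-normalization (storage is traded for predictable query time).
wide = lineitem.merge(orders, on="orderkey").merge(customer, on="custkey")

# Any query is now a filter + aggregation over a single table scan.
print(wide.loc[wide["nation"] == "PL", "price"].sum())
```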
EvOLAP Graph – Evolution and OLAP-Aware Graph Data Model

The objective of this paper is to propose a graph model that would be suitable for providing OLAP features on graph databases. The included features allow for a multidimensional and multilevel view on data and support analytical queries on operational and historical graph data. In contrast to many existing approaches tailored for static graphs, the paper addresses the issue of a changing graph schema. The model, named Evolution and OLAP-aware Graph (EvOLAP Graph), has been implemented on a time-based, versioned property graph in the Neo4j graph database.

Ewa Guminska, Teresa Zawadzka
Entropy Aware Adaptive Compression for SQL Column Stores

With the advent of SQL column stores, compression has gained renewed interest and drawn considerable attention from both academia and industry. Unlike row stores, column stores use lightweight compression methods and, generally, compression granularity is at the level of entire columns. In this paper we outline and explore an alternative compression strategy for column stores that works at a different granularity and adapts itself to data, on-the-fly, using a compression planner. The approach yields good compression ratios, facilitates compression during bulk data load, and also mitigates some issues that arise from having to maintain global metadata on compression. We describe its implementation in the analytics database dbX, a cloud-agnostic, columnar MPP SQL product, and present experimental results.

K. T. Sridhar, Jimson Johnson
SIMD Acceleration for Main-Memory Index Structures – A Survey

Index structures designed for disk-based database systems do not fulfill the requirements of modern database systems. To improve the performance of these index structures, different approaches have been presented by several authors, including horizontal vectorization with SIMD and efficient cache-line usage. In this work, we compare the adapted index structures Seg-Tree/Trie, FAST, VAST, and ART and evaluate the usage of SIMD within these. We extract important criteria of these adaptations and weight them according to their impact on performance. As a result, we infer adaptations that are promising for our own index structure Elf.

Marten Wallewein-Eising, David Broneske, Gunter Saake
OpenMP as an Efficient Method to Parallelize Code with Dense Synchronization

In recent years, adding new cores and threads has been the main method of increasing computational power. In line with this approach, in this paper we analyze the efficiency of the parallel computational model with shared memory when dense synchronization is required. As our experimental evaluation shows, contemporary CPUs assisted by the OpenMP library perform well in such tasks. We also present evidence that OpenMP is easy to learn and use.

Rafał Bocian, Dominika Pawłowska, Krzysztof Stencel, Piotr Wiśniewski
Memory Management Strategies in CPU/GPU Database Systems: A Survey

GPU-accelerated in-memory database systems have gained a lot of popularity over the last several years. However, GPUs have limited memory capacity, and the data to process might not fit into the GPU memory entirely and cause a memory overflow. Fortunately, this problem has many possible solutions, like splitting the data and processing each portion separately, or storing the data in the main memory and transferring it to the GPU on demand. This paper provides a survey of four main techniques for managing GPU memory and their applications for query processing in cross-device powered database systems.

Iya Arefyeva, David Broneske, Gabriel Campero, Marcus Pinnecke, Gunter Saake
Formulation of Composite Discrete Measures for Estimating Uncertainties in Probabilistic Databases

Probabilistic databases contain large datasets embedded with noise and uncertainties in data association rules and queries. Data identification and interpretation in probabilistic databases require probabilistic models for data clustering and query processing. Thus, the associated probability measures are required to be heterogeneous as well as computable. This paper proposes a formal model of composite discrete measures in metric spaces intended for probabilistic databases. The proposed composite measures are computable and cover real as well as complex spaces. The spaces of discrete measures are constructed on continuous smooth functions. This paper presents the construction of the formal model and computational evaluations of discrete measures following different functions of varying linearity and smoothness. Furthermore, a special monotone class of the composite discrete measure is presented using analytical formulation. The condensation measure of a uniform contraction map is constructed. The proposed model can be employed to computationally estimate uncertainties in probabilistic databases.

Susmit Bagchi
Impact of Storage Space Configuration on Transaction Processing Performance for Relational Database in PostgreSQL

An information system often uses a relational database as a data store. One of the reasons for the popularity of relational databases is transaction processing, which helps to preserve data consistency. The configuration of storage space in a database management system has a significant influence on the efficiency of transaction processing, which is crucial to workload processing in an information system. The choice of block device and filesystem for local storage in a database management system affects transaction performance in relational databases. This paper shows the impact that using a modern hard drive versus a solid-state drive has on database transaction efficiency. It also compares database performance when the relational database is stored in volatile memory. Finally, it demonstrates how the selection of filesystem type for DBMS local storage influences transaction efficiency in the supported databases. In this research, PostgreSQL was used as a powerful, open-source relational database management system, installed and configured in a GNU/Linux operating system.

Mateusz Smolinski

Artificial Intelligence, Data Mining and Knowledge Discovery

Frontmatter
Optimization of Approximate Decision Rules Relative to Length

In the paper, a modified dynamic programming approach for the optimization of decision rules relative to length is studied. Experimental results connected with the length of approximate decision rules, the size of a directed acyclic graph, and the accuracy of classifiers are presented.

Beata Zielosko, Krzysztof Żabiński
Covering Approach to Action Rule Learning

Action rules specify recommendations which should be followed in order to transfer objects to the desired decision class. This paper presents a proposal of a novel method for induction of action rules directly from a dataset. The proposed algorithm follows the so-called covering schema and employs a pruning procedure, thus being able to produce comprehensible rule sets. An experimental study shows that the proposed method is able to discover strong actions of superior accuracy.

Paweł Matyszok, Marek Sikora, Łukasz Wróbel
Genetic Selection of Training Sets for (Not Only) Artificial Neural Networks

Creating high-quality training sets is the first step in designing robust classifiers. However, it is fairly difficult in practice when the data quality is questionable (data is heterogeneous, noisy and/or massively large). In this paper, we show how to apply a genetic algorithm for evolving training sets from data corpora, and exploit it for artificial neural networks (ANNs) alongside other state-of-the-art models. ANNs have proved very successful in tackling a wide range of pattern recognition tasks. However, they suffer from several drawbacks, with the selection of an appropriate network topology and training sets being among the most challenging in practice, especially when ANNs are trained using time-consuming back-propagation. Our experimental study (coupled with statistical tests), performed for both real-life and benchmark datasets, proved the applicability of a genetic algorithm to select training data for various classifiers which then generalize well to unseen data.

Jakub Nalepa, Michal Myller, Szymon Piechaczek, Krzysztof Hrynczenko, Michal Kawulok
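
A minimal sketch of the general idea, not the authors' algorithm: a genetic algorithm evolves bit masks over a data corpus, scoring each candidate training subset by the validation accuracy of a classifier trained on it (synthetic data; the operators, parameters and MLP stand-in are assumptions).

```python
# Genetic selection of training subsets: individuals are boolean masks over the corpus.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
pop = rng.random((12, len(X_tr))) < 0.2            # initial population of subset masks

def fitness(mask):
    if mask.sum() < 10:
        return 0.0
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=300, random_state=0)
    clf.fit(X_tr[mask], y_tr[mask])
    return clf.score(X_val, y_val)                 # validation accuracy drives selection

for _ in range(5):                                 # a few generations
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-6:]]         # truncation selection
    children = []
    for _ in range(len(pop)):
        a, b = parents[rng.integers(6, size=2)]
        cut = rng.integers(1, len(X_tr))
        child = np.concatenate([a[:cut], b[cut:]]) # one-point crossover
        flip = rng.random(len(child)) < 0.01       # bit-flip mutation
        children.append(np.where(flip, ~child, child))
    pop = np.array(children)

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected training samples:", int(best.sum()))
```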
Decision Trees as Interpretable Bank Credit Scoring Models

We evaluate several approaches to classification of loan applications that provide their final results in the form of a single decision tree, i.e., in the form widely regarded as interpretable by humans. We apply state-of-the-art credit scoring-oriented classification algorithms, such as logistic regression, gradient boosting decision trees and random forests, as components of the proposed algorithms of decision tree building. We use four real-world loan default prediction data sets of different sizes. We evaluate the proposed methods using the area under the receiver operating characteristic curve (AUC) but we also measure the models’ interpretability. We verify the significance of differences between AUC values observed when using the compared techniques by measuring Friedman’s statistic and performing Nemenyi’s post-hoc test.

Andrzej Szwabe, Pawel Misiorek
Comparison of Selected Fusion Methods from the Abstract and Rank Levels in a System Using Pawlak’s Approach to Coalition Formation

In this paper, a decision system that uses dispersed knowledge is considered. In particular, an ensemble of classifiers is discussed in which the relations between classifiers are analyzed and coalitions of classifiers are formed. In a previous work, the use of Pawlak’s conflict model to create such coalitions was proposed. In this paper, four fusion methods are used in this system – two from the abstract level and two from the rank level. The results obtained using these four methods are compared and some conclusions are presented.

Małgorzata Przybyła-Kasperek
The Classification of Music by the Genre Using the KNN Classifier

The article presents the possibility of classifying music tracks according to their musical genre. This issue is interesting because it is difficult to find solutions that, as in this work, look for similarity between songs based on their waveforms. This article shows that such a classification is possible. For this process, the KNN classifier was used, for which different metrics (metric spaces) can be applied. The article shows the validity of testing different distance measures in the classification process. The analysis of music tracks and their assignment to the appropriate genre is carried out on the basis of attributes describing the music track. These attributes are obtained using the jAudio library. Further research in this area may allow finding other suitable music not only on the basis of historical data about the user (what they were listening to, along with the music track) but also directly on the basis of the genre of a given song.

Daniel Kostrzewa, Robert Brzeski, Maciej Kubanski
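
The core experiment can be illustrated with scikit-learn (toy random features standing in for the jAudio attributes; not the authors' code): the same kNN classifier is evaluated under several distance metrics.

```python
# kNN genre classification under different distance metrics (placeholder data).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))            # 200 tracks x 30 audio features (toy data)
y = rng.integers(0, 4, size=200)          # 4 genre labels (toy data)

for metric in ("euclidean", "manhattan", "chebyshev", "cosine"):
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    acc = cross_val_score(knn, X, y, cv=5).mean()
    print(f"{metric:10s} accuracy = {acc:.3f}")
```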

Text Mining, Natural Language Processing, Ontologies and Semantic Web

Frontmatter
An Interactive Knowledge Maintenance Algorithm for Recasting WordNet Synonym-Set Definitions into Lojbanic Primitives, then into Lojbanic English

Lojban, a constructed interlingua, has a small number of predicates, totaling 1342. In our prior independent work, involving lojban for machine translation in parallel, when converting English into lojbanic English (our extensions), the number of test cases was too small because of the lack of a rich lojban vocabulary. There is a procedure for creating new, composite lojban predicates in terms of existing predicates. Although workable, it lacks an architectural framework that is consistent with English word-sense usage and enables efficient production of thousands of new predicates. To address lojban’s insufficient vocabulary, we developed an interactive algorithm which recasts WordNet synonym-set (synset) definitions into existing lojban primitive predicates. The output is in terms of our lojbanic English. The final output is a new dictionary of synset definitions, with lojban semantics, but using lojbanic English. The linguistic engineer interactively lojbanizes a synset’s definition. Thus, the synsets are well defined and unambiguous, and there are no circular definitions and no confusion, because lojban primitives have these properties. If a relevant subset, e.g., 1/10, of the unique synset definitions, totaling 116718, is converted into lojban predicates, then 1.945 man-years would be required for this effort. The algorithm can be run in parallel on distinct subsets, employing many users.

Luke Immes, Haim Levkowitz
Tensor-Based Ontology Data Processing for Semantic Service Matchmaking

In this paper, we present a new application of multilinear data processing to Semantic Web Service matchmaking that is based on the Covariance-Matrix-based Filtering (CMF) algorithm and ontology data representation. We show the advisability of integrated algebraic modeling of lexical data derived from web service descriptions and the corresponding ontology-based semantic data. The experimental evaluation results indicate the superiority of the covariance-based tensor filtering method over other state-of-the-art tensor processing methods, as well as the advantages of using the proposed ontology data representation.

Andrzej Szwabe, Paweł Misiorek, Michał Ciesielczyk, Jarosław Bąk
Metadata Reconciliation for Improved Data Binding and Integration

Data integration has been a consistent concern in Linked Open Data (LOD) research. The data integration problem (DIP) depends upon many factors. Primarily, the nature and type of the datasets guide the integration process. Every day, the demand for open and improved data visualization is increasing. Organizations, researchers and data scientists all require improved techniques for data integration that can be used for analytics and predictions. The scientific community has been able to construct meaningful solutions by using the power of metadata. Metadata is powerful if it is properly guided. There are several existing methodologies that improve system semantics using metadata. However, data integration between heterogeneous resources, for example structured and unstructured data, is still a far-fetched reality. Metadata can not only improve but effectively increase semantic search performance if properly reconciled with the available information or standard data. In this paper, we present a metadata reconciliation strategy for improving data integration and data classification between data sources that correspond to a certain standard of similarity. Data similarity can be deployed as a powerful tool for linked data operations. Data publishing and connection over the LOD can effectively be improved using reconciliation strategies. In this paper, we also briefly define the procedure of reconciliation that can semi-automate the interlinking and validation process for publishing linked data as an integrated resource.

Hiba Khalid, Esteban Zimanyi, Robert Wrembel
Full-Text Search Extensions for JSON Documents: Design Goals and Implementations

One of the main advantages of JSON (JavaScript Object Notation) is that it can represent structured as well as semi-structured data at the same time. Although the existing SQL/JSON standard specifies how queries related to exact search can be performed, it lacks support for queries concerning full-text search. In this article we propose a set of design goals that a full-text search extension of the SQL/JSON language should support. Additionally, the given set is valid for any full-text search language for JSON documents. We also discuss full-text language extensions for JSON documents implemented in relational database systems and answer the question to what extent these extensions are supported by those systems.

Dušan Petkovic
How Poor Is the “Poor Man’s Search Engine”?

The modern world generates huge amounts of documents each day. Text data is ubiquitous in the digital space. It can contain information about products in an online store, the opinions of a blog author, reportage in a newspaper, or questions and advice from online forums. Most of this data is managed using DBMSs - mainly relational ones. Thus, it becomes all the more crucial to find the most efficient use of the available text search mechanisms. This work examines the basic word search methods in two of the most popular open-source DBMSs: PostgreSQL and MariaDB. The results of the empirical tests serve as a starting point for a discussion: is the “Poor Man’s Search Engine” SQL antipattern still an antipattern?

Marta Burzańska, Piotr Wiśniewski
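
The antipattern in question can be contrasted with PostgreSQL's built-in full-text search in a few lines; the sketch below is illustrative only, issued from Python via psycopg2 against a hypothetical articles table, and is not taken from the paper.

```python
# "Poor Man's Search Engine" (LIKE '%...%') vs. PostgreSQL full-text search.
import psycopg2

conn = psycopg2.connect("dbname=docs")   # placeholder connection string
cur = conn.cursor()

# Antipattern: wildcard pattern matching, which forces a sequential scan.
cur.execute("SELECT id FROM articles WHERE body LIKE %s", ("%database%",))

# Full-text alternative: tsvector/tsquery match, which can use a GIN index.
cur.execute(
    "SELECT id FROM articles "
    "WHERE to_tsvector('english', body) @@ plainto_tsquery('english', %s)",
    ("database",))
print(cur.fetchall())
```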

Image Analysis and Multimedia Mining

Frontmatter
Computer Software for Selected Plant Species Segmentation on Airborne Images

The usage of manned and unmanned flying devices for aerial image acquisition is becoming more and more popular nowadays. Since unmanned aerial vehicles (aka drones) are capable of easy and efficient gathering of spatial information, there is a growing need for efficient tools supporting its analysis and processing. The software solution presented in this article is aimed at the identification and characterization of selected plant species on industrial waste dumps based on aerial image analysis. The software uses the back projection method for segmentation of areas covered by Solidago canadensis (goldenrod), which is known to be invasive. The software assists the user in segmenting areas covered by the identified plants and characterizing their parameters. The application implements selected methods helping in image preprocessing, segmentation, weed identification and calculation of shape parameters of the segmented areas. The Solidago canadensis segmentation tests were performed with satisfactory results.

Sebastian Iwaszenko, Marcin Kelm
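
A minimal OpenCV sketch of histogram back projection, the segmentation method named above (file names and the threshold are placeholders, not the authors' implementation):

```python
# Histogram back projection: build a hue-saturation histogram of a sample patch
# of the target plant and project it back onto the full aerial image.
import cv2

scene  = cv2.imread("aerial.png")           # full aerial image (placeholder)
sample = cv2.imread("goldenrod_patch.png")  # small patch of the target plant (placeholder)

scene_hsv  = cv2.cvtColor(scene,  cv2.COLOR_BGR2HSV)
sample_hsv = cv2.cvtColor(sample, cv2.COLOR_BGR2HSV)

hist = cv2.calcHist([sample_hsv], [0, 1], None, [30, 32], [0, 180, 0, 256])
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
backproj = cv2.calcBackProject([scene_hsv], [0, 1], hist, [0, 180, 0, 256], 1)

_, mask = cv2.threshold(backproj, 50, 255, cv2.THRESH_BINARY)  # toy threshold
cv2.imwrite("goldenrod_mask.png", mask)
```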
Automatic Segmentation of Corneal Endothelium Images with Convolutional Neural Network

A fully automatic segmentation of corneal endothelial images is addressed in this paper. It can find application in medicine, removing the burden of manual annotation from physicians and allowing for faster patient diagnosis. The proposed system is based on the pre-trained convolutional neural network AlexNet and uses a transfer learning methodology to build a system for the delineation of endothelial cells. The training is based on the classification of small image patches which represent the cell body or cell border class. On the validation set, 99% correct classification accuracy and F1 score were achieved. Exploiting this network in a system configured for segmentation provided very good detection of cell bodies and, supported by best-fit skeletonization, allowed cell borders to be located precisely.

Karolina Nurzynska
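
A minimal PyTorch sketch of the transfer-learning setup described above, assuming the torchvision distribution of pre-trained AlexNet; the two-class head, input size and frozen feature extractor are illustrative choices, not the paper's exact configuration.

```python
# Pre-trained AlexNet with its last layer replaced by a two-class
# (cell body vs. cell border) output; data loading and training are omitted.
import torch
import torch.nn as nn
from torchvision import models

model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
for p in model.parameters():               # freeze the pre-trained feature extractor
    p.requires_grad = False
model.classifier[6] = nn.Linear(4096, 2)   # new 2-class head for image patches

patch = torch.randn(1, 3, 224, 224)        # placeholder patch tensor
logits = model(patch)
print(logits.shape)                        # torch.Size([1, 2])
```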
A Practical Application of Skipped Steps DWT in JPEG 2000 Part 2-Compliant Compressor

In this paper, we evaluate the effects of applying the fixed skipped steps discrete wavelet transform (fixed SS-DWT) variants in lossless compression that is compliant with part 2 of the JPEG 2000 standard. Compared to results obtained previously using a modified JPEG 2000 part 1 compressor, for a large and diverse set of test images, we found that the extensions of part 2 of the standard allow further bitrate improvements. We experimentally confirmed that the fixed SS-DWT variants may be obtained in compliance with the standard, and we identified practical JPEG 2000 part 2-compliant compression schemes with various trade-offs between the bitrate improvement and the compression process complexity.

Roman Starosolski
Optimal Parameter Search for Colour Normalization Aiding Cell Nuclei Segmentation

Automatic segmentation of biological images is necessary to allow faster diagnosis of several diseases. There are numerous methods addressing this problem, yet no general solution has been proposed. This probably results from the lack of standardization in tissue staining with hematoxylin and eosin, which is necessary to better visualize cell structure. Colour space normalization seems to be a perfect solution, but choosing adequate parameters is still a difficult task. Therefore, in this work, a Monte Carlo simulation method is applied to search for a set of parameters assuring the best performance of the colour transfer normalization technique. The segmentation accuracy is evaluated for each parameter set on a dataset containing colon tissue. Three accuracy metrics are computed to compare manually prepared masks with those achieved automatically: the Dice coefficient, specificity, and sensitivity. The analysis of the aggregated results proved that it is possible to find a sub-space where the worst results are placed and, depending on the accuracy measure, it is possible to find a plane dividing those results.

Karolina Nurzynska
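
The parameter search itself can be sketched as plain Monte Carlo sampling; the toy objective below (a two-parameter thresholding scored by the Dice coefficient on synthetic data) only stands in for the colour transfer normalization and segmentation pipeline evaluated in the paper.

```python
# Monte Carlo search over a parameter space, scored by the Dice coefficient.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: a grey-level "tissue" image and its ground-truth mask.
image = rng.random((128, 128))
truth = image > 0.7

def dice(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-9)

def segment(img, low, high):
    """Toy segmentation controlled by two parameters."""
    return (img > low) & (img < high)

best = (0.0, None)
for _ in range(2000):                      # random sampling of the parameter space
    low, high = sorted(rng.uniform(0.0, 1.0, size=2))
    score = dice(segment(image, low, high), truth)
    if score > best[0]:
        best = (score, (low, high))
print("best Dice %.3f at params %s" % best)
```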
B4MultiSR: A Benchmark for Multiple-Image Super-Resolution Reconstruction

Super-resolution reconstruction (SRR) methods consist in processing single or multiple images to increase their spatial resolution. Deployment of such techniques is particularly important when high-resolution image acquisition is associated with high cost or risk, as in medical or satellite imaging. Unfortunately, the existing SRR techniques are not sufficiently robust to be deployed in real-world scenarios, and no real-life benchmark to validate multiple-image SRR has been published so far. As gathering a set of images presenting the same scene at different spatial resolutions is not a trivial task, the SRR methods are evaluated based on different assumptions, employing various metrics and datasets, often without using any ground-truth data. In this paper, we introduce a new multi-layer benchmark dataset for systematic evaluation of multiple-image SRR techniques with particular reference to satellite imaging. We hope that the new benchmark will help researchers improve the state of the art in SRR, making it suitable for real-world applications.

Daniel Kostrzewa, Łukasz Skonieczny, Paweł Benecki, Michał Kawulok
Deep Learning Features for Face Age Estimation: Better Than Human?

Deep convolutional neural networks have the ability to infer highly representative features from data. We decided to use this power for the purpose of estimating the human face age from a single colour image. We trained a Support Vector Machine regression model on raw feature vectors from the FaceNet deep neural network pretrained for face recognition. Our proposed method is a simple but effective FaceNet extension which does not need large-scale data. In order to measure the accuracy of our approach, we proposed a test procedure on the FACES database for which we achieved a mean absolute error of 5.18 years and a mean error of 0.09 years. Then, we conducted an experiment employing 78 students and showed that our method outperforms humans for faces in the regular upright orientation. For vertically inverted faces, we reported an age underestimation trend both in the responses of students and in our method.

Krzysztof Kotowski, Katarzyna Stapor
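
A minimal scikit-learn sketch of the regression stage: support vector regression fitted on fixed-length face embeddings, with random vectors standing in for FaceNet features and mean absolute error as the quality measure (illustrative only, not the authors' code).

```python
# SVR on face embeddings with MAE as the quality measure (placeholder data).
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 128))          # 300 faces x 128-D embeddings (placeholder)
y = rng.uniform(18, 80, size=300)        # ages in years (placeholder)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
svr = SVR(kernel="rbf", C=10.0).fit(X_tr, y_tr)
print("MAE: %.2f years" % mean_absolute_error(y_te, svr.predict(X_te)))
```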
The Use of Minimal Geometries in Automated Building Generalization

Building generalization is one of the most difficult components of automatic mapping. The complexity of this process is related to the fact that, in addition to the algorithms used to simplify geometric structure, we must also take into account procedures that maintain the topological relations of the neighborhood. Nevertheless, the choice of the correct simplification method is a crucial task. Therefore, this article presents two new simplification algorithms designed by the authors: Area- and Orientation-Maintained Rectangle (AaOMR) and Topological-Diagonal Maxima (TDM). The two new methods and three commonly used ones, Minimum Bounding Rectangle by Width (RbW), Minimum Bounding Rectangle by Area (RbA), and Building Envelope (E), were compared to each other. The research tests of these algorithms cover a comparison of several parameters: centroid shift, change in area, minimal width, and displacement of vertices. Additionally, the proposed algorithms are attached to this article as ready-to-use GIS toolboxes.

Michał Lupa, Stanisław Szombara, Krystian Kozioł, Michał Chromiak

Data Mining Applications

Frontmatter
Mini-expert Platform for Pareto Multi-objective Optimization of Geophysical Problems

In this paper, a mini-expert platform for joint inversion is presented. The Pareto inversion scheme was applied to eliminate typical problems of this kind of inversion, such as arbitrarily chosen target function weights and laborious interactivity. Particle Swarm Optimization was used as the main optimization engine. The presented solution is written entirely in JavaScript and provides easy access to core system functions, even for non-technical users. As an example, a geophysical problem of joint inversion of surface waves was chosen, but the solution is capable of inverting any kind of data as long as two or more target functions can be provided. All obtained results were compared, in terms of both results and efficiency, with software written by the authors in C.

Adrian Bogacz, Tomasz Danek, Katarzyna Miernik
A CANoe-Based Approach for Receiving XML Data over the Ethernet

Although the use of the XML format allows many interoperability problems to be solved, it is not a common solution in automotive electronics. Most development systems used in the automotive industry are based on raw data instead of structured information. The paper presents an extension of the CANoe software with an XML-based communication interface, used to receive and process measurement data from an Autonomous Mobile Platform through a specialized database system. The solution provides interoperability in access to the measurement data as well as wide possibilities to use it for visualisation, historical analysis, and assessment of the quality of work of the platform.

Marek Drewniak, Marcin Fojcik, Damian Grzechca, Michal Kruk
Expert System Supporting the Diagnosis of the Wind Farm Equipments

An expert system that supports the diagnosis of wind farm equipment is presented in this paper. First, functional and diagnostic models were created for two basic elements of farms: the wind power plant (turbine) and the electrical substation. Next, the aforementioned elements were divided into internal structure components (blocks), and diagnostic signals were determined for them. Based on these signals and their properties, a set of input data, parameters and an expert knowledge base in the form of facts and rules were developed. At further stages, the inference process of the expert system was characterized, and the individual paths of obtaining a diagnosis of the working condition of wind farm devices were described. Additionally, the graphical user interface was discussed, and the manner of presenting the inference process results was explained for general and detailed diagnoses.

Dariusz Bernatowicz, Stanisław Duer, Paweł Wrzesień
The Diagnostic System with an Artificial Neural Network for Identifying States in Multi-valued Logic of a Device Wind Power

The present article covers the examination of the value of the k-th logic of diagnostic information related to the assessment of the states of complex technical items. For this purpose, an intelligent diagnostic system was presented whose particular property is the possibility to select any k-th logic of inference from the set {k = 4, 3, 2}. An important part of this study is the presentation of the theoretical grounds that describe the idea of inference in the multi-valued logic examined. Furthermore, it was demonstrated that the permissible range of the values of the properties of diagnostic signals constitutes the basis of the classification of states in multi-valued logic in the DIAG 2 diagnostic system. For this purpose, a procedure for the classification of states in selected values of multi-valued logic was presented and described. An important element in the functioning of diagnostic systems, i.e., the inference module, was presented as well. The rules of diagnostic inference, based on which the process of inference is realized in the system, were characterized and described.

Stanisław Duer, Dariusz Bernatowicz, Paweł Wrzesień, Radosław Duer
Experimental Measurements of the Packet Burst Ratio Parameter

In computer networking, the burst ratio is a parameter of the packet loss process, containing information about the tendency of losses to occur in blocks, rather than as separate units. Its value is especially important for real-time multimedia transmissions. In this paper, we report measurements of the value of this parameter carried out in a networking laboratory. These measurements involved high volumes of traffic, different numbers of flows, different TCP/UDP traffic proportions and different packet sizes. In every case, a high value of the burst ratio was obtained. This is an experimental confirmation of the conjecture that the buffering mechanisms commonly used in contemporary networks cause packet losses to group together.

Dominik Samociuk, Andrzej Chydzinski, Marek Barczyk
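
For reference, one common definition of the burst ratio is the mean length of observed loss bursts divided by the mean burst length expected for independent (Bernoulli) losses with the same loss probability, i.e. 1/(1 - p); the sketch below computes it from a binary loss trace (an illustration, not the paper's measurement procedure).

```python
# Burst ratio of a loss trace: observed mean burst length vs. the Bernoulli expectation.
import numpy as np

def burst_ratio(losses):
    losses = np.asarray(losses, dtype=bool)    # True = packet lost
    p = losses.mean()
    if p in (0.0, 1.0):
        return float("nan")
    # Lengths of consecutive runs of losses.
    runs, current = [], 0
    for lost in losses:
        if lost:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return np.mean(runs) / (1.0 / (1.0 - p))   # values above 1 indicate bursty losses

print(burst_ratio([0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0]))  # about 1.97
```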
ALMM Solver - Idea of Algorithm Module

The aim of the paper is to propose the architecture of the algorithm module of an IT tool named ALMM Solver. It is a framework for solving collective decision-making problems. The solver belongs to the group of applications based on a specialized problem model, which provides solutions (exact or approximate) to NP-hard discrete optimization problems using artificial intelligence methods. It is based on the methodology of the algebraic-logical meta-model of discrete decision processes. The article presents the principles of cooperation between the Algorithm Module and the other components. Hot spots, as specific locations for extending the SimOpt framework, are also presented.

Edyta Kucharska, Krzysztof Ra̧czka
Improved Data Analysis, a Step Towards Factory 4.0 - A Preliminary Study in a Car Assembly Plant

Effective data analysis is one of the key characteristics of the Smart Factory, a term that comes from the concept of Industry 4.0 currently being discussed worldwide. This paper presents an attempt to introduce data mining methods for improved data analysis in a car assembly plant. The presented pilot study, on the example of the wheel alignment adjustment process, aims to find correlations between earlier production data and the results at the end of the assembly line for process improvement and problem-solving support. Preliminary findings, along with expected results and benefits, are provided. Finally, directions and issues for further research are presented.

Mariusz Rodzen
Biometric Identification Using Gaze and Mouse Dynamics During Game Playing

The paper presents a method developed for identifying people based on their mouse and gaze dynamics between two mouse clicks. The data used to evaluate the method was collected while participants were playing a simple shooting game. Various statistics were calculated, taking mouse and gaze speed and acceleration into account. Twenty-four participants took part in the experiment conducted to check whether the proposed method may be applied for identification and authentication purposes. Although the obtained averaged results (EER 11% and F1-score 90%) showed that statistics calculated for a combination of recorded mouse and gaze positions may be successfully used for authenticating people, it must be noted that there were significant differences in performance among participants. For about half of them the results were satisfactory, with the best EER of 4% and F1-score of 99%, while for the worst participant an EER of 23% and an F1-score of 76% were obtained. These results suggest that finding one set of features suitable for every person may be a challenging task. This may imply that, for behavioral biometrics, building separate sets of features for each enrolled person should be considered.

Paweł Kasprowski, Katarzyna Harezlak
Backmatter
Metadata
Title
Beyond Databases, Architectures and Structures. Facing the Challenges of Data Proliferation and Growing Variety
Editors
Stanisław Kozielski
Dariusz Mrozek
Paweł Kasprowski
Bożena Małysiak-Mrozek
Daniel Kostrzewa
Copyright Year
2018
Electronic ISBN
978-3-319-99987-6
Print ISBN
978-3-319-99986-9
DOI
https://doi.org/10.1007/978-3-319-99987-6