
2017 | Book

Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation

13th International Conference, BDAS 2017, Ustroń, Poland, May 30 - June 2, 2017, Proceedings

Edited by: Stanisław Kozielski, Dariusz Mrozek, Paweł Kasprowski, Bożena Małysiak-Mrozek, Daniel Kostrzewa

Publisher: Springer International Publishing

Book series: Communications in Computer and Information Science


About this book

This book constitutes the refereed proceedings of the 13th International Conference entitled Beyond Databases, Architectures and Structures, BDAS 2017, held in Ustroń, Poland, in May/June 2017.

It consists of 44 carefully reviewed papers selected from 118 submissions. The papers are organized in topical sections, namely big data and cloud computing; artificial intelligence, data mining and knowledge discovery; architectures, structures and algorithms for efficient data processing; text mining, natural language processing, ontologies and semantic web; bioinformatics and biological data analysis; industrial applications; data mining tools, optimization and compression.

Table of contents

Frontmatter

Big Data and Cloud Computing

Frontmatter
Integrating Map-Reduce and Stream-Processing for Efficiency (MRSP)

Work in the field of data warehousing (DW) does not address Stream Processing (SP) integration as a means to provide result freshness (i.e. results that include information not yet stored in the DW) while at the same time relaxing the DW processing load. Previous research focuses mainly on parallelization, for instance adding more hardware resources or parallelizing operators, queries, and storage. A well-known and widely studied approach is to use Map-Reduce to scale horizontally in order to achieve more storage and processing performance. In many contexts, high-rate data needs to be processed in small time windows without storing results (e.g. for near real-time monitoring); in other cases, the objective is to relieve the data warehouse (e.g. keeping results updated for web-page reloads). In both cases, stream processing solutions can be set up to work together with the data warehouse (Map-Reduce-based or not) to keep results available on the fly, avoiding high query execution times and thereby leaving the DW servers more available for other heavy tasks (e.g. data mining). In this work, we propose the integration of Stream Processing and Map-Reduce (MRSP) for better query and DW performance. This approach relaxes the data warehouse load and, by consequence, reduces network usage. The mechanism integrates with Map-Reduce scalability mechanisms and uses the Map-Reduce nodes to process stream queries. Results show and compare performance gains on the DW side and the quality of experience (QoE) when executing queries and loading data.

Pedro Martins, Maryam Abbasi, José Cecílio, Pedro Furtado
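
The abstract above does not detail the integration mechanism, so the following is only a minimal sketch, assuming a tumbling-window aggregator that answers queries over fresh data in memory instead of hitting the data warehouse; the class, window policy, and keys are illustrative, not the authors' design (Python).

    from collections import defaultdict

    class StreamAggregator:
        """Keeps per-key sums for the current window so queries over fresh
        data can be answered without touching the data warehouse."""

        def __init__(self, window_seconds=60):
            self.window_seconds = window_seconds
            self.window_start = None
            self.sums = defaultdict(float)

        def ingest(self, timestamp, key, value):
            # Start a new window when the current one has expired; a real
            # system would flush the closed window to the warehouse here.
            if self.window_start is None or timestamp - self.window_start >= self.window_seconds:
                self.window_start = timestamp
                self.sums.clear()
            self.sums[key] += value

        def query(self, key):
            # Fresh result served on the fly, relieving the DW servers.
            return self.sums.get(key, 0.0)

    agg = StreamAggregator(window_seconds=60)
    agg.ingest(0, "page_views", 1)
    agg.ingest(5, "page_views", 3)
    print(agg.query("page_views"))  # 4.0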
Tensor-Based Modeling of Temporal Features for Big Data CTR Estimation

In this paper we propose a simple tensor-based approach to temporal features modeling that is applicable as means for logistic regression (LR) enhancement. We evaluate experimentally the performance of an LR system based on the proposed model in the Click-Through Rate (CTR) estimation scenario involving processing of very large multi-attribute data streams. We compare our approach to the existing approaches to temporal features modeling from the perspective of the Real-Time Bidding (RTB) CTR estimation scenario. On the basis of an extensive experimental evaluation, we demonstrate that the proposed approach enables achieving an improvement of the quality of CTR estimation. We show this improvement in a Big Data application scenario of the Web user feedback prediction realized within an RTB Demand-Side Platform.

Andrzej Szwabe, Pawel Misiorek, Michal Ciesielczyk
Evaluation of XPath Queries Over XML Documents Using SparkSQL Framework

In this contribution, we present our approach to querying XML documents stored in a distributed system. The main goal of this paper is to describe how to use the Spark SQL framework to implement a subset of expressions from the XPath query language. Five different methods of our approach are introduced and compared, and in doing so we also demonstrate the current state of query optimization on the Spark SQL platform, which can be seen as a further contribution of the paper. The subset of XPath expressions supported by the implemented methods contains all XPath axes except the attribute and namespace axes; predicates are not implemented in our prototype. We present our implemented system, data, measurements, tests, and results. The evaluated results support our belief that our method significantly decreases the data transfers in the distributed system that occur during query evaluation.

Radoslav Hricov, Adam Šenk, Petr Kroha, Michal Valenta
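
The abstract does not specify how XML is mapped onto Spark SQL, so the sketch below shows one plausible shredding (one row per node with a parent pointer) and evaluates a single child-axis step, /library/book, as a self-join. It uses the standard PySpark API, but the schema and query are assumptions and do not reproduce any of the paper's five methods.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("xpath-child-axis").getOrCreate()

    # One row per XML node: (node_id, parent_id, tag); parent_id 0 marks the root.
    # A real shredder would also keep document order, text content, attributes, etc.
    nodes = spark.createDataFrame(
        [(1, 0, "library"), (2, 1, "book"), (3, 2, "title"), (4, 1, "book")],
        ["node_id", "parent_id", "tag"],
    )

    # Evaluate the child-axis step /library/book as a parent-child self-join.
    parents = nodes.alias("p").filter(col("p.tag") == "library")
    children = nodes.alias("c")
    result = (parents
              .join(children, col("c.parent_id") == col("p.node_id"))
              .filter(col("c.tag") == "book")
              .select(col("c.node_id"), col("c.tag")))
    result.show()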
Metrics-Based Auto Scaling Module for Amazon Web Services Cloud Platform

One of the key benefits of moving an application to the cloud is the ability to scale horizontally easily when the workload increases. Many cloud providers offer an auto scaling mechanism which dynamically adjusts the number of virtual server instances on which a given system is running, according to basic resource-based metrics such as CPU utilization. In this work, we propose a model of auto scaling based on timing statistics: a high-order quantile and a mean value calculated from custom metrics, such as the execution time of a user request, gathered at the application level. Inputs to the model are user-defined values of those custom metrics. We developed a software module that controls the number of virtual server instances according to both auto scaling models and conducted experiments showing that our model based on custom metrics can perform better, using fewer instances while still meeting the assumed time constraints.

Dariusz Rafal Augustyn, Lukasz Warchal
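
As a rough illustration of the timing-statistics idea described above (not the authors' exact model), the sketch below computes the mean and the 95th percentile of request execution times and scales the instance count by how far those statistics exceed user-defined limits; the limit values and the proportional scaling rule are assumptions.

    import math
    import statistics

    def required_instances(exec_times_ms, current_instances,
                           mean_limit_ms=200.0, q95_limit_ms=500.0):
        """Scale the number of instances proportionally to how far the
        observed timing statistics exceed the user-defined limits."""
        mean = statistics.fmean(exec_times_ms)
        q95 = statistics.quantiles(exec_times_ms, n=100)[94]  # 95th percentile
        # How many times over the limits we are (>= 1.0 means scale out).
        pressure = max(mean / mean_limit_ms, q95 / q95_limit_ms)
        return max(1, math.ceil(current_instances * pressure))

    print(required_instances([120, 180, 240, 900, 150], current_instances=2))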

Artificial Intelligence, Data Mining and Knowledge Discovery

Frontmatter
Comparison of Two Versions of Formalization Method for Text Expressed Knowledge

The Node of Knowledge (NOK) method is a method for knowledge representation. It is used as a basis for the development of a formalism for textual knowledge representation (FNOK). Two versions of the formalization method and, correspondingly, two Question Answering (QA) systems have been developed. The first system uses grammars; it is written and implemented in Python. The second, improved system is based on storing text in relational databases without losing semantics and is implemented in Oracle. This paper presents the results of a comparison of the two QA systems. The first system was tested using 42 sentences. It received 88 questions from users and provided answers. After improving the formalization method, the second system was tested with the same set of sentences and questions. The paper presents the results of the testing, a comparison of the answers received from both systems, and an analysis of the correctness of the received answers.

Martina Asenbrener Katic, Sanja Candrlic, Mile Pavlic
Influence of Similarity Measures for Rules and Clusters on the Efficiency of Knowledge Mining in Rule-Based Knowledge Bases

In this work the application of clustering as a method for extracting knowledge from real-world data is discussed. The authors analyze the influence of different clustering parameters on the efficiency of the knowledge mining process for rules and rule clusters. In the course of the experiments, nine different object similarity measures and four cluster similarity measures were examined in order to verify their impact on the size of the created clusters and the size of their representatives. The experiments revealed a strong relationship between the parameters used in the clustering process and the future efficiency of mining knowledge from such structures: some parameters consistently produce shorter or longer representatives of the created rule clusters as well as smaller or larger cluster sizes.

Agnieszka Nowak-Brzezińska, Tomasz Rybotycki
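
The nine object and four cluster similarity measures are not named in the abstract; purely as an illustration, the sketch below shows one common pair of choices, Jaccard similarity between rules treated as sets of attribute-value pairs and single-link similarity between rule clusters.

    def jaccard(rule_a, rule_b):
        """Similarity of two rules represented as sets of (attribute, value) pairs."""
        a, b = set(rule_a), set(rule_b)
        return len(a & b) / len(a | b) if a | b else 1.0

    def single_link(cluster_a, cluster_b):
        """Cluster similarity taken as the best pairwise rule similarity."""
        return max(jaccard(r, s) for r in cluster_a for s in cluster_b)

    r1 = [("fever", "high"), ("cough", "yes")]
    r2 = [("fever", "high"), ("headache", "yes")]
    print(jaccard(r1, r2))  # 0.333...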
Attribute Reduction in a Dispersed Decision-Making System with Negotiations

The aim of the study was to apply rough set attribute reduction in a dispersed decision-making system. The system that was used was proposed by the author in a previous work. In this system, a global decision is taken based on the classifications made by the base classifiers. In the decision-making process, elements of conflict analysis and negotiations are applied. In this paper, reduction of the set of conditional attributes in local decision tables was used. The study analyzes and compares the results obtained after the reduction with the results obtained for the full set of attributes.

Małgorzata Przybyła-Kasperek
Adjusting Parameters of the Classifiers in Multiclass Classification

The article presents the results of optimizing classification for five selected data sets. These data sets contain data for multiclass classification. The article presents the results of an initial classification carried out by dozens of classifiers, as well as the results after the parameter-adjustment process, this time obtained for a set of selected classifiers. At the end of the article, a summary and possibilities for further work are provided.

Daniel Kostrzewa, Robert Brzeski
Data Mining - A Tool for Migration Stock Prediction

The migration phenomenon is an important issue for most European Union countries and has a major socio-economic impact on all parties involved. After 1989, a massive migration process developed from Romania towards Western European countries. Besides qualified personnel in search of different and new opportunities, Roma people became more visible, as they were emigrating to countries with high living standards, where they generated significant integration problems along with costs. In order to identify the problems faced by the Roma community from Rennes, a group of sociologists developed a questionnaire which contains, among other questions, one relating to the intention of returning home. This paper presents research that aims to build various models, using data mining techniques, to predict whether Roma people will return to their home country after a five-year interval. The second goal is to assess these models and to identify the aspects that have the most influence on the decision-making process. The results are based on data provided by more than 100 persons from Rennes.

Mirela Danubianu
A Survey on Data Mining Methods for Clustering Complex Spatiotemporal Data

This publication presents a survey of the clustering algorithms proposed for spatiotemporal data. We begin our study with definitions of spatiotemporal data types. Next we provide a categorization of spatiotemporal data types with special emphasis on spatial representation and diversity in the temporal aspect. We focus mainly on complex spatiotemporal objects. In particular, we review algorithms for two problems already addressed in the literature: clustering complex spatiotemporal objects such as polygons or geographical areas, and measuring distances between complex spatial objects. In addition to describing the problems mentioned above, we also attempt to provide a comprehensive review of references and a general look at the different problems related to clustering spatiotemporal data.

Piotr S. Maciąg

Architectures, Structures and Algorithms for Efficient Data Processing

Frontmatter
Multi-partition Distributed Transactions over Cassandra-Like Database with Tunable Contention Control

The amounts of data being processed today are enormous and require specialized systems to store them, access them, and perform computations. Therefore, a number of NoSQL databases and big data platforms have been built to address this problem. They usually lack transaction support featuring atomicity, consistency, isolation, and durability, while at the same time being distributed, scalable, and fault tolerant. In this paper, we present a novel transaction processing framework based on the Cassandra storage model. It uses the Paxos protocol to provide atomicity and consistency of transactions, and Cassandra-specific read and write path improvements to provide the read committed isolation level and durability. Unlike the built-in Lightweight Transactions (LWT) support in Cassandra, our algorithm can span multiple data partitions and provides tunable contention control. We verified correctness and efficiency both theoretically and by executing tests over different workloads. The results presented in this paper demonstrate the usability and robustness of the designed system.

Marek Lewandowski, Jacek Lewandowski
The Multi-model Databases – A Review

The following paper presents issues concerning multi-model databases. A multi-model database can be understood as a database capable of storing data in different formats (relations, documents, graphs, objects, etc.) under one management system. This makes it possible to store related data in the most appropriate (dedicated) format with respect to the structure of the data itself and processing performance. The idea is not new, but since its rise in the late 1980s it has not been successfully and widely put into practice; the realm of storing and retrieving data was dominated by the relational model. Nowadays the idea is becoming relevant again because of the growing popularity of the NoSQL movement and polyglot persistence. This article attempts to show the state of the art in the area of multi-model databases and the possibilities of this revived idea.

Ewa Płuciennik, Kamil Zgorzałek
Comparative Analysis of Relational and Non-relational Databases in the Context of Performance in Web Applications

This paper presents a comparative analysis of relational and non-relational databases. For the purposes of this paper a simple social-media web application was created. The application supports three types of databases: SQL (tested with PostgreSQL), MongoDB, and Apache Cassandra. For each database the applied data model is described. The aim of the analysis was to compare the performance of the selected databases in the context of data reading and writing. Performance tests showed that MongoDB is the fastest when reading data and PostgreSQL is the fastest for writing. The test application is fully functional; however, the implementation turned out to be more challenging for Cassandra.

Konrad Fraczek, Malgorzata Plechawska-Wojcik
Using Genetic Algorithms to Optimize Redundant Data

Analytic queries can exhaust the resources of the DBMS at hand. Since the nature of such queries can be foreseen, a database administrator can prepare the DBMS so that it serves them efficiently. Materialization of partial results (aggregates) is perhaps the most important method to reduce the resource consumption of such queries. The number of possible aggregates of a fact table is exponential in the number of its dimensions. The administrator has to choose a reasonable subset of all possible materialized aggregates. If an aggregate is materialized, it may produce benefits during query execution but also incur a cost during data maintenance (not to mention the space needed). Thus, the administrator faces an optimisation problem: knowing the workload (i.e. the queries and updates to be performed), what is the subset of all aggregates that gives the maximal net benefit? In this paper we present a cost model that defines the framework of this optimisation problem. Then, we compare two methods to compute the optimal subset of aggregates: a complete search and a genetic algorithm. We tested these methods on a fact table with 30 dimensions. The results are promising: the genetic algorithm runs significantly faster while yielding solutions within a 10% margin of the optimal solution found by the complete search.

Iwona Szulc, Krzysztof Stencel, Piotr Wiśniewski
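
A minimal sketch of the genetic-algorithm side of the aggregate-selection problem described above, assuming each individual is a bit mask over candidate aggregates and a simplified additive net-benefit fitness. The paper's cost model accounts for query rewriting and maintenance interactions, so the benefit and cost vectors here are purely illustrative random data.

    import random

    random.seed(0)
    N_AGGREGATES = 30                                               # one bit per candidate aggregate
    benefit = [random.uniform(0, 10) for _ in range(N_AGGREGATES)]  # query-time savings (illustrative)
    cost = [random.uniform(0, 6) for _ in range(N_AGGREGATES)]      # maintenance cost (illustrative)

    def net_benefit(mask):
        # Fitness = total benefit minus maintenance cost of materialized aggregates.
        return sum(b - c for bit, b, c in zip(mask, benefit, cost) if bit)

    def mutate(mask, p=0.05):
        return [bit ^ (random.random() < p) for bit in mask]

    def crossover(a, b):
        cut = random.randrange(1, N_AGGREGATES)
        return a[:cut] + b[cut:]

    population = [[random.randint(0, 1) for _ in range(N_AGGREGATES)] for _ in range(40)]
    for _ in range(200):
        population.sort(key=net_benefit, reverse=True)
        parents = population[:20]
        population = parents + [mutate(crossover(random.choice(parents),
                                                 random.choice(parents)))
                                for _ in range(20)]

    print(net_benefit(max(population, key=net_benefit)))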
Interoperable SQLite for a Bare PC

SQLite, a widely used database engine, has previously been transformed to run on a bare PC without the support of any OS or kernel. However, the transformed SQLite database was stored in main memory, i.e., it had no file system. This paper extends the transformation process to enable bare PC SQLite to work with standard file system interfaces based on the FAT32 file specification. It further presents mechanisms and programming interfaces for a bare machine file system integrated with SQLite that uses a removable USB flash drive. The bare SQLite database and file system can interoperate with conventional OS-based database systems. They can be adapted in the future to work with bare Web browsers, large bare databases, other bare applications, and bare mobile devices.

William Thompson, Ramesh Karne, Alexander Wijesinha, Hojin Chang
FM-index for Dummies

Full-text search refers to techniques for searching a document, or a document collection, in a full-text database. To speed up such searches, the given text should be indexed. The FM-index is a celebrated compressed data structure for full-text pattern searching. After the first wave of interest in its theoretical developments, a surge of interest in practical FM-index variants can be observed in the last few years. These enhancements are often related to a bit-vector representation augmented with an efficient rank-handling data structure. In this work, we propose a new, cache-friendly implementation of the rank primitive and advocate a very simple architecture of the FM-index, which trades compression ratio for speed. Experimental results show that our variants are 2–3 times faster than the fastest known ones, at the price of using typically 1.5–5 times more space.

Szymon Grabowski, Marcin Raniszewski, Sebastian Deorowicz
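
The paper's cache-friendly rank implementation is not reproduced in the abstract; the sketch below shows only the textbook idea behind a rank primitive over a bit vector, precomputing cumulative popcounts per fixed-size block so that rank1(i) needs one table lookup plus a short scan inside a single block. Block size and representation are illustrative.

    class BlockedRank:
        """rank1(i): number of set bits in bits[0:i], answered from per-block
        cumulative counts plus a short scan inside one block."""

        BLOCK = 64

        def __init__(self, bits):
            self.bits = bits
            self.block_counts = [0]
            acc = 0
            for start in range(0, len(bits), self.BLOCK):
                acc += sum(bits[start:start + self.BLOCK])
                self.block_counts.append(acc)

        def rank1(self, i):
            block, offset = divmod(i, self.BLOCK)
            base = block * self.BLOCK
            return self.block_counts[block] + sum(self.bits[base:base + offset])

    bv = [1, 0, 1, 1, 0] * 100
    r = BlockedRank(bv)
    print(r.rank1(7))  # set bits among positions 0..6 -> 4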
Lattice Based Consistent Slicer and Topological Cut for Distributed Computation in Monotone Spaces

Distributed database systems are increasingly employing distributed system platforms for data deployment and query-based computation. Models of distributed systems play a role in determining data partitioning and placement in distributed database systems. The application of concepts from topological spaces is gaining research attention for modeling the structures of distributed systems. In a distributed system, the slicer of a distributed computation partitions a set of processes into subsets while maintaining the consistency property. In this paper, a lattice-based slicer model of distributed computation is presented considering monotone topological spaces. The model considers the state space of asynchronous distributed computation. The proposed monotone slicer model of computation preserves the lattice cover of Birkhoff's representation. A set of analytical properties of the monotone slicer model is formulated. Furthermore, the topological cut of an event-based asynchronous distributed computation is formulated as a set of axioms.

Susmit Bagchi
Storage Efficiency of LOB Structures for Free RDBMSs on Example of PostgreSQL and Oracle Platforms

The article is a study of the storage efficiency of LOB structures for systems based on the PostgreSQL or Oracle relational database management systems. Although several NoSQL DBMS concepts have recently emerged (in particular document-oriented and key-value stores), relational databases are not losing their popularity. The main content is focused on sparse data, such as XML or base64-encoded data, which stored in raw form consumes a significant volume of storage. The study covers both performance (data volume stored per unit of time) and storage-saving aspects.

Lukasz Wycislik
Optimization of Memory Operations in Generalized Search Trees of PostgreSQL

Our team is working on new algorithms for intra-page indexing in PostgreSQL generalized search trees. During this work, we found that a slight modification of the algorithm for modifying a tuple on a page can significantly affect performance. This effect is caused by the optimization of page compaction operations and speeds up inserts and updates of data. The most important performance improvement is gained with sorted data insertion, where the time to insert data into an index can be reduced by a factor of 3. For randomized data the performance increase is around 15%. The size of the index is also significantly reduced. This paper describes the implementation and evaluation of the technique in the PostgreSQL codebase. The proposed patch has been committed upstream and is expected to be released with PostgreSQL 10.

Andrey Borodin, Sergey Mirvoda, Ilia Kulikov, Sergey Porshnev

Text Mining, Natural Language Processing, Ontologies and Semantic Web

Frontmatter
Sorting Data on Ultra-Large Scale with RADULS
New Incarnation of Radix Sort

The paper introduces RADULS, a new parallel sorter based on the radix sort algorithm, intended to organize ultra-large data sets efficiently. For example, 4 G 16-byte records can be sorted with 16 threads in less than 15 s on an Intel Xeon-based workstation. The implementation of RADULS is not only highly optimized to gain such excellent performance, but also parallelized in a cache-friendly manner to make the most of modern multicore architectures. Besides, our parallel scheduler launches a few different procedures at runtime, according to the current parameters of the execution, for proper workload management. All experiments show RADULS to be superior to competing algorithms.

Marek Kokot, Sebastian Deorowicz, Agnieszka Debudaj-Grabysz
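
As background to the abstract above, the sketch below is a plain, sequential LSD radix sort that processes one byte (256 buckets) per pass; it illustrates only the underlying algorithm and has none of RADULS's parallelization, cache-aware bucketing, or runtime scheduling.

    def radix_sort(keys, key_bytes=4):
        """LSD radix sort of non-negative integers, one byte (256 buckets) per pass."""
        for shift in range(0, 8 * key_bytes, 8):
            buckets = [[] for _ in range(256)]
            for k in keys:
                buckets[(k >> shift) & 0xFF].append(k)
            keys = [k for bucket in buckets for k in bucket]
        return keys

    print(radix_sort([3_000_000, 42, 7, 65_536, 255]))
    # [7, 42, 255, 65536, 3000000]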
Serendipitous Recommendations Through Ontology-Based Contextual Pre-filtering

Context-aware Recommender Systems aim to provide users with better recommendations for their current situation. Although evaluations of recommender systems often focus on accuracy, it is not the only important aspect. Recommendations are often overspecialized, i.e. all of the same kind. To deal with this problem, other properties can be considered, such as serendipity. In this paper, we study how an ontology-based and context-aware pre-filtering technique, which can be combined with existing recommendation algorithms, performs in ranking tasks. We also investigate the impact of our method on the serendipity of the recommendations. We evaluated our approach through an offline study, which showed that when used with well-known recommendation algorithms it can improve both accuracy and serendipity.

Aleksandra Karpus, Iacopo Vagliano, Krzysztof Goczyła
Extending Expressiveness of Knowledge Description with Contextual Approach

In the paper we show how imposing a contextual structure on a knowledge base can lead to extending its expressiveness without changing the underlying language. We show this using the example of Description Logics, which constitutes a base for a range of dialects for expressing knowledge in ontologies (including the state-of-the-art OWL). While contextual frameworks have been used in knowledge bases, they have been perceived as a tool for merging different viewpoints and domains, or as a tool for simplifying reasoning by constraining the range of the statements being considered. We show how they may also be used to express more complicated interrelationships between terms, and discuss the import of this fact for authoring ontologies.

Aleksander Waloszek, Wojciech Waloszek
Ontology Reuse as a Means for Fast Prototyping of New Concepts

In the paper we discuss the idea of ontology reuse as a way of fast prototyping of new concepts. In particular, we propose that instead of building a complete ontology describing certain concepts from scratch, it is possible and advisable to reuse existing resources. We claim that available online resources such as Wikidata or wordnets can be used to provide hints or even complete parts of ontologies, aiding the definition of new concepts. As a proof of concept, we present the implementation of an extension to the Ontolis ontology editor. With this extension we are able to reuse the ontologies provided by Wikidata to define concepts that have not been previously defined. As a preliminary evaluation of the extension, we compare the amount of work required to define selected concepts with and without the proposed ontology reuse method.

Igor Postanogov, Tomasz Jastrząb
Reading Comprehension of Natural Language Instructions by Robots

We address the problem of robots executing instructions written for humans. The goal is to simplify and speed up the process of adapting a robot to certain tasks described in human language. We propose an approach in which semantic roles are attached to the components of instructions that lead to robotic execution. However, extracting such roles from a sentence is not trivial due to the prevalent non-determinism of human language. We propose algorithms for extracting actions and object names with their roles, and explain how this leads to robotic execution via attached sub-symbolic information from previous execution examples for rotor assembly and bio(technology) laboratory scenarios. The precision of main action extraction is 0.977; for the main, primary, and secondary objects it is 0.828, 0.943, and 0.954, respectively.

Irena Markievicz, Minija Tamosiunaite, Daiva Vitkute-Adzgauskiene, Jurgita Kapociute-Dzikiene, Rita Valteryte, Tomas Krilavicius
A New Method of XML-Based Wordnets’ Data Integration

In the paper we present a novel method of wordnet data integration. The proposed method is based on the XML representation of wordnet content. In particular, we focus on the integration of VisDic-based documents representing the data of two Polish wordnets, i.e. plWordNet and Polnet. One of the key features of the method is that it is able to automatically identify and handle the discrepancies existing in the structure of the integrated documents. Apart from the method itself, we briefly discuss a C#-based implementation of the method. Finally, we present some statistical measures related to the data available before and after the integration process. The statistical comparison allows us to determine, among other things, the impact of particular wordnets on the integrated set of data.

Daniel Krasnokucki, Grzegorz Kwiatkowski, Tomasz Jastrząb
Authorship Attribution for Polish Texts Based on Part of Speech Tagging

Authorship attribution aims at identifying the author of an unseen text document based on text samples originating from different authors. In this paper we focus on authorship attribution of Polish texts using stylometric features based on part of speech (POS) tags. The Polish language is characterized by a high level of inflection, and in consequence over 1000 POS tags can be distinguished. This allows a sufficiently large feature space to be built by extracting POS information from documents and performing classification with machine learning methods. We report the results of experiments conducted with the Weka workbench using combinations of the following features: POS tags, an approximation of their bigrams, and simple document statistics.

Piotr Szwed
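
A minimal sketch of the kind of feature extraction the abstract describes: relative frequencies of POS tags and of their bigrams for one document, which could then be passed to any classifier (the paper uses the Weka workbench). The tag labels below are illustrative placeholders, not the actual Polish tagset.

    from collections import Counter

    def pos_features(pos_tags):
        """Relative frequencies of POS tags and of their bigrams for one document."""
        unigrams = Counter(pos_tags)
        bigrams = Counter(zip(pos_tags, pos_tags[1:]))
        n_uni, n_bi = len(pos_tags), max(len(pos_tags) - 1, 1)
        features = {f"tag={t}": c / n_uni for t, c in unigrams.items()}
        features.update({f"bigram={a}_{b}": c / n_bi for (a, b), c in bigrams.items()})
        return features

    doc = ["subst", "adj", "fin", "subst", "adj"]   # illustrative tag labels
    print(pos_features(doc))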
Fast Plagiarism Detection in Large-Scale Data

This paper presents research results from building a Polish semantic Internet search engine called Natively Enhanced Knowledge Sharing Technologies (NEKST) and its plagiarism detection module. The main goal is to describe the tools and algorithms of the engine and its usage within the Open System for Antiplagiarism (OSA).

Radosław Szmit
RDF Validation: A Brief Survey

The last few years have brought many changes in RDF validation and integrity constraints in the Semantic Web environment, offering more and more options. This paper analyses the current state of knowledge on RDF validation and proposes requirements for RDF validation languages. It overviews and compares previous approaches and development directions in RDF validation. It also points out the pros and cons of particular implementation scenarios.

Dominik Tomaszuk

Bioinformatics and Biological Data Analysis

Frontmatter
Objective Clustering Inductive Technology of Gene Expression Sequences Features

A technology for objective clustering of high-dimensional data features, based on methods of inductive modeling of complex systems, is presented in the paper. The architecture of the objective clustering inductive technology was developed as a block diagram of the step-by-step implementation of the object clustering procedure. A method for criterion-based evaluation of clustering results for complex data using two equal-power data subsets is proposed. The degree of clustering objectivity is evaluated on the basis of the combined use of internal and external criteria. Simulation results of the proposed technology, based on the SOTA self-organizing clustering algorithm, are presented for the gene expression data obtained by DNA microarray analysis of lung cancer patients (GEOD-68571, ArrayExpress database), the datasets "Compound" and "Aggregation" from the School of Computing of the University of Eastern Finland, and the "seeds" dataset.

Sergii Babichev, Volodymyr Lytvynenko, Maxim Korobchynskyi, Mochamed Ali Taiff
Novel Computational Techniques for Thin-Layer Chromatography (TLC) Profiling and TLC Profile Similarity Scoring

Thin-layer chromatography (TLC) is an experimental separation technique for multi-compound mixtures widely applied in various fields of industry and research. In contrast to comparable techniques, TLC is straightforward, cost- and time-efficient, and well suited to field operations due to its flexibility. In TLC, after a mixture sample is applied to the adsorbent layer on the TLC plate, the compounds ascend the plate at different rates due to their individual physicochemical characteristics, whereby separation is eventually achieved. In this paper, we present novel computational techniques for automated TLC plate photograph profiling and fast TLC profile similarity scoring that allow advanced database accessibility for experimental TLC data. The presented methodology thus provides a toolset for automated comparison of plate profiles with gold-standard or baseline profile databases. Impurities or deviations from standards can be readily identified. Hence, these techniques can be of great value in supporting the pharmaceutical quality assessment process.

Florian Heinke, Rico Beier, Tommy Bergmann, Heiko Mixtacki, Dirk Labudde
Extending the Doctrine ORM Framework Towards Fuzzy Processing of Data
Exemplified by Ambulatory Data Analysis

Extending standard data analysis with the possibility to formulate fuzzy search criteria and benefit from linguistic terms frequently used in real life, like small, high, normal, or around, has many advantages. In some situations, it allows the set of results to be extended with similar cases, which would be impossible or difficult with precise search criteria. This is especially beneficial when analyzing biomedical data, where the sets of important measurements or biomedical markers describing a particular state of a patient or person have similar, but not identical, values. In other situations, it allows the data to be generalized and aggregated, and thus quickly reduces the volume of data from Big to small. Extensions that allow fuzzy data analysis can be implemented in various layers of the database client-server architecture. In this paper, on the basis of ambulatory data analysis, we show extensions to the Doctrine object-relational mapping (ORM) layer that allow for fuzzy querying and grouping of crisp data.

Bożena Małysiak-Mrozek, Hanna Mazurkiewicz, Dariusz Mrozek
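
Doctrine is a PHP ORM and the extension's query syntax is not given in the abstract, so the sketch below shows only the underlying idea in Python: a trapezoidal membership function for a linguistic term such as "normal" systolic pressure, used to filter crisp readings by a minimum membership degree. The bounds and the 0.5 threshold are assumptions for illustration.

    def trapezoid(x, a, b, c, d):
        """Membership degree of crisp value x in a trapezoidal fuzzy set (a <= b <= c <= d)."""
        if x <= a or x >= d:
            return 0.0
        if b <= x <= c:
            return 1.0
        return (x - a) / (b - a) if x < b else (d - x) / (d - c)

    # Linguistic term 'normal' systolic pressure, illustrative bounds in mmHg.
    normal = lambda x: trapezoid(x, 100, 110, 130, 140)

    readings = [95, 118, 135, 150]
    matches = [(x, normal(x)) for x in readings if normal(x) >= 0.5]
    print(matches)   # keeps 118 (degree 1.0) and 135 (degree 0.5)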
Segmenting Lungs from Whole-Body CT Scans

Image segmentation is an initial, yet crucial procedure in a number of medical imaging systems. Despite the existence of numerous generic solutions that address this problem, there is still a need for fast and accurate techniques specialized in extracting particular organs from CT scans. In this paper, we present an approach based on simple operations, which is controlled with a few easy-to-adjust parameters and works without any user interaction. The proposed approach, despite its simplicity, was shown to be reliable and efficient on a dataset of over 50 studies containing both healthy and pathological lungs.

Maksym Walczak, Izabela Burda, Jakub Nalepa, Michal Kawulok
Improved Automatic Face Segmentation and Recognition for Applications with Limited Training Data

This paper introduces varied pose angle, a new approach to improving face identification given large pose angles and limited training data. Face landmarks are extracted and used to normalize and segment the face. Our approach does not require face frontalization and achieves consistent results. Results are compared using frontal and non-frontal training images for Eigen and Fisher classification of various face pose angles. Fisher scales better with more training samples only on a high-quality dataset. Our approach achieves promising results for three well-known face datasets.

Dane Brown, Karen Bradshaw
Emotion Recognition: The Influence of Texture’s Descriptors on Classification Accuracy

This work describes experiments dedicated to analyzing the descriptive properties of several of the most widely applied texture operators in the emotion recognition domain. Many researchers apply Gabor filters, histograms of oriented gradients, or local binary patterns in complex set-ups with different classification approaches and image processing methodologies, but it has not been verified how each part of the system influences the resulting performance. Therefore, several experiments with the Cohn-Kanade AU-Coded Facial Expression and Karolinska Directed Emotional Faces databases were performed. These experiments revealed that the histogram of oriented gradients outperforms the other texture operators in most cases.

Karolina Nurzynska

Industrial Applications

Frontmatter
The Use of the TGŚP Module as a Database to Identify Breaks in the Work of Mining Machinery

The article presents the results of an analysis of the causes of breaks in the work of selected mining machines. The studied machines belong to a mechanized longwall system. Identification of the causes of these breaks was carried out using the authors' database, created on the basis of information coming from an application that is an integral part of the Means of Production Management Module (TGŚP). This module is one of the basic parts of the integrated enterprise management system SZYK2. The results clearly show that the developed solution makes it possible to identify the causes of breaks in the examined machines' work more efficiently than previously used systems. This is the result of acquiring tacit knowledge from dispatchers thanks to the developed solution.

Jarosław Brodny, Magdalena Tutak, Marcin Michalak
A Data Warehouse as an Indispensable Tool to Determine the Effectiveness of the Use of the Longwall Shearer

The effective use of machines and devices in the mining industry is essential. In a competitive energy market, such effectiveness can in many cases determine whether the company continues to function. The article addresses these issues, in particular the way of determining the availability of a longwall shearer used in underground coal mining. The article proposes using a data warehouse to determine the load level of a longwall shearer during its work, on the basis of time series of the shearer's motor power consumption.

Jarosław Brodny, Magdalena Tutak, Marcin Michalak
Computer Software Supporting Rock Stress State Assessment for Deep Coal Mines

Underground deep mining always influences the surrounding rock mass. The excavated voids cause changes in the stress fields in their neighbourhood, which can lead to rock bursts. This can pose a very serious threat to workers and equipment. Thus a properly determined or predicted rock mass state is crucial information for many purposes in mining activity. Not only does it allow potential rock burst hazards to be identified in advance, it also allows the parameters of mining technologies and the excavation schedule to be designed to minimize the risk. One of the most important parameters describing the rock conditions is the seismic wave propagation anomaly, related to the rock mass stress state. Methods for its prediction have been researched at GIG for over 40 years. The article presents computer software for calculating the seismic wave propagation anomaly and the stress state anomaly. A calculation algorithm based on the method developed at GIG was designed. The algorithm was implemented in C++ and became a crucial component of the supporting calculation software. The software was developed as a Windows application and uses Microsoft SQL Server as the database management system. It also allows input data to be imported from several file formats. Special attention was paid to appropriate handling of spatial and temporal information. The system is capable of visualizing the calculation area as well as exporting the results in Surfer format for further analysis. The developed software is a valuable tool supporting the prediction of rock burst threats in deep coal mines.

Sebastian Iwaszenko, Janusz Makówka
An Ontology Model for Communicating with an Autonomous Mobile Platform

This paper presents an ontology-based communication interface dedicated to an autonomous mobile platform (AMP). All data between the Platform and other controllers, such as PCs or other AMPs, are exchanged using standardized services. This solution not only allows the required measurement information and its states to be received from an AMP but also allows the Platform to be controlled. The first advantage is that all of the information is available through an XML file. The second advantage is the improved possibility of controlling the AMP from external machines that can monitor its route. When avoiding obstacles, an external machine can, using the existing sensors and services, help the AMP return to the correct route. The data structure for the Platform is described as a set of standardized services, such as information about the existing configuration and the status of any installed sensors. The XML format helps to structure information by adding metadata. To create a fully functioning system, it is necessary to add a semantic model (relations between the elements and services) of the AMP services. This paper describes one possible solution for creating an ontology model using the current configuration, services for monitoring, and services for controlling the AMP.

Rafal Cupek, Adam Ziebinski, Marcin Fojcik

Data Mining Tools, Optimization and Compression

Frontmatter
Relational Transition System in Maude

Transition systems in which the state is described by a relational database have found applications in artifact-centric business process modeling, where business artifacts are often modeled relationally. We describe a framework, implemented in the term rewriting system Maude, for specifying and model checking relational transition systems. The system was created to be part of a future artifact-centric business process modeling framework, but it is of general interest on its own.

Bartosz Zieliński, Paweł Maślanka
A Performance Study of Two Inference Algorithms for a Distributed Expert System Shell

Rule-based knowledge systems are still popular in real-world applications, and rules are considered a standard form of knowledge representation in intelligent information systems. While the number of knowledge-based applications grows, the number of tools for building such systems grows much more slowly. This work is part of research focused on the development of new methods and tools for building rule-based expert systems. The software components mentioned in this work are the main parts of a distributed expert system shell. The existing implementation assumes that inference is performed on a preloaded knowledge base stored in memory. However, such a way of using rule bases may be infeasible or ineffective for large ones, especially when a weak hardware configuration (mobile applications, embedded systems) is used. In this work the use of database stored procedures is considered. This approach minimizes network traffic and is independent of the programming tools used; only a connection to the database server is required. The main goal of the experiments was to describe an experimental implementation of the forward chaining inference algorithm (as a stored procedure) and to evaluate this approach in comparison to performing inference on preloaded (real-world) knowledge bases.

Tomasz Xiȩski, Roman Simiński
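
For context, the sketch below is a naive forward-chaining loop over Horn-like rules, written in Python rather than as a database stored procedure; it only illustrates the inference algorithm the experiments implement, not the authors' stored-procedure code or their knowledge-base format.

    def forward_chaining(rules, facts):
        """Naive forward chaining: fire every rule whose premises are all known
        until no new conclusions can be derived."""
        facts = set(facts)
        changed = True
        while changed:
            changed = False
            for premises, conclusion in rules:
                if conclusion not in facts and set(premises) <= facts:
                    facts.add(conclusion)
                    changed = True
        return facts

    rules = [({"fever", "cough"}, "flu_suspected"),
             ({"flu_suspected"}, "order_test")]
    print(forward_chaining(rules, {"fever", "cough"}))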
DUABI - Business Intelligence Architecture for Dual Perspective Analytics

The significant expansion of Big Data and NoSQL databases has made it necessary to develop new architectures for Business Intelligence systems based on data organized in a non-relational way. There are many novel solutions combining Big Data technologies with Data Warehousing. However, the proposed solutions are often not sufficient to meet increasing business demands, such as low data latency while still maintaining the high functionality, efficiency, and reliability of Data Warehouses. In this paper we propose DUABI, a BI architecture that enables both traditional analytics over an OLAP cube and near real-time analytics over data stored in a NoSQL database. The presented architecture leverages features of NoSQL databases for scalability and fault tolerance with the use of mechanisms like sharding and replication.

Bartosz Czajkowski, Teresa Zawadzka
Comparative Analysis of JavaScript and Its Extensions for Web Application Optimization

This work is dedicated to the analysis and comparison of the efficiency of several extensions of JavaScript. The analysis concentrates on the delivered application performance in terms of web page updates, database display refreshing, etc. The comparison is performed using three scenarios: array data display, filling a form, and switching views between application pages. The research addresses the functionality of the considered frameworks and libraries on a personal computer as well as on a mobile device. The results of the comparison show that it is difficult to find one solution that works well in all circumstances. React, as the application view layer, can be recommended for server-side flow control, near the database, while Angular should be considered when a clear division into server and client sides is sought.

Adam Mlynarski, Karolina Nurzynska
ALMM Solver - Database Structure and Data Access Layer Architecture

The paper presents the form of data storage in the ALMM Solver and proposes a database structure. The solver is built on the Algebraic-Logical Meta-Model of Multistage Decision Process (ALMM of MDP) methodology. Functional and non-functional requirements for the data source are described. The detailed structure of the database areas (Problem instance, Experiment data, Experiment parameters) is presented, along with the architecture of the solver's database communication layer and the specific architecture of the SimOpt module from the communication perspective. The proposed database structure takes into account that not only numeric variables but also data sets and sequences defining the system state can be stored. The paper also presents an overview of selected solvers from the data source perspective.

Krzysztof Ra̧czka, Edyta Kucharska
Human Visual System Inspired Color Space Transform in Lossy JPEG 2000 and JPEG XR Compression

In this paper, we present a very simple color space transform, HVSCT, inspired by an actual analog transform performed by the human visual system. We evaluate the applicability of the transform to lossy image compression by comparing it, in the cases of JPEG 2000 and JPEG XR coding, to the ICT/YCbCr and YCoCg transforms for 3 sets of test images. The presented transform is competitive, especially for high-quality or near-lossless compression. In general, while the HVSCT transform results in PSNR close to YCoCg and better than the most commonly used YCbCr transform, at the highest bitrates it is in many cases the best among the tested transforms. The applicability of HVSCT reaches beyond compressed image storage: as its components are closer to the components transmitted to the human brain via the optic nerve than the components of traditional transforms, it may be effective for algorithms aimed at mimicking the processing done by the human visual system, e.g., for image recognition, retrieval, or image analysis for data mining.

Roman Starosolski
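
HVSCT itself is not defined in the abstract, so the sketch below shows only the two reference transforms it is compared against: the irreversible ICT/YCbCr transform (BT.601 weights, as used in lossy JPEG 2000) and the reversible, lifting-based YCoCg-R variant of the YCoCg transform on integer samples.

    def rgb_to_ycbcr(r, g, b):
        """Irreversible ICT/YCbCr (BT.601 weights) used in lossy JPEG 2000."""
        y  =  0.299 * r + 0.587 * g + 0.114 * b
        cb = -0.168736 * r - 0.331264 * g + 0.5 * b
        cr =  0.5 * r - 0.418688 * g - 0.081312 * b
        return y, cb, cr

    def rgb_to_ycocg_r(r, g, b):
        """Reversible (lifting-based) YCoCg-R transform on integer samples."""
        co = r - b
        t  = b + (co >> 1)
        cg = g - t
        y  = t + (cg >> 1)
        return y, co, cg

    print(rgb_to_ycbcr(255, 0, 0))
    print(rgb_to_ycocg_r(200, 120, 40))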
Backmatter
Metadata
Title
Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation
Edited by
Stanisław Kozielski
Dariusz Mrozek
Paweł Kasprowski
Bożena Małysiak-Mrozek
Daniel Kostrzewa
Copyright year
2017
Electronic ISBN
978-3-319-58274-0
Print ISBN
978-3-319-58273-3
DOI
https://doi.org/10.1007/978-3-319-58274-0
