
2016 | Book

Beyond Databases, Architectures and Structures. Advanced Technologies for Data Mining and Knowledge Discovery

12th International Conference, BDAS 2016, Ustroń, Poland, May 31 - June 3, 2016, Proceedings

Edited by: Stanisław Kozielski, Dariusz Mrozek, Paweł Kasprowski, Bożena Małysiak-Mrozek, Daniel Kostrzewa

Publisher: Springer International Publishing

Book Series: Communications in Computer and Information Science


About this book

This book constitutes the refereed proceedings of the 12th International Conference entitled Beyond Databases, Architectures and Structures, BDAS 2016, held in Ustroń, Poland, in May/June 2016.

It consists of 57 carefully reviewed papers selected from 152 submissions. The papers are organized in topical sections, namely artificial intelligence, data mining and knowledge discovery; architectures, structures and algorithms for efficient data processing; data warehousing and OLAP; natural language processing, ontologies and semantic Web; bioinformatics and biomedical data analysis; data processing tools; novel applications of database systems.

Table of Contents

Frontmatter

Invited Papers

Frontmatter
Interactive Visualization of Big Data

Data becomes too big to see. Yet visualization is a central way people understand data. We need to learn new ways to scale data visualization up and out so that people can explore their data visually and interactively in real time as a means to understanding it. The five V's of big data (value, volume, variety, velocity, and veracity) each highlight the challenges of this endeavor. We present these challenges and a system, Skydive, that we are developing to meet them. Skydive presents an approach that tightly couples a database back-end with a visualization front-end for scaling up and out. We show how hierarchical aggregation can be used to drive this, and the powerful types of interactive visual presentations that can be supported. We are preparing for the day, soon, when visualization becomes the sixth V of big data.

Parke Godfrey, Jarek Gryz, Piotr Lasek, Nasim Razavi
Big Data Management in the Cloud: Evolution or Crossroad?

In this paper, we try to provide a concise and comprehensive state of the art concerning big data management in cloud environments. In this perspective, data management based on parallel and cloud (e.g. MapReduce) systems is overviewed and compared with respect to meeting software requirements (e.g. data independence, software reuse), high performance, scalability, elasticity, and data availability. With respect to the proposed cloud systems, we discuss the evolution of their data manipulation languages and try to draw some lessons that should be exploited to ensure the viability of the next generation of large-scale data management systems for big data applications.

Abdelkader Hameurlain, Franck Morvan
Reduction of Readmissions to Hospitals Based on Actionable Knowledge Discovery and Personalization

In this work, we define procedure paths as the sequence of procedures that a given patient undertakes to reach a desired treatment. In addition to its value as a means of informing the patient of his or her course of treatment, being able to identify and anticipate procedure paths for new patients is essential for examining and evaluating the entire course of treatment in advance, and ultimately for rectifying undesired procedure paths accordingly. In this paper, we first introduce two approaches for anticipating the state that a patient will end up in after performing some procedure p; the state of the patient consequently indicates the following procedure that the patient is most likely to undergo. By clustering patients into subgroups that exhibit similar properties, we improve the predictability of their procedure paths, which we evaluate by calculating the entropy of the following procedure as a measure of its predictability. The clustering approach used is essentially a way of personalizing patients according to their properties. The approach used in this work is entirely novel and was designed specifically to address the twofold problem of, first, being able to predict the following procedures for new patients with high accuracy and, secondly, being able to construct such groupings in a way that allows us to identify exactly what it means to transition from one cluster to another. We further devise a metric system that evaluates the level of desirability of procedures along procedure paths, which we subsequently map to a metric system for the extracted clusters. This allows us to find desired transitions between patients in clusters, which results in reducing the number of anticipated readmissions for new patients.

Mamoun Almardini, Ayman Hajja, Zbigniew W. Raś, Lina Clover, David Olaleye, Youngjin Park, Jay Paulson, Yang Xiao
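A minimal Python sketch of the entropy-based predictability measure mentioned in the abstract above; the function name, the procedure codes and the per-cluster input are illustrative assumptions, not the authors' implementation.

import math
from collections import Counter

def next_procedure_entropy(next_procedures):
    # Shannon entropy (in bits) of the next-procedure distribution within one
    # cluster of patients; lower entropy means the following procedure is more
    # predictable. `next_procedures` lists the procedure observed as the next
    # step for each patient assigned to the cluster (hypothetical input).
    counts = Counter(next_procedures)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Example: a cluster in which most patients proceed to the same procedure.
print(next_procedure_entropy(["P12", "P12", "P12", "P47"]))  # ~0.81 bits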
Performing and Visualizing Temporal Analysis of Large Text Data Issued for Open Sources: Past and Future Methods

In this paper we first survey the state of the art in methods for the visualization and interpretation of textual data, in particular scientific data. We then briefly present our contributions to this field in the form of original methods for the automatic classification of documents and for easy interpretation of their content through characteristic keywords and classes created by our algorithms. In a second step, we focus our analysis on data evolving over time. We detail our diachronic approach, which is especially suitable for detecting and visualizing topic changes. This allows us to conclude with Diachronic'Explorer, our upcoming tool for visual exploration of evolutionary data.

Jean-Charles Lamirel, Nicolas Dugué, Pascal Cuxac

Artificial Intelligence, Data Mining and Knowledge Discovery

Frontmatter
Influence of Outliers Introduction on Predictive Models Quality

The paper presents results of research on the influence of the level of outliers in the data (training and test data considered separately) on the quality of model prediction in a classification task. A set of 100 semi-artificial time series was considered, whose independent variables were close to real ones observed in an underground coal mining environment and whose dependent variable was generated with a decision tree. For every considered method (decision trees, naive Bayes, logistic regression and kNN) a reference model was built (no outliers in the data), whose quality was compared with the quality of two models: Out-Out (outliers in training and test data) and Non-out-Out (outliers only in test data). Fifty levels of outliers in the data were considered, from 1 % to 50 %. Statistical comparison of the models was done on the basis of the sign test.

Mateusz Kalisch, Marcin Michalak, Marek Sikora, Łukasz Wróbel, Piotr Przystałka
Mining Rule-Based Knowledge Bases

Rule-based knowledge bases are constantly increasing in volume, thus the knowledge stored as a set of rules is getting progressively more complex, and when rules are not organized into any structure, the system is inefficient. In the author's opinion, modification of both the knowledge base structure and the inference algorithms leads to improved efficiency of the inference process. Partitioning the rules significantly reduces the percentage of the knowledge base analysed during inference. The form of each group's representative plays an important role in the efficiency of the inference process. The good performance of this approach is shown through an extensive experimental study carried out on a collection of real knowledge bases.

Agnieszka Nowak-Brzezińska
Two Methods of Combining Classifiers, Which are Based on Decision Templates and Theory of Evidence, in a Dispersed Decision-Making System

Issues related to decision making based on dispersed knowledge are discussed in the paper. The main aim of the paper is to compare the results obtained using two different methods of conflict analysis in a dispersed decision-making system. The conflict analysis methods used in the article are discussed in the paper of Kuncheva et al. [5] and in the paper of Rogova [16]. These methods are used when the individual classifiers generate vectors that represent probability distributions over different decisions. Both methods belong to the class-indifferent group, i.e. methods that use all of the decision profile matrices to calculate the support for each class. Also, both methods require training. These methods were used in a dispersed decision-making system which was proposed in the paper [12].

Małgorzata Przybyła-Kasperek
Methods for Selecting Nodes for Maximal Spread of Influence in Recommendation Services

Social network analysis is a tool to assess social interactions between people, e.g. on the Internet. Among the most active areas in this field are modeling the influence of users and finding influential users. These areas have many applications, e.g. in marketing, business or politics. Several models of influence have been described in the literature, but there is no single model that best describes the process of spreading entities (e.g. information, behaviour) through the network. An interesting and practical problem is how to choose a small number of users that will guarantee maximal spread of entities over the whole network (the influence maximization problem). In this paper we studied this problem using various centrality metrics with different models of influence propagation. Experiments were conducted on three real-world datasets from the domain of recommendation services.

Bogdan Gliwa, Anna Zygmunt
Memetic Neuro-Fuzzy System with Differential Optimisation

Neuro-fuzzy systems are capable of tuning their parameters on presented data. Both global and local techniques can be used. The paper presents a hybrid memetic approach where a local (gradient descent) and a global (differential evolution) approach are combined to tune the parameters of a neuro-fuzzy system. Application of the memetic approach results in lower error rates than either gradient descent optimisation or differential evolution alone. The results of experiments on benchmark datasets have been statistically verified.

Krzysztof Siminski
New Rough-Neuro-Fuzzy Approach for Regression Task in Incomplete Data

A fuzzy rule base is a crucial part of neuro-fuzzy systems. Data items presented to a neuro-fuzzy system activate rules in the rule base. For incomplete data the firing strength of the rules cannot be calculated. Some neuro-fuzzy systems impute the missing firing strength, and this approach has been successfully applied. Unfortunately, in some cases the imputed firing strength values are very low for all rules and data items are poorly recognized by the system, which may deteriorate the quality and reliability of the elaborated results. The paper presents a new method for handling missing values in neuro-fuzzy systems in a regression task. The new approach introduces a new imputation technique (imputation with group centres) to avoid very low firing strengths for incomplete data items. It outperforms the previous method (achieving lower error rates) and avoids numerical problems with very low firing strengths in all fuzzy rules of the system. The proposed system elaborates an interval answer without the Karnik-Mendel algorithm. The paper is accompanied by numerical examples and statistical verification on real-life data sets.

Krzysztof Siminski
Improvement of Precision of Neuro-Fuzzy System by Increase of Activation of Rules

Neuro-fuzzy systems have proved to be a powerful tool for data approximation and generalization. A rule base is a crucial part of a neuro-fuzzy system. The data items activate the rules and their answers are aggregated into a final answer. Experiments reveal that sometimes the activation of all rules in a rule base is very low, which means the system recognizes the data items very poorly. The paper presents a modification of the neuro-fuzzy system in which the tuning procedure has two objectives: minimizing the error of the system and maximizing the activation of rules. The higher activation (better recognition of the data items) makes the model more reliable. The increase of the activation of rules may also decrease the error rate of the model. The paper is accompanied by numerical examples.

Krzysztof Siminski
Rough Sets in Multicriteria Classification of National Heritage Monuments

The motivation of this paper is the problem of how to improve the assessment of historic buildings in terms of the significance of conservation activities. The protection of national heritage, so important nowadays, requires a multicriteria assessment. Since different factors affect the comprehensive assessment of an object to varying degrees, it becomes necessary to use computational intelligence methods. This paper presents a rough sets approach to multicriteria rating of objects, using the example of historic buildings.

Krzysztof Czajkowski

Architectures, Structures and Algorithms for Efficient Data Processing

Frontmatter
Inference Rules for Fuzzy Functional Dependencies in Possibilistic Databases

We consider fuzzy functional dependencies (FFDs) which can exist between attributes in possibilistic databases. The degree of FFD is evaluated by two numbers from the unit interval which correspond to possibility and necessity measures. The notion of FFD is defined with the use of the extended Gödel implication operator. For such dependencies we present inference rules as a fuzzy extension of Armstrong’s axioms. We show that they form a sound and complete system.

Krzysztof Myszkorowski
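For reference, the standard Gödel implication on the unit interval, of which the paper uses an extended variant:

$$ I_G(a, b) = \begin{cases} 1 & \text{if } a \le b,\\ b & \text{otherwise,} \end{cases} \qquad a, b \in [0, 1]. $$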
The Evaluation of Map-Reduce Join Algorithms

In recent years, Map-Reduce systems have grown into a leading solution for processing large volumes of data. Often, in order to minimize execution time, developers express their programs using a procedural language instead of a high-level query language. In such cases one has full control over the program execution, which can lead to several problems, especially where the join operation is concerned. In the literature a wide range of join techniques has been proposed, although many of them cannot be easily classified using the old Map-Side/Reduce-Side distinction. The main goal of this paper is to propose a taxonomy of the existing join algorithms and provide their evaluation.

Maciej Penar, Artur Wilczek
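To make the Reduce-Side family mentioned above concrete, here is a framework-free Python sketch of the classic reduce-side (repartition) equi-join; the toy relations and the simulated shuffle are assumptions for illustration only.

from collections import defaultdict
from itertools import product

def map_phase(records, source, key_index):
    # Tag each record with its source relation and emit (join_key, record).
    for rec in records:
        yield rec[key_index], (source, rec)

def reduce_phase(grouped):
    # For each join key, combine every pair of records from the two sources.
    for key, tagged in grouped.items():
        left = [r for s, r in tagged if s == "L"]
        right = [r for s, r in tagged if s == "R"]
        for l, r in product(left, right):
            yield key, l, r

# Simulated shuffle: group map output by key (Map-Reduce does this for free).
orders = [(1, "book"), (2, "lamp")]
customers = [(1, "Ann"), (2, "Bob"), (3, "Eve")]
groups = defaultdict(list)
for k, v in list(map_phase(orders, "L", 0)) + list(map_phase(customers, "R", 0)):
    groups[k].append(v)
print(list(reduce_phase(groups)))  # inner equi-join on the first attribute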
The Design of the Efficient Theta-Join in Map-Reduce Environment

When analysing data, the user may often want to perform a join between the input data sources. At first glance, in the Map-Reduce programming model, the developer is limited to equi-joins, as they can be easily implemented using the grouping operation. However, some techniques have been developed to support joins with non-equality conditions. In this paper, we propose an enhancement to cross-join based algorithms, such as Strict-Even Join, by handling the equality and non-equality conditions separately.

Maciej Penar, Artur Wilczek
Non-recursive Approach for Sort-Merge Join Operation

Several algorithms have been developed over the years to perform the join operation, which is executed frequently and affects the efficiency of the database system. Some of these efforts show that join performance mainly depends on the sequence of execution of relations, in addition to the hardware architecture. In this paper, we present a method that processes a many-to-many multi-join operation by using a non-recursive reverse polish notation tree for sort-merge join. More precisely, this paper sheds light on the main-memory join operation for two types of sort-merge join sequences: sequential join sequences (linear trees) and general join sequences (wide bushy trees, also known as composite inners), and tests their performance and functionality. We also provide the algorithm of the proposed system, showing the implementation steps.

Norah Asiri, Rasha Alsulim
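For reference, a simplified single-threaded Python sketch of the sort-merge join itself; the paper's non-recursive reverse polish notation tree of join sequences is not reproduced, and the relations and keys are toy assumptions.

from itertools import groupby, product
from operator import itemgetter

def sort_merge_join(left, right, key=itemgetter(0)):
    # Merge-join two relations on their first attribute, handling
    # many-to-many matches by grouping equal keys on both sides.
    lgroups = groupby(sorted(left, key=key), key=key)
    rgroups = groupby(sorted(right, key=key), key=key)
    lk, lrows = next(lgroups, (None, None))
    rk, rrows = next(rgroups, (None, None))
    while lrows is not None and rrows is not None:
        if lk == rk:  # matching key: emit the cross product of both groups
            for l, r in product(list(lrows), list(rrows)):
                yield l + r[1:]
            lk, lrows = next(lgroups, (None, None))
            rk, rrows = next(rgroups, (None, None))
        elif lk < rk:
            lk, lrows = next(lgroups, (None, None))
        else:
            rk, rrows = next(rgroups, (None, None))

print(list(sort_merge_join([(1, "a"), (1, "b"), (2, "c")],
                           [(1, "x"), (3, "y")])))  # [(1,'a','x'), (1,'b','x')]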
Estimating Costs of Materialization Methods for SQL:1999 Recursive Queries

Although querying hierarchies and networks is one of the common tasks in numerous business applications, the SQL standard did not acquire appropriate features until its 1999 edition. Furthermore, neither relational algebra nor relational calculus offers them. Since the announcement of the abovementioned standard, various database vendors have introduced SQL:1999 recursive queries into their products. Yet, there are popular database management systems that do not support such recursion; MySQL is probably the most prominent example. If the DBMS is accessed via an object-relational mapper (ORM), recursive queries can be offered by this middleware layer. Moreover, data structures materialized in the DBMS can be used to accelerate such queries. In prequel papers, we presented a product line of features that eventually allow MySQL users to run SQL:1999 recursive queries via an ORM: (1) appropriate ORM programmer interfaces, (2) optimization methods for recursive queries, and (3) methods to build materialized data structures that accelerate recursive queries. We have indicated four such methods, i.e.: full paths, logarithmic paths, materialized paths and logarithmic paths. In this paper we aim to assist a database or system architect in the choice of the optimal solution for the expected workload. We have performed exhaustive experiments to build a cost model for each of the solutions. Their results have been analyzed to build empirical formulae of the cost model. Using these formulae and estimated properties of the expected workload, the database architect or administrator can choose the best materialization method for his or her application.

Aleksandra Boniewicz, Piotr Wiśniewski, Krzysztof Stencel
Performance Aspect of the In-Memory Databases Accessed via JDBC

The concept of storing and managing data directly in RAM appeared some time ago, but despite its very good efficiency, it could not be widely implemented because of hardware limitations. Currently, it is possible to store whole databases in memory, and there are mechanisms to organize pieces of data as in-memory databases. An interesting issue is how this type of database behaves when accessed via JDBC. Hence, we decided to test their performance in terms of SQL query execution time. For this purpose the TPC Benchmark™ H (TPC-H) was applied. In our research we focused on open source systems such as Altibase, H2, HyperSQL, MariaDB and MySQL Memory.

Daniel Kostrzewa, Małgorzata Bach, Robert Brzeski, Aleksandra Werner
Comparison of the Behaviour of Local Databases and Databases Located in the Cloud

This article is dedicated to the analysis and comparison of the behaviour of databases located in the cloud with databases located in the local infrastructure. The analysis examines the stability of result delivery speed. The summary includes suggestions for the utilization of particular solutions in the implementation of specific types of database-based applications. A profitability threshold for database migration from the local environment to the cloud is also established. The article is aimed at specialists working on the design of IT projects as well as scientists who are considering cloud-based solutions for the storage of information, e.g. RNA polynucleotide sequences.

Marcin Szczyrbowski, Dariusz Myszor
Scalable Distributed Two-Layer Datastore Providing Data Anonymity

Storing data in public data systems (mostly in the cloud) raises many concerns about data privacy. Are our data completely safe? Inspired by these considerations, the author started to develop an efficient framework which can be used to improve data privacy while storing data in public data storages. The Scalable Distributed Two-Layer Datastore was used as a base for the framework because it has proved to be a very efficient solution for storing huge data sets.

Adam Krechowicz
Coordination of Parallel Tasks in Access to Resource Groups by Adaptive Conflictless Scheduling

Conflictless task scheduling is dedicated to environments of parallel task processing with high contention for a limited set of resources. For tasks that each require a group of resources, the presented solution can prepare a schedule of task execution without the occurrence of any resource conflict. A task can be any selected sequence of operations whose execution requires access to a resource group, and access to that group is controlled by conflictless scheduling. Each resource group required by tasks has its own FIFO queue, where tasks wait for access to those resources. Queues are emptied according to the prepared conflictless schedule in such a way that there is no starvation of waiting tasks. The presented scheduling concept for tasks and resource groups is based on a resource representation model which allows resource conflicts to be detected efficiently using dedicated data structures, such as task classes and a conflict matrix, and algorithms which allow an adaptive conflictless schedule to be prepared. The prepared conflictless schedule adapts to the current state of the environment, such as the number of resource groups, the tasks in their queues and the waiting times of tasks. The prepared schedule ensures task execution without resource conflicts and therefore without task deadlock. Examples of environments where conflictless scheduling can be applied are transaction processing in databases or OLTP systems and processes or threads competing for resources. In a transaction processing environment, deadlock elimination by the proposed conflictless scheduling reduces the number of transaction rollbacks.

Mateusz Smolinski
Conflictless Task Scheduling Using Association Rules

The proposed rules of conflictless task scheduling are based on a binary representation of tasks. Binary identifiers enable rapid detection of conflicts between tasks. The article presents the concept of conflictless task scheduling using one of the data mining methods, namely association rules.

Agnieszka Duraj
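A small Python sketch of the binary (bitmask) conflict test that the two scheduling abstracts above rely on; the task names and resource numbering are illustrative assumptions.

# Resources are numbered 0..n-1; each task holds a bitmask of the resources
# in the group it needs (the "binary identifier" idea described above).
RES_A, RES_B, RES_C = 1 << 0, 1 << 1, 1 << 2

tasks = {
    "t1": RES_A | RES_B,   # needs resources A and B
    "t2": RES_C,           # needs resource C only
    "t3": RES_B | RES_C,   # needs resources B and C
}

def in_conflict(mask_x, mask_y):
    # Two tasks conflict exactly when their resource sets intersect,
    # which reduces to a single bitwise AND on the identifiers.
    return (mask_x & mask_y) != 0

print(in_conflict(tasks["t1"], tasks["t2"]))  # False: {A, B} vs {C}
print(in_conflict(tasks["t1"], tasks["t3"]))  # True: both need B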
Distributed Computing in Monotone Topological Spaces

In recent times, an alternative approach to modeling and analyzing distributed computing systems has gained research attention. This approach considers higher-dimensional topological spaces and homotopy as well as homology while modeling and analyzing asynchronous distributed computing. This paper proposes that monotone spaces having the ending property can be effectively employed to model and analyze the consistency and convergence of distributed computing. A set of definitions and analytical properties is constructed considering monotone spaces. The inter-space relationship between simplexes and monotones in topological spaces is formulated.

Susmit Bagchi

Data Warehousing and OLAP

Frontmatter
AScale: Auto-Scale in and out ETL+Q Framework

The purpose of this study is to investigate the problem of providing automatic scalability and data freshness to data warehouses, while simultaneously dealing with high-rate data efficiently. In general, data freshness is not guaranteed in these contexts, since data loading, transformation and integration are heavy tasks that are performed only periodically. Desirably, users developing data warehouses should concentrate solely on the conceptual and logical design, such as business-driven requirements, logical warehouse schemas, workload and the ETL process, while physical details, including mechanisms for scalability, freshness and integration of high-rate data, should be left to automated tools. In this regard, we propose a universal data warehouse parallelization system, that is, an approach to enable the automatic scalability and freshness of warehouses and ETL processes. A general framework for testing and implementing the proposed system was developed. The results show that the proposed system is capable of handling scalability to provide the desired processing speed and data freshness.

Pedro Martins, Maryam Abbasi, Pedro Furtado
AScale: Big/Small Data ETL and Real-Time Data Freshness

In this paper we investigate the problem of providing timely results for the Extraction, Transformation and Load (ETL) process and automatic scalability to the entire pipeline, including the data warehouse. In general, data loading, transformation and integration are heavy tasks that are performed only periodically, during specific offline time windows. Parallel architectures and mechanisms are able to optimize the ETL process by speeding up each part of the pipeline as more performance is needed. However, none of them allows the user to specify the ETL time and have the framework scale automatically to assure it. We propose an approach to enable the automatic scalability and freshness of any data warehouse and ETL process in time, suitable for both small-data and big-data scenarios. A general framework for testing and implementing the system was developed to provide solutions for each part of ETL automatic scalability in time. The results show that the proposed system is capable of handling scalability to provide the desired processing speed for near-real-time ETL results.

Pedro Martins, Maryam Abbasi, Pedro Furtado
New Similarity Measure for Spatio-Temporal OLAP Queries

Storing, querying, and analyzing spatio-temporal data are becoming increasingly important as the volume of available spatio-temporal data increases. One important class of spatio-temporal analysis is computing the similarity of spatio-temporal queries. In this paper, we focus on assessing the similarity between Spatio-Temporal OLAP queries in terms of their GeoMDX queries. The problem of measuring the similarity of Spatio-Temporal OLAP queries has not been studied so far; therefore, we aim to fill this gap by proposing a novel similarity measure. The proposed measure can be used in developing query recommendation or personalization systems, or in speeding up query evolution. It takes into account the temporal similarity and the basic components of spatial similarity assessment relationships.

Olfa Layouni, Jalel Akaichi

Natural Language Processing, Ontologies and Semantic Web

Frontmatter
Enhancing Concept Extraction from Polish Texts with Rule Management

This paper presents a system for the extraction of concepts from unstructured Polish texts. Here concepts are understood as n-grams whose words satisfy specific grammatical constraints. Detection and transformation of concepts to their normalized form are performed with rules defined in a language which combines elements of colored and fuzzy Petri nets. We apply a user-friendly method for the specification of sample transformation patterns that are further compiled to rules. To improve accuracy and performance, we recently introduced rule management mechanisms based on two relations between rules: partial refinement and covering. The implemented methods include filtering with metarules and removal of redundant rules (i.e. those covered by other rules). We report the results of experiments aimed at extracting specific concepts (actions) using a ruleset refactored with the developed rule management techniques.

Piotr Szwed
Mapping of Selected Synsets to Semantic Features

In the paper we devise a novel algorithm related to the area of natural language processing. The algorithm is capable of building a mapping between sets of semantic features and the words available in semantic dictionaries called wordnets. In our research we consider wordnets as ontologies, paying particular attention to the hypernymy relation. The correctness of the proposal is verified experimentally on a selected set of semantic features. The plWordNet semantic dictionary is used as a reference source, providing the required information for the mapping. The algorithm is evaluated on an instance of a decision problem related to data classification. The quality measures of the classification include the false positive rate, false negative rate and accuracy. A measure of the strength of membership (SOM) in a semantic feature class is proposed and its impact on the aforementioned quality measures is evaluated.

Tomasz Jastrząb, Grzegorz Kwiatkowski, Paweł Sadowski
A Diversified Classification Committee for Recognition of Innovative Internet Domains

The objective of this paper is to propose a classification method for innovative domains on the Internet. The proposed approach helps to estimate whether companies are innovative or not by analyzing their web pages. A Naïve Bayes classification committee was used as the classification system for the domains. The classifiers in the committee were based concurrently on Bernoulli and Multinomial feature distribution models, which were selected depending on the diversity of the input data. Moreover, information retrieval procedures were applied to find the documents in each domain that most likely indicate innovativeness. The proposed methods have been verified experimentally. The results show that the diversified classification committee, combined with the information retrieval approach in the preprocessing phase, boosts the classification quality of domains that may represent innovative companies. This approach may be applied to other classification tasks.

Marcin Mirończuk, Jarosław Protasiewicz
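A hedged Python/scikit-learn sketch of a two-member Naïve Bayes committee that combines Bernoulli and Multinomial feature distribution models; the toy data and the simple probability averaging are assumptions for illustration, not the authors' diversity-based selection scheme.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

texts = ["patented sensor platform research", "family restaurant menu",
         "machine learning prototype lab", "shoe repair services"]
labels = np.array([1, 0, 1, 0])              # 1 = innovative, 0 = not (toy)

vec = CountVectorizer()
counts = vec.fit_transform(texts)            # term counts (Multinomial view)
binary = (counts > 0).astype(int)            # term presence (Bernoulli view)

multinomial = MultinomialNB().fit(counts, labels)
bernoulli = BernoulliNB().fit(binary, labels)

new_page = vec.transform(["prototype of a patented sensor platform"])
committee = (multinomial.predict_proba(new_page)
             + bernoulli.predict_proba((new_page > 0).astype(int))) / 2
print(committee)  # averaged class probabilities, columns ordered as [0, 1]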
The Onto-CropBase – A Semantic Web Application for Querying Crops Linked-Data

The lack of formal technical knowledge has been identified as one of the constraints to research and development on a group of crops collectively referred to as underutilized crops. Some information about these crops is available in informal sources on the web, for example on Wikipedia. However, this knowledge is not entirely authoritative; it may be incomplete and/or inconsistent, and for these reasons it is not suitable as a basis for decision making by crop growers. As an alternative, we present an ontology-driven tool as a web-based access point for an underutilized-crops ontology model. The major discussion points in this paper highlight our design choices, the tool implementation and a preliminary validation of the tool, with a brief discussion of related developments.

Abba Lawan, Abdur Rakib, Natasha Alechina, Asha Karunaratne
TripleID: A Low-Overhead Representation and Querying Using GPU for Large RDFs

Resource Description Framework (RDF) is a commonly used format for semantic web processing. It basically contains strings representing terms and their relationships, which can be queried or inferred over. An RDF dataset is usually a large text file which contains many millions of relationships. In this work, we propose a framework, TripleID, for processing queries over large RDF data. The framework utilises Graphics Processing Units (GPUs) to search RDF relations. The RDF data is first transformed to an encoded form suitable for storing in GPU memory. Then parallel threads on the GPU search the required data. We show in the experiments that one GPU in a personal desktop can handle 100 million triple relations, while a traditional RDF processing tool can process up to 10 million triples. Furthermore, we can query sample relations in 7 million triples within 0.18 s with the GPU, while the traditional tool takes at least 6 s for 1.8 million triples.

Chantana Chantrapornchai, Chidchanok Choksuchat, Michael Haidl, Sergei Gorlatch
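A CPU-only Python sketch of the encoding idea behind TripleID, namely replacing long RDF term strings with compact integer identifiers before searching; the GPU kernels of the paper are not reproduced, and the dictionary layout and data are toy assumptions.

term_to_id, id_to_term = {}, []

def encode(term):
    # Assign each distinct RDF term a small integer identifier.
    if term not in term_to_id:
        term_to_id[term] = len(id_to_term)
        id_to_term.append(term)
    return term_to_id[term]

triples = [tuple(encode(t) for t in triple) for triple in [
    ("ex:Alice", "foaf:knows", "ex:Bob"),
    ("ex:Bob", "foaf:knows", "ex:Carol"),
]]

# Query: who does ex:Alice know? Translate the query terms once, then scan
# the compact integer tuples (the scan is what the paper runs in parallel
# GPU threads).
s, p = encode("ex:Alice"), encode("foaf:knows")
print([id_to_term[o] for (s_, p_, o) in triples if (s_, p_) == (s, p)])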

Bioinformatics and Biomedical Data Analysis

Frontmatter
eQuant - A Server for Fast Protein Model Quality Assessment by Integrating High-Dimensional Data and Machine Learning

In molecular biology, reliable protein structure models are essential in order to understand the functional role of proteins as well as diseases related to them. Structures are derived by complex and resource-demanding experiments, whereas in silico structure modeling and refinement approaches have been established to cope with experimental limitations. Nevertheless, both experimental and computational methods are prone to errors. In consequence, small local regions or even the whole tertiary structure can be unreliable or erroneous, leading the researcher to formulate false hypotheses and draw false conclusions. Here, we present eQuant, a novel and fast model quality assessment program (MQAP) and server. By utilizing a hybrid approach of established MQAPs in combination with machine learning techniques, eQuant achieves more homogeneous assessments with less uncertainty compared to other established MQAPs. For normal-sized protein structures, computation requires less than ten seconds, making eQuant one of the fastest MQAPs available. The eQuant server is freely available at https://biosciences.hs-mittweida.de/equant/.

Sebastian Bittrich, Florian Heinke, Dirk Labudde
Evaluation of Descriptor Algorithms of Biological Sequences and Distance Measures for the Intelligent Cluster Index (ICIx)

Looking back over the previous decades, a rapid growth of data in all fields of the life sciences is perceptible. Most notable is the general tendency to retain well-established techniques regarding specific biological requirements and common taxonomies for data classification. Therefore a change in perspective towards advanced technological concepts for persisting, organizing and analyzing these huge amounts of data is essential. The Intelligent Cluster Index (ICIx), a modern technology capable of indexing multidimensional data through semantic criteria, is qualified for this challenge. In this paper methodical approaches for indexing biological sequences with ICIx are discussed and evaluated. This includes the examination of established methods concentrating on vector transformation as well as outlining the efficiency of different distance measures applied to these vectors. Based on our results, it becomes apparent that position-conserving methods are superior to other approaches and that the applied distance measures heavily influence performance and quality.

Stefan Schildbach, Florian Heinke, Wolfgang Benn, Dirk Labudde
A Holistic Approach to Testing Biomedical Hypotheses and Analysis of Biomedical Data

Testing biomedical hypotheses is performed based on advanced and usually multi-step analysis of biomedical data. This requires sophisticated analytical methods and data structures that allow intermediate results, which are needed in the subsequent steps, to be stored. However, biomedical data, especially reference data, often change in time, and new analytical methods are created every year. This makes it necessary to repeat the iterative analyses with new methods and new reference data sets, which in turn causes frequent changes of the underlying data structures. Such instability of data structures can be mitigated by using the idea of a data lake instead of traditional database systems. The aim of this paper is to present a system for researchers dealing with various types of biomedical data. Such a system provides functionality for data analysis and for testing different biomedical hypotheses. We treat the problem in a holistic way, giving a researcher freedom in configuring his or her own multi-step analysis. This is made possible by using a multiversion dynamic-schema data warehouse, performing parallel calculations on a virtualized computational environment, and delivering data in MapReduce-based ETL processes.

Krzysztof Psiuk-Maksymowicz, Aleksander Płaczek, Roman Jaksik, Sebastian Student, Damian Borys, Dariusz Mrozek, Krzysztof Fujarewicz, Andrzej Świerniak
Distributed Monte Carlo Feature Selection: Extracting Informative Features Out of Multidimensional Problems with Linear Speedup

Selection of informative features out of the ever-growing results of high-throughput biological experiments requires specialized feature selection algorithms. One such method is Monte Carlo Feature Selection - a straightforward, yet computationally expensive one. In this technical paper we present the architecture and performance of a development version of our distributed implementation of this algorithm, designed to run in multiprocessor as well as multihost computing environments, and potentially controllable through a web browser by non-IT staff. As a simple enhancement, our method is able to produce statistically interpretable output by means of permutation testing. Tested on the reference Golub et al. leukemia data, as well as on our own dataset of almost 2 million features, it has shown nearly linear speedup when executed with an increased number of processors. Being platform independent, as well as open to extensions, this application could become a valuable tool for researchers facing the challenge of ill-defined high-dimensional feature selection problems.

Lukasz Krol
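A greatly simplified, single-process Python sketch of the Monte Carlo Feature Selection idea (random feature subsets, random splits, accuracy-weighted importance aggregation); the exact relative importance formula, the permutation testing and the distributed execution described in the paper are omitted, and all data are synthetic.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)
rng = np.random.default_rng(0)
scores = np.zeros(X.shape[1])

for _ in range(300):                                  # Monte Carlo repeats
    subset = rng.choice(X.shape[1], size=10, replace=False)
    Xtr, Xte, ytr, yte = train_test_split(X[:, subset], y, test_size=0.3)
    tree = DecisionTreeClassifier().fit(Xtr, ytr)
    acc = tree.score(Xte, yte)
    scores[subset] += acc * tree.feature_importances_  # weight by accuracy

print(np.argsort(scores)[-5:])  # indices of the five top-ranked features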
Architectural Challenges of Genotype-Phenotype Data Management

Medical research initiatives more and more often involve processing considerable amounts of data that may evolve during the project. These data should be preserved and aggregated for the purpose of future analyses beyond the lifetime of a given research project. This paper discusses the challenges involved in the construction of the storage management layer for genotype-phenotype data. These data were used to research neurodegenerative disorders and their therapy. We outline the functionality of the data processing services. We also present a flexible data-storage structure. Finally, we discuss the choices regarding database schema management and input sanitization and processing.

Michał Chlebiej, Piotr Habela, Andrzej Rutkowski, Iwona Szulc, Piotr Wiśniewski, Krzysztof Stencel
Applying Neural Networks to Classification of Brain-Computer Interface Data

The paper presents the application of neural networks to the construction of a brain-computer interface (BCI) based on the Motor Imagery paradigm. The BCI was constructed for ten electroencephalographic (EEG) signals collected and analysed in real time. The filtered signals were divided into three groups corresponding to the information displayed to users on the screen during the experiments. ANOVA analysis and automatic construction of a neural network (NN) classifier were also performed. Results of the ANOVA analysis were confirmed by the neural network efficiency analysis. The efficiency of NN classification of the left and right hemisphere activities reached almost 70 %.

Malgorzata Plechawska-Wojcik, Piotr Wolszczak

Data Processing Tools

Frontmatter
Content Modelling in Radiological Social Network Collaboration

We present in this paper a model representation of a report extracted from a radiological collaborative social network, which combines textual and visual descriptors. The text and the medical image that compose a report are each described by a vector of TF-IDF weights following a bag-of-words approach. The model allows multimodal queries for searching medical information. Our model is evaluated on the imageCLEFMed 2015 collection, for which we have the ground truth. Many experiments were conducted with various descriptors and many combinations of modalities. Analysis of the results shows that the model based on two modalities increases the performance of a search system based on only one modality, be it textual or visual.

Riadh Bouslimi, Mouhamed Gaith Ayadi, Jalel Akaichi
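A small Python/scikit-learn sketch of the bag-of-words representation described above, with TF-IDF vectors for the textual and (pre-computed) visual modalities concatenated; the toy visual-word tokens and the concatenation scheme are assumptions for illustration.

from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ["nodule in the left lung", "fracture of the right femur"]
visual_words = ["vw12 vw87 vw87 vw03", "vw44 vw12 vw55 vw55"]  # toy codebook hits

text_vec, vis_vec = TfidfVectorizer(), TfidfVectorizer()
reports = hstack([text_vec.fit_transform(texts),
                  vis_vec.fit_transform(visual_words)])  # one row per report

# A multimodal query combines a textual part and a visual part.
query = hstack([text_vec.transform(["lung nodule"]),
                vis_vec.transform(["vw87 vw03"])])
print(cosine_similarity(query, reports))  # report 0 is ranked first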
Features of SQL Databases for Multi-tenant Applications Based on Oracle DBMS

The paper presents several architectural aspects of the data layer for developing multi-tenant applications. Multi-tenancy is a term describing the service delivery model in which one or many instances of a piece of software run on a server and serve multiple tenants. This feature has an impact on several non-functional aspects of a system, such as security, availability, backup, recovery and more. This article clarifies the specificity of these aspects in the approach to building multi-tenant applications and points out how different data layer architectures address them. Using the example of the Oracle database, several built-in features and concepts that could help increase the quality of the mentioned non-functional aspects are discussed.

Lukasz Wycislik
A New Big Data Framework for Customer Opinions Polarity Extraction

Opinion mining refers to extracting subjective information from text data using natural language processing, text analysis and computational linguistics. Micro-blogging is one of the most popular Web 2.0 applications; Twitter, for example, has evolved into a practical means of sharing opinions on different topics and has become a rich data source for opinion mining and sentiment analysis. In this work, we are interested in studying users' opinions about an object in social networks, for example the opinion of users about "the Samsung brand" or "the Nokia brand", using text mining and NLP (natural language processing) technologies. We propose a new ontological approach able to determine the polarity of a user post. This approach classifies user posts into negative, positive or neutral opinions. To validate the effectiveness of our approach, we used a dataset published by Bing Liu's group in our experiments.

Ammar Mars, Mohamed Salah Gouider, Lamjed Ben Saïd
Evidence Based Conflict Resolution for Independent Sources and Independent Attributes

The massive use of digital information has comprehensively changed the way we live. People rely more and more on information collected from various sources in every aspect of life. However, due to the natural variety and autonomy of these sources, finding relevant and accurate information is becoming increasingly difficult. Indeed, several sources can provide different, conflicting facts about the same real-world object. Moreover, most modern-day applications often provide imperfect information. Therefore, it is difficult to distinguish the true facts from the false ones. To deal with this problem, we propose in this paper a new evidential conflict resolution method for independent sources and independent attributes. Our method exploits the power of Dempster-Shafer theory so as to find the most trustworthy facts when data sources provide imperfect information.

Walid Cherifi, Bolesław Szafrański
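For reference, the classical Dempster rule of combination that such evidential methods build on (the paper's extension to independent sources and independent attributes is not reproduced here):

$$ (m_1 \oplus m_2)(A) = \frac{1}{1-K} \sum_{B \cap C = A} m_1(B)\, m_2(C), \qquad K = \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C), \quad A \neq \emptyset, $$

where $$m_1$$ and $$m_2$$ are basic belief assignments from two independent sources and K measures their conflict.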
A New MGlaber Approach as an Example of Novel Artificial Acari Optimization

The proposed MGlaber method is based on observation of the behavior of mites called Macrocheles glaber (Muller, 1860). It opens a series of optimization methods inspired by the behavior of mites, to which we have given a common name: Artificial Acari Optimization. Acarologists have observed three stages that the ovoviviparity process consists of, i.e.: preoviposition behaviour, oviposition behaviour (which is followed by holding an egg below the gnathosoma) and hatching of the larva supported by the female. It seems that the ovoviviparity phenomenon in this species is favoured by two factors, i.e.: poor feeding and poor quality of the substrate. Experimental tests on a genetic algorithm were carried out. The MGlaber method was worked into a genetic algorithm by replacing the crossing and mutation methods. The obtained results indicate a significant increase in the algorithm's convergence without side effects in the form of evolution stopping at local extrema. The experiment was carried out one hundred times on random starting populations. No significant deviations of the measured results were observed. The research demonstrated a significant increase in the algorithm's operation speed: the convergence of evolution increased about ten times. It should be noted that the MGlaber method was not only, or even primarily, created for genetic algorithms. The authors perceive large potential for its application in all optimization methods where the decision about the further fate of solutions is taken as a result of evaluating the objective function value. Therefore the authors treat this paper as the beginning of a cycle on Artificial Acari Optimization, which will include a series of methods inspired by the behaviour of different species of mites.

Jacek M. Czerniak, Dawid Ewald
Physical Knowledge Base Representation for Web Expert System Shell

Web applications have developed rapidly and have had a significant impact on the application of systems in many domains. The migration of information systems from classic desktop software to web applications can be seen as a permanent trend. This trend also applies to knowledge-based systems. This work is a part of the KBExplorator project – the main goal of this project is to provide a complete and easy-to-use web-based tool for the development of expert systems. The evaluation of rule searching effectiveness in the proposed physical rule base model is the first experimental aim of this work. Experiments are conducted to determine the duration of retrieving a single rule or a group of rules in large rule sets. Decomposition of the rule knowledge base into a relational database is also a crucial issue, and therefore the presentation of the data model is the second goal of this work. The usage of a relational database in a web-based application is obvious, but its usage as the physical storage for a rule base is described in a relatively small number of publications. The proposed decomposition concept and the model presented in this work have not been previously described. The positive results of the experiments presented in this work allow us to continue the development of the system – in the next revision, the database interface layer will be implemented with the usage of a specialized API. The proposed software architecture allows us to transparently change the database engine as well as the programming language currently used in the application layer of the system.

Roman Simiński, Tomasz Xiȩski
OSA Architecture

In this paper we present an in-depth discussion of the architecture of a new plagiarism detection platform developed by a consortium of Polish universities. The algorithms used by the platform are briefly described in Sect. 3. The main goal of this paper is to present the high-level structure of services resulting from a very nontrivial attempt to strike an appropriate balance between locality and centralization, while working under strict constraints of both a technological and a legal nature.

Ścibór Sobieski, Marek A. Kowalski, Piotr Kruszyński, Maciej Sysak, Bartosz Zieliński, Paweł Maślanka
An Investigation of Face and Fingerprint Feature-Fusion Guidelines

There is a lack of multi-modal biometric fusion guidelines at the feature level. This paper investigates face and fingerprint features in terms of their strengths and weaknesses. This serves as a set of guidelines for authors who are planning face and fingerprint feature-fusion applications or aim to extend this into a general framework. The proposed guidelines were applied to face and fingerprint data to achieve a 91.11 % recognition accuracy when using only a single training sample. Furthermore, an accuracy of 99.69 % was achieved when using five training samples.

Dane Brown, Karen Bradshaw
GISB: A Benchmark for Geographic Map Information Extraction

The growing number of different models and approaches for Geographic Information Systems (GIS) brings high complexity when we want to develop new approaches and compare a new GIS algorithm. In order to test and compare different processing models and approaches in a simple way, we identified the need to define uniform testing methods, able to compare processing algorithms in terms of performance and accuracy regarding large image processing and algorithms for GIS pattern detection. Taking into account, for instance, images collected during a drone flight or by a satellite, it is important to know the processing cost of extracting data when applying different processing models and approaches, as well as their accuracy (comparing execution time vs. extracted data quality). In this work we propose the GIS Benchmark (GISB), a benchmark that allows different approaches to detecting/extracting selected features from a GIS data-set to be evaluated. Considering a given data-set (or two data-sets, from different years, of the same region), it provides linear methods to compare different performance parameters regarding GIS information, making it possible to access the most relevant information in terms of features and processing efficiency.

Pedro Martins, José Cecílio, Maryam Abbasi, Pedro Furtado
SRsim: A Simulator for SSD-Based RAID

RAID is a popular storage architecture devised to improve both the I/O performance and the reliability of disk storage with disk arrays. With the declining price of NAND flash-based solid state drives and their performance gains, they have been widely deployed in systems ranging from portables to Internet-scale enterprise systems. By virtue of these benefits, studies on developing RAID gear with solid state drives have been performed in recent years. However, there are still research issues in realizing a reliable RAID system with arrays of solid state drives due to the different characteristics of NAND flash memory. Moreover, the internal software architecture of current commercial SSDs is not open to the public, so it is hard to test SSD-based RAID systems with newly devised algorithms. The fundamental algorithms in DBMSs have been optimized for the use of hard disk drives, which prefer sequential data accesses. Recently, DBMS internals have been modified to make the best use of solid state drives rather than treating them as a simple hardware replacement. To support such studies, we propose an open-source SSD-based RAID simulator named SRsim. SRsim helps researchers explore and experiment with their ideas on an array of solid state drives more easily and accurately.

HooYoung Ahn, YoonJoon Lee, Kyong-Ha Lee

Novel Applications of Database Systems

Frontmatter
Application of Reversible Denoising and Lifting Steps to LDgEb and RCT Color Space Transforms for Improved Lossless Compression

The lifting step of a reversible color space transform employed during image compression may increase the total amount of noise that has to be encoded. Previously, to alleviate this problem in the case of a simple color space transform, RDgDb, we replaced transform lifting steps with reversible denoising and lifting steps (RDLS), which are lifting steps integrated with denoising filters. In this study, we apply RDLS to the more complex color space transforms LDgEb and RCT and evaluate the RDLS effects on the bitrates of lossless JPEG-LS, JPEG 2000, and JPEG XR coding for a diverse image test-set. We find that RDLS effects differ among transforms, yet are similar for different algorithms; for the employed denoising filter selection method, on average the bitrate improvements of RDLS-modified LDgEb and RCT are not as high as those of the simpler transform. The applicability of RDLS reaches beyond image data storage; due to its general nature it may be exploited in other lifting-based transforms, e.g., during image analysis for data mining.

Roman Starosolski
Daily Urban Water Demand Forecasting - Comparative Study

There are many existing, general purpose models for the forecasting of time series. However, until now, only a small number of experimental studies exist whose goal is to select the best forecasting model for a daily urban water demand series. Moreover, most of the existing studies assume off-line access to data. In this study, we are confronted with the task of selecting the best forecasting model for a given water demand time series gathered from the water distribution system of Sosnowiec, Poland. In comparison to the existing works, we assume on-line availability of water demand data. Such an assumption enables day-by-day retraining of the predictive model. To select the best individual approach, a systematic comparison of numerous state-of-the-art predictive models is presented. For the first time in this paper, we evaluate the approach of averaging forecasts with respect to the on-line available daily water demand time series. In addition, we analyze the influence of missing data, outliers, and external variables on the accuracy of forecasting. The results of the experiments provide evidence that the averaged forecasts outperform all considered individual models; however, the selection of the models used for averaging is not trivial and must be done carefully. The source code of the performed experiments is available upon request.

Wojciech Froelich
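A minimal Python sketch of day-by-day forecast averaging on a synthetic daily demand series; the two component models and all parameters are illustrative assumptions, not the models compared in the paper.

import numpy as np

rng = np.random.default_rng(1)
days = 120
demand = 100 + 10 * np.sin(2 * np.pi * np.arange(days) / 7) + rng.normal(0, 3, days)

errors = {"seasonal": [], "moving_avg": [], "average": []}
for t in range(14, days):                 # walk forward through the stream
    seasonal = demand[t - 7]              # same weekday one week earlier
    moving_avg = demand[t - 7:t].mean()   # mean of the previous 7 days
    combined = (seasonal + moving_avg) / 2
    errors["seasonal"].append(abs(demand[t] - seasonal))
    errors["moving_avg"].append(abs(demand[t] - moving_avg))
    errors["average"].append(abs(demand[t] - combined))

# Mean absolute error of each forecaster over the simulated on-line period.
print({name: round(float(np.mean(e)), 2) for name, e in errors.items()})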
Database Index Debug Techniques: A Case Study

Index corruption may lead to serious problems ranging from a temporary system outage to the loss of sensitive data. In this article we discuss the techniques that we found helpful in assuring data index consistency during the development of specific indexing algorithms for a multidimensional BI system featuring both OLAP and OLTP aspects. Using the techniques described in this article from the very beginning of project development helped to save significant resources during development and debugging.

Andrey Borodin, Sergey Mirvoda, Sergey Porshnev
AI Implementation in Military Combat Identification – A Practical Solution

This paper presents the architecture of a communication system which was implemented in MiG-29 airplanes. This system provides continuous on-line access to the situational awareness information which is necessary for the pilot. The interoperability of this system with other NATO systems allows data to be collected and transferred between them. Artificial Intelligence methods are used to implement and improve this system. This modification enables the system to work faster and increases the situational awareness of the pilot on the battlefield.

Łukasz Apiecionek, Wojciech Makowski, Mariusz Woźniak
Persistence Management in Digital Document Repository

The CREDO Digital Document Repository enables short- and long-term archiving of large volumes of digital resources, ensuring bitstream preservation and providing most of the technical means to ensure content preservation of digital resources. The goal of the paper is to describe the design and implementation of an innovative component of the CREDO Repository: the Persistence Management Subsystem (PMS). This subsystem sets guidelines for the file management system on replica placement and data relocation. The module responsible for scheduling access to the archive provides energy efficiency by setting suboptimal schedules. The module responsible for the diagnosis and exchange of data carriers calculates the probabilities of failure, and this information is used by the scheduling module to select appropriate storage areas for reading or writing data, and for marking areas as obsolete. Finally, the power management module is responsible for starting up the storage areas only when necessary.

Piotr Pałka, Tomasz Śliwiński, Tomasz Traczyk, Włodzimierz Ogryczak
Intelligent FTBint Method for Server Resources Protection

The subject of this article is the security of network resources in computer networks. One of the main problems of computer networks is Distributed Denial of Service attacks, which can consume all server resources and block them. The FTBint intelligent method can manage the amount of network traffic passed to a server and help the server keep working during an attack. After the attack is recognized, the number of connections provided to the server can be changed over time in an intelligent way. Such a solution gives the server time to dispose of the resources which were incorrectly allocated by the attacker. This new concept is different from the one used in currently existing methods, as it enables the user to finish work which had been started before the attack occurred. Such a user does not suffer from DDoS attacks when the FTBint method is used. The proposed method has already been tested.

Łukasz Apiecionek, Wojciech Makowski
Lexicon-Based System for Drug Abuse Entity Extraction from Twitter

Drug abuse and addiction is a serious healthcare problem and social phenomenon that has not received the interest it deserves in scientific research, due to the lack of information. Today, social media have become a ubiquitous source of information in this field, since they are the environment on which addicted individuals rely to talk about their dependencies. However, extracting salient information from social media is a difficult task given their noisy, dynamic and unstructured character. In addition, natural language processing (NLP) tools are not designed to manage social data and cannot extract semantic and domain-specific entities. In this paper, we propose a framework for real-time collection and analysis of Twitter data whose heart is a personalized NLP process for the extraction of drug abuse information. We extend the Stanford CoreNLP pipeline with a customized annotator based on fuzzy matching against drug abuse and addiction lexicons in a dictionary. Our system, run on 86 041 tweets, achieved 82 % accuracy.

Ferdaous Jenhani, Mohamed Salah Gouider, Lamjed Ben Said
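A toy Python sketch of lexicon-based fuzzy matching for drug-related entities in short texts; the lexicon, the similarity threshold and the token-level matching are assumptions and do not reproduce the paper's Stanford CoreNLP annotator.

import difflib
import re

LEXICON = ["xanax", "oxycodone", "heroin", "adderall"]  # hypothetical entries

def extract_entities(tweet, cutoff=0.8):
    # Return (surface form, lexicon entry) pairs for tokens that fuzzily
    # match a lexicon entry above the similarity cutoff.
    tokens = re.findall(r"[a-z0-9]+", tweet.lower())
    hits = []
    for tok in tokens:
        match = difflib.get_close_matches(tok, LEXICON, n=1, cutoff=cutoff)
        if match:
            hits.append((tok, match[0]))
    return hits

print(extract_entities("took some xanaxx and adderal last night"))
# [('xanaxx', 'xanax'), ('adderal', 'adderall')]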
Manifold Learning for Hand Pose Recognition: Evaluation Framework

Hand pose recognition from 2D still images is an important, yet very challenging problem of data analysis and pattern recognition. Among the many approaches proposed, there have been some attempts to exploit manifold learning for recovering intrinsic hand pose features from the hand appearance. Although they were reported to be successful in solving particular problems related to recognizing a hand pose, there is a lack of a thorough study on how well these methods discover the intrinsic hand dimensionality. In this study, we introduce an evaluation framework to assess several state-of-the-art methods for manifold learning, and we report the results obtained for a set of artificial images generated from a hand model. This will help in future deployments of manifold learning for hand pose estimation, as well as for other multidimensional problems common in big data scenarios.

Maciej Papiez, Michal Kawulok, Jakub Nalepa
A Meta-Learning Approach to Methane Concentration Value Prediction

A meta-learning approach to stream data analysis is presented in this work. The analysis is based on the prediction of methane concentration in a coal mine. The results of the analysis show that the chosen approach achieves relatively low error values. Additionally, the impact of the data window size on learning speed and quality was verified. The analysis is performed on a stream of measurements that was generated on the basis of real values collected in a coal mine.

Michał Kozielski
Anomaly Detection in Data Streams: The Petrol Station Simulator

Developing anomaly detection systems requires diverse data for training and testing purposes. Real measurements are not necessarily reliable at this stage because it is almost impossible to find a diverse training set with exactly known characteristics. The petrol station simulator was designed to generate measurements that mimic real petrol station readings. The simulator produces datasets with exactly specified anomalies to be detected by an anomaly detection system. The paper introduces the foundations of the simulator along with results. The discussion section presents future work in the area of stream data extraction and materialization in the Stream Data Warehouse.

Anna Gorawska, Krzysztof Pasterak
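A toy Python illustration of the simulator idea above: generate fuel-tank readings with a known, labelled anomaly (a slow leak) injected at a chosen time, so that a detector can be scored against exact ground truth; all parameters are assumptions, not the simulator's actual model.

import numpy as np

rng = np.random.default_rng(42)
minutes = 600
# Tank level drops slowly with sales and carries measurement noise.
level = 5000 - 0.5 * np.arange(minutes) + rng.normal(0, 2, minutes)

leak_start = 400
labels = np.zeros(minutes, dtype=int)
labels[leak_start:] = 1                                      # ground-truth flag
level[leak_start:] -= 1.5 * np.arange(minutes - leak_start)  # injected leak

# The generated readings ship together with the exact anomaly labels, so an
# anomaly detection system can be evaluated against known ground truth.
np.savetxt("petrol_station_readings.csv",
           np.column_stack([level, labels]), delimiter=",",
           header="tank_level,is_anomaly", comments="")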
Backmatter
Metadata
Title
Beyond Databases, Architectures and Structures. Advanced Technologies for Data Mining and Knowledge Discovery
Edited by
Stanisław Kozielski
Dariusz Mrozek
Paweł Kasprowski
Bożena Małysiak-Mrozek
Daniel Kostrzewa
Copyright Year
2016
Electronic ISBN
978-3-319-34099-9
Print ISBN
978-3-319-34098-2
DOI
https://doi.org/10.1007/978-3-319-34099-9
