
2018 | Book

Data Analytics and Management in Data Intensive Domains

XIX International Conference, DAMDID/RCDL 2017, Moscow, Russia, October 10–13, 2017, Revised Selected Papers

Edited by: Leonid Kalinichenko, Prof. Yannis Manolopoulos, Oleg Malkov, Nikolay Skvortsov, Sergey Stupnikov, Vladimir Sukhomlin

Publisher: Springer International Publishing

Book Series: Communications in Computer and Information Science


About this Book

This book constitutes the refereed proceedings of the 19th International Conference on Data Analytics and Management in Data Intensive Domains, DAMDID/RCDL 2017, held in Moscow, Russia, in October 2017.

The 16 revised full papers presented together with three invited papers were carefully reviewed and selected from 75 submissions. The papers are organized in the following topical sections: data analytics; next generation genomic sequencing: challenges and solutions; novel approaches to analyzing and classifying of various astronomical entities and events; ontology population in data intensive domains; heterogeneous data integration issues; data curation and data provenance support; and temporal summaries generation.

Table of Contents

Frontmatter

Data Analytics

Frontmatter
Deep Model Guided Data Analysis
Abstract
Data mining is currently a well-established technique supported by many algorithms. It depends on the data at hand, on the properties of the algorithms, on the technology developed so far, and on the expectations and limits to be applied. It must thus be mature, predictable, optimisable, evolving, adaptable and well-founded, similarly to mathematics and SPICE/CMM-based software engineering. Data mining must therefore be systematic if its results are to fit their purpose. One basis of this systematic approach is model management and model reasoning. We claim that systematic data mining is nothing else than systematic modelling. The central notion is that of the model in its variety of forms and abstractions, together with the associations among models.
Yannic Ole Kropp, Bernhard Thalheim
Data Mining and Analytics for Exploring Bulgarian Diabetic Register
Abstract
This paper discusses the need for building diabetic registers in order to monitor the development of the disease and to assess prevention and treatment plans. The automatic generation of a nation-wide Diabetes Register in Bulgaria is presented, using outpatient records submitted to the National Health Insurance Fund in 2010–2014 and updated with data from outpatient records for 2015–2016. The construction relies on advanced automatic analysis of free clinical texts and on business analytics technologies for storing, maintaining, searching, querying and analyzing data. Original frequent pattern mining algorithms make it possible to discover maximal frequent itemsets of simultaneous diseases for diabetic patients. We show how comorbidities identified for patients in the prediabetes period can help to define alerts about specific risk factors for Diabetes Mellitus type 2, and thus might contribute to prevention. We also claim that the synergy of modern analytics and data mining tools transforms a static archive of clinical patient records into a sophisticated knowledge discovery and prediction environment.
Svetla Boytcheva, Galia Angelova, Zhivko Angelov, Dimitar Tcharaktchiev
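As an illustration of the frequent-pattern step described in this abstract, the following minimal Python sketch mines maximal frequent itemsets from toy sets of diagnosis codes; the records, codes and support threshold are invented, and the register's actual algorithms are more sophisticated.

```python
from itertools import combinations

# Hypothetical patient records as sets of ICD-10 codes (illustrative only).
records = [
    {"E11", "I10", "E78"},
    {"E11", "I10"},
    {"E11", "E78"},
    {"I10", "E78", "E66"},
]
MIN_SUPPORT = 2  # an itemset must occur in at least this many records

def support(itemset):
    return sum(itemset <= record for record in records)

# Brute-force enumeration of frequent itemsets (fine for a toy example;
# real registers need Apriori- or FP-growth-style pruning).
items = sorted(set().union(*records))
frequent = [set(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if support(set(c)) >= MIN_SUPPORT]

# Maximal frequent itemsets: those with no frequent proper superset.
maximal = [s for s in frequent if not any(s < t for t in frequent)]
print(maximal)  # e.g. [{'E11', 'I10'}, {'E11', 'E78'}, {'E78', 'I10'}]
```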

Next Generation Genomic Sequencing: Challenges and Solutions

Frontmatter
An Introduction to the Computational Challenges in Next Generation Sequencing
Abstract
During the last decade, next generation sequencing has become one of the research areas posing the most significant challenges, both in terms of big data handling and of algorithmic problems.
In this review we discuss those challenges, with particular emphasis on the issues where scientific innovation will be essential to make progress.
Zoltan Szallasi
Overview of GeCo: A Project for Exploring and Integrating Signals from the Genome
Abstract
Next Generation Sequencing is a ten-year-old technology for reading DNA, capable of producing massive amounts of genomic data and thereby reshaping genomic computing. In particular, tertiary data analysis is concerned with the integration of heterogeneous regions of the genome; this is an emerging and increasingly important problem of genomic computing, because regions carry important signals, and the creation of new biological or clinical knowledge requires the integration of these signals into meaningful messages. We focus specifically on how the GeCo project is contributing to tertiary data analysis, by overviewing the main results of the project so far and by describing its future scenarios.
Stefano Ceri, Anna Bernasconi, Arif Canakoglu, Andrea Gulino, Abdulrahman Kaitoua, Marco Masseroli, Luca Nanni, Pietro Pinoli
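To make "integration of heterogeneous regions" concrete, here is a minimal sketch of a region-overlap join, the basic operation of tertiary analysis; the intervals are invented, and the project's own queries are written in its GMQL query language rather than in plain Python.

```python
# Half-open genomic regions as (chromosome, start, stop) tuples.
def overlaps(a, b):
    """Two regions overlap if they share a chromosome and intersect."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

mutations = [("chr1", 100, 101), ("chr2", 500, 501)]   # invented tracks
enhancers = [("chr1", 50, 150), ("chr2", 900, 1000)]

# Naive all-pairs join; real systems use interval indexes or sorting.
hits = [(m, e) for m in mutations for e in enhancers if overlaps(m, e)]
print(hits)  # [(('chr1', 100, 101), ('chr1', 50, 150))]
```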

Novel Approaches to Analyzing and Classifying of Various Astronomical Entities and Events

Frontmatter
Data Deluge in Astrophysics: Photometric Redshifts as a Template Use Case
Abstract
Astronomy has entered the big data era, and Machine Learning based methods have found widespread use in a large variety of astronomical applications, as demonstrated by the recent sharp increase in the number of publications making use of this approach. The usage of machine learning methods, however, is still far from trivial, and many problems remain to be solved. Using the evaluation of photometric redshifts as a case study, we outline the main problems and some ongoing efforts to solve them.
Massimo Brescia, Stefano Cavuoti, Valeria Amaro, Giuseppe Riccio, Giuseppe Angora, Civita Vellucci, Giuseppe Longo
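As a schematic example of the photometric-redshift use case, the sketch below treats redshift estimation as supervised regression from band magnitudes; both the data and the colour-redshift relation are synthetic, and the authors' own experiments rely on dedicated methods rather than this toy model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
mags = rng.uniform(18, 25, size=(n, 5))  # five synthetic band magnitudes
# Toy (non-physical) colour-redshift relation plus scatter, for illustration.
z = 0.1 * (mags[:, 0] - mags[:, 4]) + rng.normal(0, 0.02, n)

X_train, X_test, z_train, z_test = train_test_split(mags, z, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, z_train)

# The residual spread is one of the standard quality metrics in this field.
residuals = model.predict(X_test) - z_test
print("sigma of (z_phot - z_true):", residuals.std())
```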
Fractal Paradigm and IT-Technologies for Processing, Analyzing and Classifying Large Flows of Astronomical Data
Abstract
The paper considers the fractal paradigm for constructing models and logical schemes of algorithms and procedures for processing, analyzing and classifying large flows of astronomical data on the orbits and trajectories of small bodies. The methodology for constructing such models and schemes is based on proximity and connectivity criteria for orbits and trajectories in the space of possible states, using the corresponding mathematical apparatus of fractal dimensions. The logical, algorithmic and substantive essence of the fractal paradigm is as follows. First, the processing and analysis of the flow of orbits and trajectories determines whether it forms a fractal structure; if so, one has to determine the centers of fractal connectivity of the flow and obtain estimates of the index of information connectivity of orbits or trajectories. Second, monofractal structures are isolated in the flow and classified according to whether they belong to the class of percolating fractals or of fractal aggregates.
Alexei V. Myshev, Andrei V. Dunin
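A minimal sketch of one ingredient of the fractal paradigm, a box-counting estimate of fractal dimension, follows; the point cloud here is synthetic, whereas the paper works with orbit and trajectory states.

```python
import numpy as np

def box_counting_dimension(points, sizes):
    """Estimate fractal dimension as the slope of log N(s) vs log(1/s)."""
    counts = []
    for s in sizes:
        # Count how many grid boxes of side s are occupied by points.
        boxes = {tuple(np.floor(p / s).astype(int)) for p in points}
        counts.append(len(boxes))
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(sizes)), np.log(counts), 1)
    return slope

rng = np.random.default_rng(1)
points = rng.random((5000, 2))        # a filled square has dimension close to 2
sizes = [0.5, 0.25, 0.125, 0.0625]
print(box_counting_dimension(points, sizes))
```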
Neural Gas Based Classification of Globular Clusters
Abstract
In scientific as well as real-life problems, classification is a typical case of an extremely complex task in data-driven scenarios, especially if approached with traditional techniques. Machine Learning supervised and unsupervised paradigms, providing self-adaptive and semi-automatic methods, are able to navigate large volumes of data characterized by a multi-dimensional parameter space, and thus represent an ideal way to disentangle classes of objects reliably and efficiently. In Astrophysics, the identification of candidate Globular Clusters in deep, wide-field, single-band images is one such case where self-adaptive methods have demonstrated high performance and reliability. Here we experimented with some variants of the known Neural Gas model, exploring both supervised and unsupervised paradigms of Machine Learning, for the classification of Globular Clusters. The main scope of this work was to verify the possibility of improving the computational efficiency of these methods on complex data-driven problems by exploiting parallel programming on GPUs. Using this astrophysical playground, the goal was to scientifically validate such models for further application in other contexts.
Giuseppe Angora, Massimo Brescia, Stefano Cavuoti, Giuseppe Riccio, Maurizio Paolillo, Thomas H. Puzia
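For readers unfamiliar with the model, the following sketch shows the core unsupervised Neural Gas prototype update with fixed learning parameters; real implementations anneal those parameters over time, and the paper's GPU variants are not reproduced here.

```python
import numpy as np

def neural_gas(data, n_prototypes=4, epochs=20, eps=0.3, lam=1.0, seed=0):
    """Move every prototype toward each sample, with a step that decays
    exponentially with the prototype's distance rank (0 = closest)."""
    rng = np.random.default_rng(seed)
    w = data[rng.choice(len(data), n_prototypes, replace=False)].copy()
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            dists = np.linalg.norm(w - x, axis=1)
            ranks = np.argsort(np.argsort(dists))
            w += eps * np.exp(-ranks / lam)[:, None] * (x - w)
    return w

data = np.random.default_rng(2).normal(size=(300, 2))  # synthetic features
print(neural_gas(data))
```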
Matching and Verification of Multiple Stellar Systems in the Identification List of Binaries
Abstract
Binary and multiple stellar systems have been observed using various methods and tools. Catalogs of binaries of different observational types are independent and use their own star identification systems. Catalog rows describing components of stellar systems refer to identifiers in surveys and catalogs of single stars. The problem of cross-identification of stellar objects contained in sky surveys and in catalogs of binaries of different observational types requires not only combining lists of existing identifiers of binary stars, but rather matching components of multiple systems, and pairs of components, by their astrometric and astrophysical parameters. Existing identifiers are verified for belonging to the matched components, pairs and systems; only after that may they be matched to one another. The framework for multiple-system cross-matching presented in the paper uses domain knowledge of binaries of different observational types to form sets of matching criteria. The Identification List of Binaries (ILB) has been created after accurate matching of systems, their components and pairs of all observational types. This work continues research on identification methods for binary and multiple systems.
Nikolay A. Skvortsov, Leonid A. Kalinichenko, Alexey V. Karchevsky, Dana A. Kovaleva, Oleg Yu. Malkov
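One simple matching criterion of the kind such frameworks combine is an angular-separation cut; the sketch below implements it on invented coordinates, while ILB matching also weighs astrophysical parameters and observational types.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation between two (RA, Dec) positions, in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    cos_sep = (math.sin(dec1) * math.sin(dec2)
               + math.cos(dec1) * math.cos(dec2) * math.cos(ra1 - ra2))
    return math.degrees(math.acos(min(1.0, max(-1.0, cos_sep))))

TOLERANCE_DEG = 2.0 / 3600.0  # 2 arcseconds, an illustrative threshold

a = (165.458, -12.301)        # hypothetical catalog positions
b = (165.4583, -12.3012)
print(angular_separation_deg(*a, *b) < TOLERANCE_DEG)  # True: a match
```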
Aggregation of Knowledge on Star Cluster Structure and Kinematics in Data Intensive Astronomy
Abstract
A technique for studying star motions inside open star clusters is proposed. It reveals details of the spatial and kinematic structure of a cluster on the basis of precise measurements of the astrometric parameters of its stars. By successively scanning many clusters with a uniform technique, a processing pipeline is built; analysis of the results of such mass processing reveals patterns and relationships between the internal arrangement of clusters and their parameters and position in the Galaxy. The problem has become even more interesting with the recently launched Gaia space telescope, in particular for determining the membership of star clusters.
Sergei V. Vereshchagin, Ekaterina S. Postnikova
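A minimal sketch of a kinematic membership cut, one step such a pipeline might apply, follows; the proper motions are simulated rather than Gaia measurements, and the cluster's mean motion is assumed known rather than fitted.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic (pm_ra, pm_dec) proper motions in mas/yr: a tight cluster
# population plus a broad field population.
members = rng.normal([-2.0, 5.0], 0.3, size=(200, 2))
field = rng.normal([0.0, 0.0], 3.0, size=(800, 2))
stars = np.vstack([members, field])

cluster_pm = np.array([-2.0, 5.0])  # assumed mean cluster proper motion
radius = 1.0                        # illustrative selection radius, mas/yr

selected = stars[np.linalg.norm(stars - cluster_pm, axis=1) < radius]
print(f"{len(selected)} candidate members out of {len(stars)} stars")
```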
Search for Short Transient Gamma-Ray Events in SPI Experiment Onboard INTEGRAL: The Algorithm and Results
Abstract
We consider the possibilities for searching for and analyzing various short transient gamma-ray events in the archival data of the SPI experiment onboard the INTEGRAL observatory. The problems of raw observational data processing are discussed, including the search algorithm and the method of automated classification of detected events based on a set of various criteria. The results of the analysis of the SPI/INTEGRAL archival data for the period 2003–2010 are presented.
Pavel Minaev, Alexei Pozanenko
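To illustrate the burst-search idea, the sketch below flags time bins whose counts exceed a constant background by several Poisson sigmas; the light curve is simulated, and the actual SPI search algorithm and classification criteria are considerably more involved.

```python
import numpy as np

rng = np.random.default_rng(4)
background_rate = 100.0                   # counts per bin, assumed constant
counts = rng.poisson(background_rate, 1000).astype(float)
counts[500:503] += 80                     # inject a short transient

# Significance of each bin against the background under Poisson noise.
sigma = (counts - background_rate) / np.sqrt(background_rate)
triggers = np.flatnonzero(sigma > 5.0)
print("candidate event bins:", triggers)  # expect bins around 500..502
```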

Ontology Population in Data Intensive Domains

Frontmatter
Development of Ontologies of Scientific Subject Domains Using Ontology Design Patterns
Abstract
As developing ontologies of subject areas is a rather complex and time-consuming process, various methods and approaches have been proposed to simplify and facilitate it. Over the past few years, an approach based on ontology design patterns has been intensively developed. The paper discusses the application of ontology design patterns to the development of ontologies of scientific subject areas. Such patterns describe solutions to typical problems arising in ontology development; they are created to facilitate the process of building ontologies and to help developers avoid common, highly repetitive errors in ontology modeling. The paper presents the ontology design patterns resulting from problems that the authors have encountered in developing ontologies for scientific subject areas such as archeology, computational linguistics, system studies in power engineering, and active seismology.
Yury Zagorulko, Olesya Borovikova, Galina Zagorulko
PROPheT – Ontology Population and Semantic Enrichment from Linked Data Sources
Abstract
Ontologies are a rapidly emerging paradigm for knowledge representation, with a growing number of applications in various domains. However, populating ontologies with massive volumes of data is an extremely challenging task. The field of ontology population offers a wide array of approaches for populating ontologies in an automated or semi-automated way. Nevertheless, most of the related tools typically analyse natural language text, while sources of more structured information, such as Linked Open Data, would arguably be more appropriate. The paper presents PROPheT, a novel software tool for ontology population and enrichment. PROPheT can populate a local ontology model with instances retrieved from diverse Linked Data sources served by SPARQL endpoints. To the best of our knowledge, no existing tool offers PROPheT's extent of functionality.
Marina Riga, Panagiotis Mitzias, Efstratios Kontopoulos, Ioannis Kompatsiaris
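The retrieval step behind such population tools can be sketched as a query to a public SPARQL endpoint (here using the SPARQLWrapper package); the endpoint, class and query below are illustrative and do not reproduce PROPheT's actual workflow or interface.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Illustrative Linked Data source; PROPheT works with arbitrary endpoints.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?instance ?label WHERE {
        ?instance a dbo:ProgrammingLanguage ;
                  rdfs:label ?label .
        FILTER (lang(?label) = "en")
    } LIMIT 5
""")
sparql.setReturnFormat(JSON)

# Each binding could then become an instance in the local ontology model.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["instance"]["value"], "-", row["label"]["value"])
```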
Ontological Description of Applied Tasks and Related Meteorological and Climate Data Collections
Abstract
This work considers the use of the OWL ontology of climate information resources in the web GIS of the Institute of Monitoring of Climatic and Ecological Systems, Siberian Branch, Russian Academy of Sciences (IMCES SB RAS) for building the A-box of a knowledge base used in an intelligent decision support system (IDSS). A mathematical model used for solving the task of water freezing and ice melting on the Ob’ River is described. An example of solving the reduction problem with an ontological description of the related input and output data of the task is given.
Andrey Bart, Vladislava Churuksaeva, Alexander Fazliev, Evgeniy Gordov, Igor Okladnikov, Alexey Privezentsev, Alexander Titov

Heterogeneous Data Integration Issues

Frontmatter
Integration of Data on Substance Properties Using Big Data Technologies and Domain-Specific Ontologies
Abstract
A new technology for the storage and categorization of heterogeneous data on the properties of matter is proposed. The availability of a multitude of heterogeneous data from a variety of sources justifies the use of one of the popular toolkits for Big Data processing, Apache Spark. Its role in the proposed technology is to manage an extensive data warehouse stored as text files in the JSON format. The first stage of the technology involves the conversion of primary resources (relational databases, digital archives, Web portals, etc.) into a standardized JSON document form. The advantages of the JSON format are the ability to store data and metadata within a text document readable by both humans and computers, and the support for the hierarchical structures needed to represent complex and irregular data. The need for such data structures arises from possible expansions of the subject area: new types of materials, an extended nomenclature of properties, and so on. For the semantic integration of resources converted to the JSON format, a repository of subject-oriented ontologies is used. The search for data in the JSON document store is implemented through a combination of SPARQL and SQL queries. The first, addressed to the ontology repository, provides the user with the ability to view and search for adequate and related concepts. The second, accessing the JSON document sets, retrieves the required data from the document bodies using the capabilities of Apache Spark SQL. The efficiency of the developed technology is tested on problems of thermophysical data integration, with their characteristically complex logical structure.
Adilbek Erkimbaev, Vladimir Zitserman, Georgii Kobzev, Andrey Kosinov
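A minimal sketch of the second (SQL) query stage is given below, using Apache Spark to query a store of JSON documents; the file name and document schema are invented, and the SPARQL stage against the ontology repository is not shown.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("substance-properties").getOrCreate()

# Assume each line of the file is one JSON document describing a substance,
# with a nested 'property' structure (a hypothetical schema).
df = spark.read.json("substances.jsonl")
df.createOrReplaceTempView("substances")

result = spark.sql("""
    SELECT name, property.value AS thermal_conductivity
    FROM substances
    WHERE property.name = 'thermal conductivity' AND property.T = 300
""")
result.show()
spark.stop()
```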
Rule-Based Specification and Implementation of Multimodel Data Integration
Abstract
An approach for the rule-based specification of data integration using the RIF-BLD logic dialect, a W3C recommendation, is presented. The approach allows entities defined in different sources and represented in different data models (relational, XML, graph-based, document-based) to be combined in the same rule. The logical semantics of RIF-BLD provides for an unambiguous interpretation of data integration rules. The paper also proposes an approach for implementing RIF-BLD rules using the IBM High-level Integration Language (HIL). Data integration rules can thus be compiled into MapReduce programs and executed over Hadoop-based distributed infrastructures.
Sergey Stupnikov
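What a single multimodel integration rule does can be sketched in plain Python as a join of a relational row with a document; the data and rule below are invented, and the paper itself expresses such rules in RIF-BLD and compiles them via HIL, not Python.

```python
# Two sources in different data models (values are hypothetical).
relational_persons = [(1, "Ivanov"), (2, "Petrov")]            # (id, name)
document_profiles = [{"person_id": 1, "affiliation": "FRC CSC RAS"}]

# Rule (informally): integrated(id, name, affiliation) holds if
#   person(id, name) AND profile(person_id = id, affiliation).
integrated = [
    {"id": pid, "name": name, "affiliation": doc["affiliation"]}
    for (pid, name) in relational_persons
    for doc in document_profiles
    if doc["person_id"] == pid
]
print(integrated)  # [{'id': 1, 'name': 'Ivanov', 'affiliation': 'FRC CSC RAS'}]
```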
Approach to Forecasting the Development of Situations Based on Event Detection in Heterogeneous Data Streams
Abstract
The article deals with the problem of automated forecasting of situation development based on the analysis of heterogeneous data streams. Existing approaches to this problem are analyzed and their advantages and disadvantages are determined. The authors propose a novel approach to forecasting situation development based on event detection in data streams. The article analyzes various models of events and situations, as well as the event detection methods used in processing heterogeneous data. A method for generating possible scenarios of situation development is described. The method generates scenarios using the principle of historical analogy, taking into account the dynamics of the current situation's development. The probability that a generated scenario will be realized is estimated via logistic regression. The generated set of scenarios is analyzed using the Analytic Hierarchy Process to identify the optimistic and the pessimistic scenarios. The authors describe a way to supplement scenarios with recommendations for decision makers. The results of an experimental evaluation of the quality of the proposed approach are presented.
Ark Andreev, Dmitry Berezkin, Ilya Kozlov
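The scenario-scoring step can be sketched as a logistic regression over features of the match between a candidate scenario and the current situation; the features and labels below are synthetic, not the authors' data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
# Hypothetical features per historical analogue: similarity to the current
# situation, recency, and agreement of dynamics.
X = rng.random((200, 3))
# Synthetic labels: whether the analogous scenario actually materialized.
y = (X @ np.array([2.0, 1.0, 1.5]) + rng.normal(0, 0.3, 200) > 2.2).astype(int)

model = LogisticRegression().fit(X, y)

# Estimated realization probabilities for four candidate scenarios.
candidate_scenarios = rng.random((4, 3))
print(model.predict_proba(candidate_scenarios)[:, 1])
```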
Integrating DBMS and Parallel Data Mining Algorithms for Modern Many-Core Processors
Abstract
Relational DBMSs (RDBMSs) remain the most popular tool for processing structured data in data intensive domains. However, most stand-alone data mining packages process flat files outside of a RDBMS. In-database data mining avoids the export-import bottleneck for data and results that stand-alone mining packages incur, and keeps all the benefits provided by a RDBMS. The paper presents an approach to data mining inside a RDBMS based on a parallel implementation of user-defined functions (UDFs). The approach is implemented for PostgreSQL and the modern Intel MIC (Many Integrated Core) architecture. A UDF performs a single mining task on data from a specified table and produces a resulting table. The UDF is organized as a wrapper around an appropriate mining algorithm, which is implemented in the C language and parallelized with the OpenMP technology and thread-level parallelism. The heavy-weight parts of the algorithm are additionally parallelized with intrinsic functions for MIC platforms to achieve optimal loop vectorization manually. The library of such UDFs maintains a cache of precomputed mining structures to reduce the cost of further computations. In the experiments, the proposed approach shows good scalability and outperforms the R data mining package.
Timofey Rechkalov, Mikhail Zymbler
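The usage pattern of in-database mining can be sketched from the client side: the algorithm runs inside the DBMS as a UDF, and the client only issues SQL. The function name, arguments and tables below are hypothetical and do not reproduce the paper's actual API.

```python
import psycopg2

conn = psycopg2.connect("dbname=mining user=postgres")
with conn, conn.cursor() as cur:
    # Hypothetical UDF: cluster the rows of 'measurements' into 8 groups
    # inside the DBMS and materialize the result in 'measurements_clusters'.
    cur.execute("SELECT kmeans('measurements', 'measurements_clusters', 8);")
    cur.execute("SELECT cluster_id, count(*) FROM measurements_clusters "
                "GROUP BY cluster_id;")
    for cluster_id, size in cur.fetchall():
        print(cluster_id, size)
conn.close()
```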

Data Curation and Data Provenance Support

Frontmatter
Data Curation Policies and Data Provenance in EUDAT Collaborative Data Infrastructure
Abstract
This work outlines the development of a data curation and data provenance framework in the EUDAT Collaborative Data Infrastructure. Practical use cases are described, as well as the results of defining and implementing data curation policies and data provenance patterns.
Vasily Bunakov, Alexander Atamas, Alexia de Casanove, Pascal Dugénie, Rene van Horik, Simon Lambert, Javier Quinteros, Linda Reijnhoudt

Temporal Summaries Generation

Frontmatter
News Timeline Generation: Accounting for Structural Aspects and Temporal Nature of News Stream
Abstract
The number of news articles published daily is larger than any person can afford to study. Correct summarization of the information allows for an easy search for the event of interest. This research addresses the construction of annotations of a news story. Standard multi-document summarization approaches are not able to extract all the information relevant to an event, because they do not take into account how the context of the event varies over time. We have implemented a system that automatically builds a timeline summary. We investigated the impact of three factors: query extension, accounting for the temporal nature of the stream, and the inverted-pyramid structure of news articles. The annotations we generate are composed of sentences sorted in chronological order, which together contain the main details of the news story. The paper shows that taking the described factors into account positively affects the quality of the annotations created.
Mikhail Tikhomirov, Boris Dobrov
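The final assembly step of a timeline summary can be sketched as scoring sentences against a query, keeping the best sentence per day, and emitting them chronologically; the scoring below is a simple bag-of-words overlap, and the paper's three factors are not modelled.

```python
from datetime import date

# Invented news stream: (publication date, sentence) pairs.
sentences = [
    (date(2017, 10, 10), "The conference DAMDID/RCDL 2017 opened in Moscow."),
    (date(2017, 10, 10), "Weather in Moscow was cold."),
    (date(2017, 10, 13), "DAMDID/RCDL 2017 closed after four days of talks."),
]
query = {"damdid", "rcdl", "2017", "conference"}

def score(sentence):
    words = {w.strip(".,").lower() for w in sentence.split()}
    return len(words & query)

# Keep the highest-scoring sentence per publication day.
best_per_day = {}
for day, sent in sentences:
    if score(sent) > score(best_per_day.get(day, "")):
        best_per_day[day] = sent

for day in sorted(best_per_day):  # chronological timeline summary
    print(day, best_per_day[day])
```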
Backmatter
Metadata
Title
Data Analytics and Management in Data Intensive Domains
Edited by
Leonid Kalinichenko
Prof. Yannis Manolopoulos
Oleg Malkov
Nikolay Skvortsov
Sergey Stupnikov
Vladimir Sukhomlin
Copyright Year
2018
Electronic ISBN
978-3-319-96553-6
Print ISBN
978-3-319-96552-9
DOI
https://doi.org/10.1007/978-3-319-96553-6