2019 | Book

Data Analytics and Management in Data Intensive Domains

20th International Conference, DAMDID/RCDL 2018, Moscow, Russia, October 9–12, 2018, Revised Selected Papers

Editors: Prof. Yannis Manolopoulos, Sergey Stupnikov

Publisher: Springer International Publishing

Book Series: Communications in Computer and Information Science

Part of: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

This book constitutes the refereed proceedings of the 20th International Conference on Data Analytics and Management in Data Intensive Domains, DAMDID/RCDL 2018, held in Moscow, Russia, in October 2018.

The 9 revised full papers presented together with three invited papers were carefully reviewed and selected from 54 submissions. The papers are organized in the following topical sections: FAIR data infrastructures, interoperability and reuse; knowledge representation; data models; data analysis in astronomy; text search and processing; distributed computing; information extraction from text.

Frontmatter

FAIR Data Infrastructures, Interoperability and Reuse

Frontmatter

FAIR Principles and Digital Objects: Accelerating Convergence on a Data Infrastructure

Abstract

As Moore’s Law and associated technical advances continue to bulldoze their way through society, both exciting possibilities and severe challenges emerge. The upside is the explosive growth of data and compute resources that promise revolutionary modes of discovery and innovation not only within traditional knowledge disciplines, but especially between them. The challenge, however, is to build the large-scale, widely accessible, persistent and automated infrastructures that will be necessary for navigating and managing the unprecedented complexity of exponentially increasing quantities of distributed and heterogeneous data. This will require innovations in both the technical and social domains. Inspired by the successful development of the Internet and leveraging the Digital Object Framework and FAIR Principles (for making data Findable, Accessible, Interoperable and Reusable by machines), the GO FAIR initiative works with voluntary stakeholders to accelerate convergence on minimal standards and working implementations leading to an Internet of FAIR Data and Services (IFDS). In close collaboration with GO FAIR and DONA, the RDA GEDE and C2CAMP initiatives will continue their FAIR DO implementation efforts.

Erik Schultes, Peter Wittenburg

Extensible Unifying Data Model Design for Data Integration in FAIR Data Infrastructures

Abstract

According to the Open Science paradigm, data sources are to be concentrated within research data infrastructures intended to support the whole cycle of data management and processing. The FAIR data management and stewardship principles that have been developed and announced recently state that data within a data infrastructure have to be findable, accessible, interoperable and reusable. Note that data sources can be quite heterogeneous and represented using very different data models. The variety of data models includes the traditional relational model and its object-relational extensions, array and graph-based models, semantic models like RDF and OWL, models for semi-structured data like NoSQL, XML, JSON and so on. This paper overviews data model unification techniques considered as a formal basis for (meta)data interoperability, integration and reuse within FAIR data infrastructures. These techniques are intended to deal with the heterogeneity of data models and of the data manipulation languages used to represent data and provide access to data in data sources. General principles of data model unification, the languages and formal methods required, and the stages of data model unification are considered and illustrated by examples. Application of the techniques for data integration within FAIR data infrastructures is discussed.

Sergey Stupnikov, Leonid Kalinichenko

Meaningful Data Reuse in Research Communities

Abstract

FAIR data principles declare data interoperability and reuse according to machine- and human-readable shared specifications. Adherence to this set of principles has implications for data infrastructures and research communities. Meaningful data exchange and reuse by humans and machines require formal specifications of research domains that accompany the data and allow automatic reasoning. Development of formal conceptual specifications in research communities can be stimulated by the need to reach semantic interoperability of data collections and components, and reuse of data resources. Usage of formal domain specifications reduces data heterogeneity costs. Formal reasoning allows meaningful search and verified reuse of data, methods, and processes from collections. These means can make the research lifecycle in communities more efficient. A lifecycle includes collecting domain knowledge specifications, classifying all data, methods, and processes according to such specifications, reusing relevant data and methods, and collecting and sharing results for reuse.

Nikolay A. Skvortsov

Knowledge Representation

Frontmatter

Tabular and Graphic Resources in Quantitative Spectroscopy

Abstract

An approach to forming applied ontologies in subject domains in which data are presented in various forms of tables and scientific graphics is proposed. A description of the sources of data and information presented in this form is given. Using quantitative spectroscopy as an example, an approach to forming semantic annotations characterizing these sources is demonstrated. The major types of the sources are described. For scientific graphics, an approach to solving the problem of reducing and systematizing the graphic resources to search for plots in the subject domain is described. A partition into groups of functions used in the plots that are not interrelated with each other is constructed to define different spectral functions to be equivalent. The metrics of three applied ontologies of spectroscopy used in comparing data collections are briefly described.

Nikolai A. Lavrentiev, Alexey I. Privezentsev, Alexander Z. Fazliev

Data Models

Frontmatter

The Principles and the Conceptual Architecture of the Metagraph Storage System

Abstract

This paper discusses an approach for active metagraph model storage. A formal definition of the metagraph data model is proposed. An example of a data metagraph model is given. Formal definitions of the metagraph function and rule agents are discussed. An example of a metagraph rule agent is given. It is shown that the distinguishing feature of the metagraph agent is its homoiconicity, which means that it can be a data structure for itself. Thus, the metagraph agent can change both data metagraph fragments and the structure of other metagraph agents. The definition of an active metagraph is given. The possible states of active metagraph elements and the transitions between them are discussed. A conceptual architecture of the metagraph storage system based on the active metagraph is proposed. Approaches for mapping the metagraph model to the flat graph, document-oriented, and relational data models are proposed. Experimental results for storing the metagraph model in different databases are given. It is shown that the flat graph model is the most suitable for metagraph storage.

Valeriy M. Chernenkiy, Yuriy E. Gapanyuk, Georgiy I. Revunkov, Ark M. Andreev, Yuriy T. Kaganov, Ivan V. Dunin, Maxim A. Lyaskovsky

Data Analysis in Astronomy

Frontmatter

Evaluation of Binary Star Formation Models Using Well-Observed Visual Binaries

Abstract

Creating a model of the Galaxy that describes the formation and evolution of binary stars requires generating and testing hypotheses related to the process of binary star formation and the distributions of their characteristics. A set of hypotheses can be generated on the basis of a number of publications that suggested how binary systems form. We describe a project aimed at finding the initial distributions of binary stars over component masses, mass ratios, semi-major axes and orbital eccentricities, and also pairing scenarios, by means of Monte-Carlo modeling of a sample of visual binaries of luminosity class V with a set of additional restrictions, so that the sample can be considered free of observational incompleteness effects. We present results which allow rejecting some estimated initial distributions of visual binary star parameters.

Oleg Malkov, Dmitry Chulkov, Yikdem Gebrehiwot, Dana Kovaleva, Nikolay A. Skvortsov, Alexey Sytov, Solomon Belay Tessema, Alexander Tutukov, Lev Yungelson

Text Search and Processing

Frontmatter

Proximity Full-Text Search by Means of Additional Indexes with Multi-component Keys: In Pursuit of Optimal Performance

Abstract

Full-text search engines are important tools for information retrieval. In a proximity full-text search, a document is relevant if it contains query terms near each other, especially if the query terms are frequently occurring words. For each word in a text, we use additional indexes to store information about nearby words that are at distances from the given word of less than or equal to the MaxDistance parameter. We showed that additional indexes with three-component keys can be used to improve the average query execution time by up to 94.7 times if the queries consist of high-frequency occurring words. In this paper, we present a new search algorithm with even more performance gains. We consider several strategies for selecting multi-component key indexes for a specific query and compare these strategies with the optimal strategy. We also present the results of search experiments, which show that three-component key indexes enable much faster searches in comparison with two-component key indexes.

Alexander B. Veretennikov
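
The idea of an additional index keyed by nearby word pairs, as described in the abstract above, can be illustrated with a minimal sketch. This is not the author's algorithm: it uses two-component keys only, and the names `build_pair_index`, `proximity_search` and `max_distance` (standing in for the MaxDistance parameter) are hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_pair_index(docs, max_distance):
    """Map each unordered word pair to the places where the two words
    occur within max_distance positions of each other.

    docs is a dict {doc_id: text}. Whitespace tokenization is an
    assumption made to keep the sketch short.
    """
    index = defaultdict(list)
    for doc_id, text in docs.items():
        words = text.lower().split()
        for i, w in enumerate(words):
            # Only look ahead; the sorted key makes the pair unordered.
            upper = min(len(words), i + max_distance + 1)
            for j in range(i + 1, upper):
                key = tuple(sorted((w, words[j])))
                index[key].append((doc_id, i, j - i))
    return index


def proximity_search(index, first, second):
    """Return sorted ids of documents where both words occur near each other."""
    key = tuple(sorted((first.lower(), second.lower())))
    return sorted({doc_id for doc_id, _, _ in index.get(key, [])})


docs = {
    1: "to be or not to be",
    2: "be quick and then go to sleep",
}
index = build_pair_index(docs, max_distance=2)
result = proximity_search(index, "to", "be")
```

A query for two frequent words then needs a single lookup in the pair index instead of intersecting two long posting lists, which is the intuition behind the speed-up the abstract reports; the price is a larger index.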

Scope and Challenges of Language Modelling - An Interrogative Survey on Context and Embeddings

Abstract

In this work we explore the domain of Language Modelling. We focus here on different context selection strategies, data augmentation techniques, and word embedding models. Many of the existing approaches are difficult to understand without specific expertise in this domain. Therefore, we concentrate on appropriate explanations and representations that enable us to compare several approaches.

Matthias Nitsche, Marina Tropmann-Frick

Distributed Computing

Frontmatter

Scalable Algorithm for Subsequence Similarity Search in Very Large Time Series Data on Cluster of Phi KNL

Abstract

Nowadays, subsequence similarity search under the Dynamic Time Warping (DTW) similarity measure is applied in a wide range of time series mining applications. Since the DTW measure has quadratic computational complexity w.r.t. the length of the query subsequence, a number of parallel algorithms for various many-core architectures have been developed, namely FPGA, GPU, and Intel MIC. In this paper, we propose a novel parallel algorithm for subsequence similarity search in very large time series data on a computing cluster with nodes based on the Intel Xeon Phi Knights Landing (KNL) many-core processors. Computations are parallelized both at the level of all cluster nodes through MPI, and within a single cluster node through OpenMP. The algorithm involves additional data structures and redundant computations, which make it possible to effectively use Phi KNL for vector computations. Experimental evaluation of the algorithm on real-world and synthetic datasets shows that it is highly scalable.

Yana Kraeva, Mikhail Zymbler
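
To make the quadratic cost of DTW mentioned in the abstract above concrete, here is a minimal serial sketch of the textbook dynamic-programming recurrence and a brute-force subsequence scan. It is not the authors' parallel MPI/OpenMP algorithm; the function names and the squared-difference point cost are assumptions chosen for illustration.

```python
import math


def dtw_distance(query, candidate):
    """Classic DTW via dynamic programming, O(len(query) * len(candidate)).

    Keeps only two rows of the cost matrix to save memory.
    """
    n, m = len(query), len(candidate)
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = (query[i - 1] - candidate[j - 1]) ** 2
            # Extend the cheapest of the three admissible warping steps.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


def best_match(series, query):
    """Scan every subsequence of len(query) and return (start, distance)
    of the one closest to the query under DTW."""
    n = len(query)
    best_pos, best_dist = -1, math.inf
    for start in range(len(series) - n + 1):
        d = dtw_distance(query, series[start:start + n])
        if d < best_dist:
            best_pos, best_dist = start, d
    return best_pos, best_dist


position, distance = best_match([0, 0, 1, 2, 3, 0], [1, 2, 3])
```

Each call to `dtw_distance` fills an n-by-n table for a query of length n, and the scan repeats this for every start position, which is why long series motivate the parallel and vectorized approaches the abstract describes.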

Information Extraction from Text

Frontmatter

Neural Network Approach for Extracting Aggregated Opinions from Analytical Articles

Abstract

Large texts that analyze a situation in some domain, for example politics or economy, are usually full of opinions. In analytical articles, opinions are usually attitudes whose source and target are named entities mentioned in the text. We present an application of a specific neural network model for sentiment attitude extraction. This problem is considered as a three-class machine learning task for whole documents. Treating text attitudes as a list of related contexts, we first extract related sentiment contexts and then calculate the resulting attitude label. For sentiment context extraction, we use a Piecewise Convolutional Neural Network (PCNN). We experiment with a variety of functions that allow us to compose the attitude label, including a recurrent neural network, which makes it possible to take additional context aspects into account. For experiments, we used the RuSentRel corpus, which contains Russian analytical texts in the domain of international relations.

Nicolay Rusnachenko, Natalia Loukachevitch

Discovering, Classification, and Localization of Emergency Events via Analyzing of Social Network Text Streams

Abstract

We present a text processing framework for the discovery, classification, and localization of emergency-related events via analysis of information sources such as social networks. The framework performs focused crawling of messages from social networks, text parsing, information extraction, detection of messages related to emergencies, automatic discovery of novel events, matching of events across different sources, as well as event localization and visualization on a geographical map. For detection of emergency-related messages, we use a CNN and word embeddings. The components of the framework are experimentally evaluated on Twitter and Facebook data.

Dmitriy Deviatkin, Artem Shelmanov, Daniil Larionov

Citation Content Analysis and a Digital Library

Abstract

This paper presents an approach to two-way data exchange between the citation content analysis provided by the Cirtec project and the large research digital library Socionet. Many papers in Socionet have citation relationships with other papers and also linkages with authors’ personal profiles and, through them, with other information objects. This makes it possible to enrich the data for citation content analysis with additional information and also to link the results of such analysis with objects in a digital library, such as papers, their authors, affiliated organizations, etc. We discuss what numeric and qualitative indicators can be built by citation content analysis based on the Cirtec open citation data. Since these indicators have IDs related to digital library objects, they can be integrated and visualized as computer-generated annotations to papers’ full texts in PDF.

Sergey Parinov

Backmatter

Title: Data Analytics and Management in Data Intensive Domains
Editors: Prof. Yannis Manolopoulos, Sergey Stupnikov
Publisher: Springer International Publishing
Electronic ISBN: 978-3-030-23584-0
Print ISBN: 978-3-030-23583-3
DOI: https://doi.org/10.1007/978-3-030-23584-0
