
2017 | Book

Data Analytics and Management in Data Intensive Domains

XVIII International Conference, DAMDID/RCDL 2016, Ershovo, Moscow, Russia, October 11–14, 2016, Revised Selected Papers


About this book

This book constitutes the refereed proceedings of the 18th International Conference on Data Analytics and Management in Data Intensive Domains, DAMDID/RCDL 2016, held in Ershovo, Moscow, Russia, in October 2016.

The 16 revised full papers presented together with one invited talk and two keynote papers were carefully reviewed and selected from 57 submissions. The papers are organized in topical sections on semantic modeling in data intensive domains; knowledge and learning management; text mining; data infrastructures in astrophysics; data analysis; research infrastructures; position paper.

Table of Contents

Frontmatter

Semantic Modeling in Data Intensive Domains

Frontmatter
Conceptualization of Methods and Experiments in Data Intensive Research Domains
Abstract
Nowadays, research in many fields, especially in the natural sciences, requires the manipulation of large volumes of data generated by observation, experiments, and modeling. Organizing data-intensive research assumes the definition of domain specifications, including concepts (specified by ontologies) and formal representations of data describing domain objects and their behavior (using conceptual schemes), shared and maintained by the communities working in the respective domains. Research infrastructures are based on domain specifications and provide methods, collected and developed by research communities, that are applied to such specifications. Tools for organizing experiments in research infrastructures are also supported by conceptual specifications for measuring and investigating object properties, applying research methods, and describing and testing hypotheses. Astronomy is chosen as a sample data-intensive domain to demonstrate the building of conceptual specifications and their use for data analysis.
Nikolay A. Skvortsov, Leonid A. Kalinichenko, Dmitry Yu Kovalev
Semantic Search in a Personal Digital Library
Abstract
The article offers a solution to the problem of semantic search in the LibMeta personal digital library and describes the L-tag-based semantic search model. It provides an algorithm for building a hierarchy of keywords and clusters by means of iterative clustering and keyword extraction. This hierarchy is used to generate abstracts and extract L-tags from texts.
Dmitriy Malakhov, Yuri Sidorenko, Olga Ataeva, Vladimir Serebryakov
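The abstract above mentions an algorithm that builds a hierarchy of keywords and clusters by iterative clustering and keyword extraction. The paper's L-tag model is not reproduced here; the following is only a minimal illustrative sketch, assuming TF-IDF features and k-means from scikit-learn, that recursively clusters documents and labels each cluster with its top-weighted terms.

```python
# Minimal sketch of a keyword/cluster hierarchy built by recursive clustering
# and keyword extraction. This is NOT the L-tag algorithm from the paper; it
# only illustrates the general idea with TF-IDF features and k-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans


def build_hierarchy(docs, depth=2, k=3, n_keywords=5):
    """Recursively cluster documents and label clusters with their top TF-IDF terms."""
    if depth == 0 or len(docs) < k:
        return {"docs": docs, "children": []}
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(docs)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    terms = vec.get_feature_names_out()
    children = []
    for c in range(k):
        idx = [i for i, l in enumerate(labels) if l == c]
        if not idx:
            continue
        # Keywords = highest-weighted terms in the cluster centroid.
        centroid = X[idx].mean(axis=0).A1
        keywords = [terms[i] for i in centroid.argsort()[::-1][:n_keywords]]
        members = [docs[i] for i in idx]
        children.append({"keywords": keywords,
                         "subtree": build_hierarchy(members, depth - 1, k, n_keywords)})
    return {"docs": docs, "children": children}
```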

Knowledge and Learning Management

Frontmatter
Digital Ecosystem OntoMath: Mathematical Knowledge Analytics and Management
Abstract
A mathematical knowledge management technology is discussed; its basic ideas, approaches, and results are based on targeted ontologies in the field of mathematics. The solution forms the basis of the specialized digital ecosystem OntoMath, which consists of a set of ontologies, text analytics tools, and applications for managing mathematical knowledge. The studies are in line with the project aimed at creating a World Digital Mathematical Library, whose objective is to design a distributed system of interconnected repositories of digitized mathematical documents.
Alexander Elizarov, Alexander Kirillovich, Evgeny Lipachev, Olga Nevzorova
Development of Fuzzy Cognitive Map for Optimizing E-learning Course
Abstract
Learning management system (LMS) optimization has become one of the core issues in the face of the increasing supply of learning content and the rising number of online-course participants. This optimization is mostly based on the analysis of LMS log data and on revealing users’ behavior patterns linked to the content. This article focuses on an approach to simulating LMS users’ behavior patterns that is based on fuzzy cognitive maps. The proposed model describes user-content interaction within the system and can be applied to predict users’ reactions to its learning, testing, and practical elements. The obtained cognitive map has been tested with data from the INFOMEPHIST system, which has been used to assist the learning process in a number of National Research Nuclear University MEPhI departments for more than nine years. The current and further research is supported by the NRNU MEPhI development program.
Vasiliy S. Kireev
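The paper above models user behavior with a fuzzy cognitive map (FCM). Its actual concepts and weights are not given here; the sketch below only illustrates one common variant of the standard FCM inference step, in which the concept activation vector is repeatedly multiplied by the weight matrix and squashed by a logistic function until it stabilizes. The three-concept example weights are invented.

```python
import numpy as np

def fcm_infer(weights, state, steps=50, tol=1e-5):
    """Fuzzy cognitive map inference: A(t+1) = f(A(t) @ W), logistic squashing.
    weights[i, j] is the causal influence of concept i on concept j;
    state is the initial activation vector."""
    f = lambda x: 1.0 / (1.0 + np.exp(-x))          # logistic squashing function
    for _ in range(steps):
        new_state = f(state @ weights)
        if np.max(np.abs(new_state - state)) < tol:  # reached a fixed point
            return new_state
        state = new_state
    return state

# Hypothetical 3-concept example: content difficulty, time on task, test score.
W = np.array([[0.0, 0.6, -0.4],
              [0.0, 0.0,  0.5],
              [0.0, 0.0,  0.0]])
print(fcm_infer(W, np.array([0.8, 0.3, 0.5])))
```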

Text Mining

Frontmatter
Supporting Biological Pathway Curation Through Text Mining
Abstract
Text mining technology performs automated analysis of large document collections, in order to detect various aspects of information about their structure and meaning. This information can be used to develop systems that make it much easier for researchers to locate information of relevance to their needs in huge volumes of text, compared to standard search mechanisms. With a focus on the challenging task of constructing biological pathway models, which typically involves gathering, interpreting and combining complex information from a large number of publications, we show how text mining applications can provide various levels of support to ease the burden placed on pathway curators. Such support ranges from applications that provide help in searching and exploring the literature for evidence relevant to pathway reactions, to those which are able to make automated suggestions about how to construct and update pathway models.
Sophia Ananiadou, Paul Thompson
Text Processing Framework for Emergency Event Detection in the Arctic Zone
Abstract
We present a text processing framework for the detection and analysis of events related to emergencies in a specified region, taking the Arctic zone as a particular example. The peculiarity of the task lies in data sparseness and the scarcity of tools and language resources for processing such specific texts. The system performs focused crawling of texts related to emergencies in the Arctic region; information extraction including named entity recognition, geotagging, vessel name recognition, and detection of emergency-related messages; and indexing of texts with their metadata for faceted search. The framework is designed to process both English and Russian text messages and documents. We report the results of an experimental evaluation of the framework components on Twitter data.
Dmitry Devyatkin, Artem Shelmanov
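The framework above chains focused crawling, named entity recognition, geotagging, vessel name recognition, and emergency message detection. None of its trained models or language resources are reproduced here; the fragment below is a toy sketch of just two of the listed steps, keyword-based detection of emergency-related messages and gazetteer-based geotagging, with an invented keyword list and gazetteer.

```python
import re

# Toy resources, invented for illustration only; the paper relies on trained
# models and real gazetteers/language resources for English and Russian.
EMERGENCY_KEYWORDS = {"fire", "oil spill", "collision", "distress", "rescue"}
ARCTIC_GAZETTEER = {"murmansk": (68.97, 33.08), "barents sea": (74.0, 36.0)}

def detect_emergency(text):
    """Return True if the message mentions any emergency keyword."""
    lowered = text.lower()
    return any(kw in lowered for kw in EMERGENCY_KEYWORDS)

def geotag(text):
    """Return (place, (lat, lon)) pairs for gazetteer entries found in the text."""
    lowered = text.lower()
    return [(place, coords) for place, coords in ARCTIC_GAZETTEER.items()
            if re.search(r"\b" + re.escape(place) + r"\b", lowered)]

msg = "Oil spill reported near Murmansk, vessel in distress"
print(detect_emergency(msg), geotag(msg))
```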
Fact Extraction from Natural Language Texts with Conceptual Modeling
Abstract
The paper presents an application of the Formal Concept Analysis paradigm to the problem of fact extraction from natural language texts. The proposed technique combines two conceptual models: conceptual graphs and the concept lattice. Conceptual graphs serve as semantic models of text sentences and as the data source for the concept lattice, the basic conceptual model in Formal Concept Analysis. With the concept lattice it is possible to model relationships between words from different sentences in different texts. These relationships are collected in the formal concepts of the lattice and allow formal concepts to be interpreted as possible facts. Facts can be extracted by navigating the lattice and interpreting its concepts and the hierarchical links between them. An experimental investigation of the proposed technique is performed on an annotated textual corpus consisting of descriptions of bacterial biotopes.
Mikhail Bogatyrev
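In Formal Concept Analysis, which underlies the approach above, objects and attributes form a binary context, and a formal concept is a pair (extent, intent) closed under the two derivation operators. The sketch below naively enumerates the formal concepts of a tiny invented context; it is a didactic illustration of lattice construction only, not the paper's conceptual-graph pipeline.

```python
from itertools import combinations

def derive_attrs(objs, context):
    """Attributes shared by all objects in objs (all attributes if objs is empty)."""
    sets = [context[o] for o in objs]
    return set.intersection(*sets) if sets else {a for v in context.values() for a in v}

def derive_objs(attrs, context):
    """Objects possessing all attributes in attrs."""
    return {o for o, a in context.items() if attrs <= a}

def formal_concepts(context):
    """Naive enumeration of all formal concepts (extent, intent) of a context."""
    objects = list(context)
    concepts = set()
    for r in range(len(objects) + 1):
        for objs in combinations(objects, r):
            intent = frozenset(derive_attrs(objs, context))
            extent = frozenset(derive_objs(intent, context))
            concepts.add((extent, intent))   # (A'', A') is always a formal concept
    return concepts

# Tiny invented context: which bacteria (objects) occur in which biotopes (attributes).
ctx = {"b1": {"soil", "water"}, "b2": {"water"}, "b3": {"soil"}}
for extent, intent in sorted(formal_concepts(ctx), key=lambda c: len(c[0])):
    print(set(extent), set(intent))
```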

Data Infrastructures in Astrophysics

Frontmatter
Hybrid Distributed Computing Service Based on the DIRAC Interware
Abstract
Scientific data-intensive applications requiring the simultaneous use of large amounts of computing resources are becoming quite common. The properties of applications coming from different scientific domains, as well as their requirements for computing resources, vary widely. Many scientific communities have access to different types of computing resources, and their workflows can often benefit from a combination of High Throughput Computing (HTC) and High Performance Computing (HPC) centers, cloud, or volunteer computing power. However, all these resources have different user interfaces and access mechanisms, which makes their combined usage difficult for users. This problem is addressed by projects developing software for the integration of various computing centers into a single coherent infrastructure, the so-called interware. One such software toolkit is the DIRAC interware. This product was very successful in solving the problems of large High Energy Physics experiments and was reworked to offer a general-purpose solution suitable for other scientific domains. Services based on the DIRAC interware are now offered to users of several distributed computing infrastructures. One of these services is deployed at the Joint Institute for Nuclear Research, Dubna. It aims at the integration of the computing resources of several grid and supercomputer centers as well as cloud providers. An overview of the DIRAC interware and its use for creating and operating a hybrid distributed computing system at JINR is presented in this article.
Victor Gergel, Vladimir Korenkov, Igor Pelevanyuk, Matvey Sapunov, Andrei Tsaregorodtsev, Petr Zrelov
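The service described above builds on the DIRAC interware, whose Python API lets a user describe a job once and have DIRAC route it to whichever grid, cloud, or supercomputer resource is integrated behind it. The sketch below follows the typical job-submission pattern from the DIRAC user documentation; exact module paths and options may differ between DIRAC versions, and the job name, executable, and CPU-time value are placeholders.

```python
# Typical DIRAC job submission pattern (illustrative; consult your DIRAC
# version's documentation for exact module paths and options).
from DIRAC.Core.Base import Script
Script.parseCommandLine()                 # initialise the DIRAC configuration

from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

job = Job()
job.setName("hybrid-demo")                # placeholder job name
job.setExecutable("run_simulation.sh")    # placeholder user payload
job.setCPUTime(3600)

dirac = Dirac()
result = dirac.submitJob(job)             # DIRAC routes the job to a suitable resource
print(result)
```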
Hierarchical Multiple Stellar Systems
Abstract
In the astrophysics of hierarchical multiple stellar systems there is a contradiction between their maximum observed multiplicity (up to seven) and its theoretical limitations (about five hundred). To search for hierarchical systems of high multiplicity we have analyzed modern catalogues of wide and close stellar pairs. We have compiled a list of objects that are candidates for stellar systems of maximum multiplicity, including an accurate cross-identification of their components. The presented procedure for cross-matching multiple stellar systems is based on applying criteria of a unified form that exclude objects from the sets of possible candidates for identification. The criteria are constructed using domain knowledge about astronomical objects of a certain type. They do not depend on the source catalogues but take into account knowledge about the specific features of objects depending on the conditions of observation.
Nikolay A. Skvortsov, Leonid A. Kalinichenko, Dana A. Kovaleva, Oleg Y. Malkov
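Cross-identification of components across the catalogues mentioned above relies on formal criteria that rule out candidate matches. The code below is only a generic illustration of the simplest such criterion, an angular-separation cut between two catalogue positions; the paper's actual criteria also encode domain knowledge about object types and observation conditions, which is not modeled here.

```python
import math

def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Angular separation in arcseconds between two positions given in degrees
    (spherical law of cosines; adequate for a sketch, not for high precision)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    cos_sep = (math.sin(dec1) * math.sin(dec2)
               + math.cos(dec1) * math.cos(dec2) * math.cos(ra1 - ra2))
    return math.degrees(math.acos(min(1.0, max(-1.0, cos_sep)))) * 3600.0

def positional_match(src1, src2, radius_arcsec=2.0):
    """Accept a candidate identification only if the positions agree within the radius."""
    sep = angular_separation_arcsec(src1["ra"], src1["dec"], src2["ra"], src2["dec"])
    return sep <= radius_arcsec

print(positional_match({"ra": 10.0000, "dec": 20.0000},
                       {"ra": 10.0003, "dec": 20.0002}))
```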
Observations of Transient Phenomena in BSA Radio Survey at 110 MHz
Abstract
One of the most sensitive radio telescopes at a frequency of 110 MHz is the Big Scanning Antenna (BSA) at the Pushchino Radio Astronomy Observatory of the Lebedev Physical Institute (PRAO LPI, Moscow region, Russia). Since 2012, continuous survey observations have been carried out with the BSA in multibeam mode in the 109–112 MHz frequency band. Currently, 96 beams covering declinations from −8° to +42° are used. Observations are recorded either in 6 frequency bands with a time resolution of 0.1 s or in 32 bands with a time resolution of 0.0125 s. In the fast mode (32 bands, 0.0125 s) the daily data flow is 87.5 GB (32 TB per year). The data provide a great opportunity for both short-term and long-term monitoring of various radio sources. The sources include fast radio transients of different nature, such as fast radio bursts (FRB), possible counterparts of gamma-ray bursts (GRB) and sources of gravitational waves, as well as the Earth’s ionosphere and interplanetary and interstellar plasma. A database has been constructed from the BSA observations. We discuss the database properties and the methods of transient search and allocation in the database. Using this database we were able to detect 83096 individual transient events in the period July 2012 – October 2013, which may correspond to pulsars, scintillating sources, and fast radio transients. We also present first results and statistics of transient classification. In particular, we report the parameters of two candidate new RRAT pulsars.
Vladimir A. Samodurov, Alexey S. Pozanenko, Alexander E. Rodin, Dmitry D. Churakov, Dmitry V. Dumskij, Evgeny A. Isaev, Andrey N. Kazantsev, Sergey V. Logvinenko, Vasily V. Oreshko, Maxim O. Toropov, Maria I. Volobueva
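Searching for dispersed radio transients in multi-band data of this kind typically involves compensating each frequency channel for the cold-plasma dispersion delay and then thresholding the dedispersed time series. The snippet below is a generic illustration of those two steps with the standard dispersion-delay formula and a simple sigma threshold; it is not the actual BSA pipeline, and the array shapes and threshold value are assumptions.

```python
import numpy as np

K_DM = 4.149  # dispersion constant, ms GHz^2 cm^3 pc^-1 (approximate value)

def delay_ms(dm, f_ghz, f_ref_ghz):
    """Cold-plasma dispersion delay (ms) of channel f_ghz relative to f_ref_ghz,
    for a dispersion measure dm in pc cm^-3."""
    return K_DM * dm * (f_ghz ** -2 - f_ref_ghz ** -2)

def dedispersed_series(data, freqs_ghz, dm, dt_ms):
    """Shift each frequency channel by its dispersion delay and sum over channels.
    data has shape (n_channels, n_samples); dt_ms is the sampling interval."""
    f_ref = max(freqs_ghz)
    out = np.zeros(data.shape[1])
    for row, f in zip(data, freqs_ghz):
        shift = int(round(delay_ms(dm, f, f_ref) / dt_ms))
        out += np.roll(row, -shift)        # crude shift; edges wrap in this sketch
    return out

def find_transients(series, n_sigma=6.0):
    """Sample indices where the dedispersed series exceeds a simple sigma threshold."""
    mean, std = series.mean(), series.std()
    return np.where(series > mean + n_sigma * std)[0]
```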

Data Analysis

Frontmatter
Semantics and Verification of Entity Resolution and Data Fusion Operations via Transformation into a Formal Notation
Abstract
Throughout the development of data integration methods and tools, issues of formal semantics definition and verification have arisen. Three levels of integration can be distinguished: data model integration, schema matching and integration, and data integration proper. This paper is aimed at developing methods and tools for formal semantics definition and verification on the third level, the level of the data proper. An approach for defining the formal semantics of high-level data integration programs is proposed. The semantics is defined using a transformation into a formal specification language supported by automatic/interactive provers. The semantics is applied to the verification of structured data integration workflows. Workflow properties to be verified are expressed in the chosen specification language. After that, a semantic specification of the data integration workflow is verified with respect to the required properties. A practical aim of the work is to define a basis for the formal verification of data integration workflows during problem solving in various integration environments.
Sergey Stupnikov
A Study of Several Matrix-Clustering Vertical Partitioning Algorithms in a Disk-Based Environment
Abstract
In this paper we continue our efforts to evaluate matrix clustering algorithms. In our previous study we presented a test environment and the results of preliminary experiments with the “separate” strategy for vertical partitioning. This strategy assigns a separate vertical partition to every cluster found by the algorithm, including the inter-submatrix attribute group. In this paper we introduce two other strategies: the “replicate” strategy, which replicates inter-submatrix attributes to every cluster, and the “retain” strategy, which assigns inter-submatrix attributes to their original clusters. We experimentally evaluate all strategies in a disk-based environment using the standard TPC-H workload and the PostgreSQL DBMS. We start with a study of record reconstruction methods in the PostgreSQL DBMS. Then we apply the partitioning strategies to three matrix clustering algorithms and evaluate both the query performance and the storage overhead of the resulting partitions. Finally, we compare the resulting partitioning schemes with the ideal partitioning scenario.
Viacheslav Galaktionov, George Chernishev, Kirill Smirnov, Boris Novikov, Dmitry A. Grigoriev
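Matrix-clustering approaches to vertical partitioning typically start from an attribute usage matrix (which queries touch which attributes) and derive an attribute affinity matrix that a clustering algorithm then reorders into submatrices. The snippet below shows only that common preprocessing step with an invented workload; it is not one of the three algorithms evaluated in the paper.

```python
import numpy as np

# Attribute usage matrix: rows = queries, columns = attributes (1 = attribute used).
usage = np.array([[1, 1, 0, 0],
                  [1, 1, 0, 0],
                  [0, 0, 1, 1],
                  [0, 1, 1, 1]])
query_freq = np.array([10, 5, 20, 8])   # invented query access frequencies

# Affinity(i, j) = total frequency of queries that access both attributes i and j.
# A matrix-clustering algorithm would reorder this matrix to expose attribute
# groups (here roughly {0, 1} and {2, 3}) as candidate vertical partitions.
affinity = (usage * query_freq[:, None]).T @ usage
print(affinity)
```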
Clustering of Goods and User Profiles for Personalizing in E-commerce Recommender Systems Based on Real Implicit Data
Abstract
This work describes a hybrid approach to the preparation of data for e-commerce recommender systems. The efficiency of recommendations is improved by using different algorithms depending on what is known about the user: little or no information, or a returning visitor whose browsing history is available. In the first case our own Item-Item CF method is used, which solves the cold-start problem. In the second case our own User-User CF method is applied. Both methods are based on clustering of both explicit and implicit user data. This approach can increase the number of items in the basket with fewer clicks compared to methods that do not use implicit data. The applicability of the approach was confirmed by testing on real data obtained from Thaisoap, an online store.
Victor N. Zakharov, Stanislav A. Philippov
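The paper's own Item-Item and User-User CF methods rely on clustering of explicit and implicit data and are not reproduced here; the sketch below only illustrates the basic item-item collaborative-filtering step that such methods build on, namely cosine similarity between item columns of an implicit user-item interaction matrix. The interaction values are invented.

```python
import numpy as np

# Implicit feedback matrix: rows = users, columns = items (e.g. view/purchase counts).
interactions = np.array([[1, 0, 3, 0],
                         [2, 1, 0, 0],
                         [0, 1, 1, 2],
                         [1, 0, 2, 1]], dtype=float)

def item_similarity(m):
    """Cosine similarity between item columns of a user-item matrix."""
    norms = np.linalg.norm(m, axis=0)
    norms[norms == 0] = 1.0              # avoid division by zero for unseen items
    normalized = m / norms
    return normalized.T @ normalized

def recommend_similar(item_idx, m, top_n=2):
    """Items most similar to item_idx, excluding the item itself."""
    sims = item_similarity(m)[item_idx]
    order = np.argsort(sims)[::-1]
    return [i for i in order if i != item_idx][:top_n]

print(recommend_similar(0, interactions))
```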
On Data Persistence Models for Mobile Crowdsensing Applications
Abstract
In this paper, we discuss various models and solutions for saving data in crowdsensing applications. Mobile crowdsensing is a relatively new sensing paradigm that combines the power of the crowd with the sensing capabilities of mobile devices, such as smartphones, wearable devices, cars with mobile equipment, etc. This paradigm has become quite popular due to the huge penetration of mobile devices equipped with multiple sensors. It enables the collection of local information about the environment surrounding individuals (whether people or things) with the help of the sensing features of mobile devices. We provide a review of data persistence solutions (back-end systems, data stores, etc.) for mobile crowdsensing applications. The main goal of our research is to propose a software architecture for mobile crowdsensing in Smart City services. The deployment of such applications in Russia faces limitations due to legal restrictions, which are also discussed in our paper.
Dmitry Namiot, Manfred Sneps-Sneppe

Research Infrastructures

Frontmatter
The European Strategy in Research Infrastructures and Open Science Cloud
Abstract
The European Strategy Forum on Research Infrastructures (ESFRI) was established in 2002, with a mandate from the EU Council to support a coherent and strategy-led approach to policy-making on research infrastructures in Europe, and to facilitate multilateral initiatives leading to the better use and development of Research Infrastructures (RIs), at EU and international level. ESFRI has recently presented its updated 2016 Roadmap which demonstrates the dynamism of the European scientific community and the commitment of Member States to develop new research infrastructures at the European level.
Recently, the European Open Science Cloud (EOSC) initiated activities towards facilitating integration in the area of European e-Infrastructures and connected services between the member states, at the European level and internationally. It aims to accelerate and support the transition to an effective Open Science and Open Innovation in the Digital Single Market by enabling trusted access to services, systems and re-use of scientific data.
This work is focused on the identification of the new features and conclusions of the ESFRI Roadmap 2016 in terms of the methods and procedures that led to the call, the evaluation and selection of the new ESFRI Projects, and the definition and assessment of the ESFRI Landmarks. An analysis of the impact of research infrastructures on structuring the European Research Area and the global research scene, and of their overall contribution to European competitiveness, is also presented. The EOSC challenges, purpose, and initial recommendations for a preparatory phase that will lead to the establishment of this ambitious infrastructure for Open Science are presented as well.
Konstantinos M. Giannoutakis, Dimitrios Tzovaras
Creating Inorganic Chemistry Data Infrastructure for Materials Science Specialists
Abstract
An analysis is carried out of the large infrastructure projects implemented worldwide for the information support of materials science specialists (MGI, MDF, NoMaD, etc.). A brief summary of the Russian information resources in the field of inorganic chemistry and materials science is given. A project of an infrastructure for providing Russian specialists with data in this area is proposed.
Nadezhda N. Kiselyova, Victor A. Dudarev
Visual Analytics of Multidimensional Dynamic Data with a Financial Case Study
Abstract
This work deals with the problem of analyzing time-variant objects, each characterized by a set of numerical parameters. The visualization method is used to conduct the analysis. Insights of interest to the analyst about the considered objects are obtained in several steps. In the first step, a geometric interpretation of the initial data is introduced. Then the introduced geometric model undergoes several transformations, which correspond to solving the first problem of the visualization method, namely obtaining visual representations of the data. The next step for the analyst is to analyze the generated visual images and to interpret the results in terms of the considered objects. We propose an algorithm for solving this problem and describe the developed interactive visualization software that implements it. We demonstrate how, with this software, the user can obtain insights regarding the creation and disappearance of clusters and bunches of objects and find invariants in the changes of the initial data.
Dmitry D. Popov, Igal E. Milman, Victor V. Pilyugin, Alexander A. Pasko
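The pipeline above maps multidimensional, time-varying objects to geometric representations that the analyst inspects for clusters and invariants. The paper's own geometric transformations are not reproduced here; the sketch below shows a generic alternative, projecting each time step of a small synthetic dataset onto its first two principal components so that cluster formation over time could be plotted. The dataset sizes and drift are invented.

```python
import numpy as np

def pca_2d(points):
    """Project points (n_objects x n_params) onto their first two principal components."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
# Synthetic data: 20 objects with 5 numerical parameters observed at 3 time steps,
# with the first parameter drifting over time.
frames = [rng.normal(size=(20, 5)) + t * np.array([1, 0, 0, 0, 0]) for t in range(3)]
for t, frame in enumerate(frames):
    xy = pca_2d(frame)
    print(f"t={t}: first object projected to {xy[0].round(2)}")
```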
Metadata for Experiments in Nanoscience Foundries
Abstract
Metadata is a key aspect of data management. This paper describes the work of the NFFA-EUROPE project on the design of a metadata standard for nanoscience, with a focus on the data lifecycle and the needs of data practitioners who manage data resulting from nanoscience experiments. The methodology and the resulting high-level metadata model are presented. The paper explains and illustrates the principles of metadata design for data-intensive research. This is of value to data management practitioners in all branches of research and technology that imply a so-called “visitor science” model, where multiple researchers apply for a share of a certain resource on large facilities (instruments).
Vasily Bunakov, Tom Griffin, Brian Matthews, Stefano Cozzini
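To make the idea of a high-level metadata model concrete, the fragment below sketches what a machine-readable record for a single “visitor science” experiment might look like. All field names and values are hypothetical illustrations of typical descriptive, administrative, and provenance elements; they are not the actual NFFA-EUROPE metadata standard.

```python
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class ExperimentRecord:
    """Hypothetical metadata record; fields are illustrative, not NFFA-EUROPE's model."""
    title: str
    facility: str                 # large facility / foundry hosting the visit
    instrument: str
    proposal_id: str              # links the data to the visiting researcher's proposal
    investigators: List[str] = field(default_factory=list)
    sample: str = ""
    technique: str = ""
    start_date: str = ""          # ISO 8601 date
    data_doi: str = ""            # persistent identifier assigned on publication

record = ExperimentRecord(
    title="Example nanostructure growth run",
    facility="Example Foundry", instrument="SEM-01",
    proposal_id="2016-042", investigators=["A. Researcher"],
    technique="electron microscopy", start_date="2016-10-11")
print(asdict(record))
```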

Position Paper

Frontmatter
Metrics and Rankings: Myths and Fallacies
Abstract
In this paper we provide an introduction to the field of bibliometrics. In particular, we first briefly describe its beginnings and evolution, and mention the main research fora as well. We then categorize metrics according to their entity scope: metrics for journals, conferences, and authors. Several rankings have appeared based on such metrics. It is argued that these metrics and rankings should be treated with caution, in a relative rather than an absolute manner. Primarily, it is human expertise that can rigorously evaluate the above entities.
Yannis Manolopoulos, Dimitrios Katsaros
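As a concrete example of an author-level metric of the kind surveyed above, the snippet below computes the h-index from a list of citation counts: h is the largest number such that the author has h papers with at least h citations each. The sample counts are invented.

```python
def h_index(citations):
    """Largest h such that at least h papers have >= h citations each."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(counts, start=1):
        if c >= rank:
            h = rank          # the top `rank` papers all have >= rank citations
        else:
            break
    return h

print(h_index([25, 8, 5, 3, 3, 1]))  # invented counts -> h-index of 3
```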
Backmatter
Metadata
Title
Data Analytics and Management in Data Intensive Domains
Editors
Leonid Kalinichenko
Sergei O. Kuznetsov
Yannis Manolopoulos
Copyright Year
2017
Electronic ISBN
978-3-319-57135-5
Print ISBN
978-3-319-57134-8
DOI
https://doi.org/10.1007/978-3-319-57135-5
