
## About this book

This book constitutes revised selected papers from the 13th International Conference on Data Integration in the Life Sciences, DILS 2018, held in Hannover, Germany, in November 2018.

The 5 full, 8 short, 3 poster and 4 demo papers presented in this volume were carefully reviewed and selected from 22 submissions. The papers are organized in topical sections named: big biomedical data integration and management; data exploration in the life sciences; biomedical data analytics; and big biomedical applications.

## Table of Contents

### Do Scaling Algorithms Preserve Word2Vec Semantics? A Case Study for Medical Entities

Abstract
The exponential increase of scientific publications in the biomedical field challenges access to scientific information, which is primarily encoded by semantic relationships between medical entities, such as active ingredients, diseases, or genes. Neural language models, such as Word2Vec, offer new ways of automatically learning semantically meaningful entity relationships even from large text corpora. They offer high scalability and deliver better accuracy than comparable approaches. Still, the models first have to be tuned by testing different training parameters. Arguably, the most critical parameter is the number of training dimensions for the neural network, and testing different numbers of dimensions individually is time-consuming: a single training iteration on a large corpus usually takes hours or even days. In this paper we show a more efficient way to determine the optimal number of dimensions with respect to quality measures such as precision/recall. We show that the quality of results obtained with simpler and easier-to-compute scaling approaches like MDS or PCA correlates strongly with the expected quality when using the same number of Word2Vec training dimensions. This has even more impact if, after initial Word2Vec training, only a limited number of entities and their respective relations are of interest.
Janus Wawrzinek, José María González Pinto, Philipp Markiewka, Wolf-Tilo Balke

### Combining Semantic and Lexical Measures to Evaluate Medical Terms Similarity

Abstract
The use of similarity measures in various domains is a cornerstone for different tasks ranging from ontology alignment to information retrieval. To this end, existing metrics can be classified into several categories, among which the lexical and semantic families of similarity measures predominate but have rarely been combined to complete the aforementioned tasks. In this paper, we propose an original approach combining lexical and ontology-based semantic similarity measures to improve the evaluation of term relatedness. We validate our approach through a set of experiments based on a reference corpus constructed by domain experts from the medical field and further evaluate the impact of ontology evolution on the semantic similarity measures used.
Silvio Domingos Cardoso, Marcos Da Silveira, Ying-Chi Lin, Victor Christen, Erhard Rahm, Chantal Reynaud-Delaître, Cédric Pruski
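
The general idea of combining a lexical and an ontology-based semantic measure can be illustrated with a minimal sketch. This is not the authors' method: the toy hierarchy `PARENT`, the path-based semantic score, the `difflib`-based lexical score and the linear weighting are all assumptions chosen only for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical toy is-a hierarchy (child -> parent); not taken from a real ontology.
PARENT = {
    "myocardial infarction": "heart disease",
    "angina pectoris": "heart disease",
    "heart disease": "cardiovascular disease",
    "stroke": "cardiovascular disease",
    "cardiovascular disease": "disease",
}


def ancestors(term):
    """Return the path from a term up to the root, the term itself included."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lexical_similarity(a, b):
    """String-based similarity in [0, 1] computed on the lowercased labels."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def semantic_similarity(a, b):
    """Path-based similarity: 1 / (1 + edges between a and b via their
    lowest common ancestor); 0.0 if the terms share no ancestor."""
    path_a, path_b = ancestors(a), ancestors(b)
    depth_in_b = {node: i for i, node in enumerate(path_b)}
    for i, node in enumerate(path_a):
        if node in depth_in_b:
            return 1.0 / (1.0 + i + depth_in_b[node])
    return 0.0


def combined_similarity(a, b, weight=0.5):
    """Linear combination of both scores; the weight is an assumed parameter."""
    return weight * lexical_similarity(a, b) + (1 - weight) * semantic_similarity(a, b)


if __name__ == "__main__":
    for pair in [("myocardial infarction", "angina pectoris"),
                 ("myocardial infarction", "stroke")]:
        print(pair, round(combined_similarity(*pair), 3))
```

In this sketch, two terms with dissimilar labels can still score well if they sit close together in the hierarchy, which is the motivation for combining both families of measures.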

### Construction and Visualization of Dynamic Biological Networks: Benchmarking the Neo4J Graph Database

Abstract
Genome analysis is a major precondition for future advances in the life sciences. The complex organization of genome data and the interactions between genomic components can often be modeled and visualized in graph structures. In this paper we propose the integration of several data sets into a graph database. We study the aptness of the database system in terms of analysis and visualization of a genome regulatory network (GRN) by running a benchmark on it. Major advantages of using a database system are the modifiability of the data set, the immediate visualization of query results as well as built-in indexing and caching features.
Lena Wiese, Chimi Wangmo, Lukas Steuernagel, Armin O. Schmitt, Mehmet Gültas

### A Knowledge-Driven Pipeline for Transforming Big Data into Actionable Knowledge

Abstract
Big biomedical data has grown exponentially during the last decades, as have the applications that demand the understanding and discovery of the knowledge encoded in available big data. In order to address these requirements while scaling up to the dominant dimensions of big biomedical data (volume, variety, and veracity), novel data integration techniques need to be defined. In this paper, we devise a knowledge-driven approach that relies on Semantic Web technologies such as ontologies, mapping languages, and linked data to generate a knowledge graph that integrates big data. Furthermore, query processing and knowledge discovery methods are implemented on top of the knowledge graph to enable exploration and pattern discovery. We report on the results of applying the proposed knowledge-driven approach in the EU-funded project iASiS (http://project-iasis.eu/) in order to transform big data into actionable knowledge, thus paving the way for precision medicine and health policy making.
Maria-Esther Vidal, Kemele M. Endris, Samaneh Jozashoori, Guillermo Palma

### Leaving No Stone Unturned: Using Machine Learning Based Approaches for Information Extraction from Full Texts of a Research Data Warehouse

Abstract
Data in healthcare and routine medical treatment is growing fast. Because of this growth and the variety of the data, possible correlations within it are becoming ever more complex. Popular tools that facilitate the daily routine of clinical researchers are increasingly based on machine learning (ML) algorithms. Such tools might facilitate data management, data integration or even content classification. Besides commercial products, there are many solutions that users develop themselves for their own specific research question or task. One of these tasks is described in this work: classifying the Weber fracture, an ankle joint fracture, from radiological findings with the help of supervised machine learning algorithms. To do so, the findings were first processed with common natural language processing (NLP) methods. For the classification, we used the bag-of-words approach to bring together the medical findings on the one hand and the metadata of the findings on the other hand, and compared several common classifiers to obtain the best results. To conduct this study, we used the data and the technology of the Enterprise Clinical Research Data Warehouse (ECRDW) of Hannover Medical School. This paper shows the implementation of machine learning and NLP techniques in the data warehouse integration process in order to provide consolidated, processed and qualified data to be queried for teaching and research purposes.
Johanna Fiebeck, Hans Laser, Hinrich B. Winther, Svetlana Gerbel

### Towards Research Infrastructures that Curate Scientific Information: A Use Case in Life Sciences

Abstract
Scientific information communicated in scholarly literature remains largely inaccessible to machines. The global scientific knowledge base is little more than a collection of (digital) documents. The main reason is that the document is the principal form of communication and, since underlying data, software and other materials mostly remain unpublished, the scholarly article is essentially the only form used to communicate scientific information. Based on a use case in life sciences, we argue that virtual research environments and semantic technologies are transforming the capability of research infrastructures to systematically acquire and curate machine-readable scientific information communicated in scholarly literature.
Markus Stocker, Manuel Prinz, Fatemeh Rostami, Tibor Kempf

### Interactive Visualization for Large-Scale Multi-factorial Research Designs

Abstract
Recent publications have shown that the majority of studies cannot be adequately reproduced. The underlying causes seem to be diverse. Usage of the wrong statistical tools can lead to the reporting of dubious correlations as significant results. Missing information from lab protocols or other metadata can make verification impossible. Especially with the advent of Big Data in the life sciences and the associated measurement of thousands of multi-omics samples, researchers depend more than ever on adequate metadata annotation. In recent years, the scientific community has created multiple experimental design standards, which try to define the minimum information necessary to make experiments reproducible. Tools help with the creation or analysis of this abundance of metadata, but are often still based on spreadsheet formats and lack intuitive visualizations. We present an interactive graph visualization tailored to experiments using a factorial experimental design. Our solution summarizes sample sources and extracted samples based on the similarity of independent variables, enabling a quick grasp of the scientific question at the core of the experiment even for large studies. We support the ISA-Tab standard, enabling visualization of diverse omics experiments. As part of our platform for data-driven biomedical research, our implementation offers additional features to detect the status of data generation and more.
Andreas Friedrich, Luis de la Garza, Oliver Kohlbacher, Sven Nahnsen

### FedSDM: Semantic Data Manager for Federations of RDF Datasets

Abstract
Linked open data movements have been followed successfully in different domains; thus, the number of publicly available RDF datasets and linked-data-based applications has increased considerably during the last decade. Particularly in the Life Sciences, RDF datasets are utilized to represent diverse concepts, e.g., proteins, genes, mutations, diseases, drugs, and side effects. Although these RDF datasets are publicly accessible, exploring them requires an understanding of their main characteristics, e.g., their vocabularies and the connections among them. To tackle these issues, we present and demonstrate FedSDM, a semantic data manager for federations of RDF datasets. Attendees will be able to explore the relationships among the RDF datasets in a federation, as well as the characteristics of the RDF classes included in each RDF dataset (https://github.com/SDM-TIB/FedSDM).
Kemele M. Endris, Maria-Esther Vidal, Sören Auer

### Data Integration for Supporting Biomedical Knowledge Graph Creation at Large-Scale

(Poster Paper)
Abstract
In recent years, following FAIR and open data principles, the amount of available big data, including biomedical data, has increased exponentially. In order to extract knowledge, these data should be curated, integrated, and semantically described. Accordingly, several semantic integration techniques have been developed; albeit effective, they may suffer from scalability issues with respect to different properties of big data. Even scaled-up approaches may be highly costly because the tasks of semantification, curation, and integration are performed independently. To overcome these issues, we devise ConMap, a semantic integration approach which exploits knowledge encoded in ontologies to describe mapping rules in a way that performs all these tasks at the same time. The empirical evaluation of ConMap performed on different data sets shows that ConMap can significantly reduce the time required for knowledge graph creation, by up to 70% of the time consumed by a traditional approach. Accordingly, the experimental results suggest that ConMap can be a semantic data integration solution that embodies FAIR principles, specifically in terms of interoperability.
Samaneh Jozashoori, Tatiana Novikova, Maria-Esther Vidal

### DISBi: A Flexible Framework for Integrating Systems Biology Data

Abstract
Systems biology aims at understanding an organism in its entirety. This objective can only be achieved with the joint effort of specialized work groups. These collaborating groups need a centralized platform for data exchange. Instead, data is often managed in an uncoordinated way using heterogeneous data formats. Such circumstances present a major hindrance to gaining a global understanding of the data and to automating analysis routines.
DISBi is a framework for creating an integrated online environment that solves these problems. It enables researchers to filter, integrate and analyze data directly in the browser. A DISBi application dynamically adapts to its data model. Thus DISBi offers a solution for a wide range of systems biology projects.
An example installation is available at disbi.org. Source code and documentation are available from https://github.com/DISBi/django-disbi.
Rüdiger Busche, Henning Dannheim, Dietmar Schomburg

### Using Machine Learning to Distinguish Infected from Non-infected Subjects at an Early Stage Based on Viral Inoculation

Abstract
Gene expression profiles help to capture the functional state in the body and to determine dysfunctional conditions in individuals. In principle, respiratory and other viral infections can be judged from blood samples; however, it has not yet been determined which gene expression levels are predictive, in particular for the early transition states of the disease onset. For these reasons, we analyse the expression levels of infected and non-infected individuals to determine genes (potential biomarkers) which are active during the progression of the disease. We use machine learning (ML) classification algorithms to determine the state of respiratory viral infections in humans, exploiting time-dependent gene expression measurements; the study comprises four respiratory viruses (H1N1, H3N2, RSV, and HRV), seven distinct clinical studies and 104 healthy test candidates overall. From the overall set of 12,023 genes, we identified the 10 top-ranked genes which proved to be most discriminatory with regard to prediction of the infection state. Our two models focus on the time stamp nearest to $$t = 48$$ hours and nearest to $$t =$$ “Onset Time”, denoting the symptom onset (at different time points) according to the candidate’s specific immune system response to the viral infection. We evaluated algorithms including k-Nearest Neighbour (k-NN), Random Forest, linear Support Vector Machine (SVM), and SVM with radial basis function (RBF) kernel, in order to classify whether the gene expression sample collected at early time point t is infected or not infected. The “Onset Time” appears to play a vital role in the prediction and in the identification of the ten most discriminatory genes.
Ghanshyam Verma, Alokkumar Jha, Dietrich Rebholz-Schuhmann, Michael G. Madden

### Automated Coding of Medical Diagnostics from Free-Text: The Role of Parameters Optimization and Imbalanced Classes

Abstract
The extraction of codes from Electronic Health Records (EHR) data is an important task because extracted codes can be used for different purposes such as billing and reimbursement, quality control, epidemiological studies, and cohort identification for clinical trials. The codes are based on standardized vocabularies. Diagnostics, for example, are frequently coded using the International Classification of Diseases (ICD), which is a taxonomy of diagnosis codes organized in a hierarchical structure. Extracting codes from free-text medical notes in EHR, such as the discharge summary, requires reviewing patient data in search of information that can be coded in a standardized manner. Manual code assignment by humans is a complex and time-consuming process. The use of machine learning and natural language processing approaches to automate the process of ICD coding has been receiving increasing attention. In this article, we investigate the use of Support Vector Machines (SVM) and the binary relevance method for multi-label classification in the task of automatic ICD coding from free-text discharge summaries. In particular, we explored the role of SVM parameter optimization and class weighting for addressing imbalanced classes. Experiments conducted with the Medical Information Mart for Intensive Care III (MIMIC III) database reached a macro-averaged F1 score of 49.86% for the 100 most frequent diagnostics. Our findings indicated that optimization of SVM parameters and the use of class weighting can improve the effectiveness of the classifier.
Luiz Virginio, Julio Cesar dos Reis
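
Two notions from this abstract, class weighting for imbalanced labels and macro-averaged F1 under binary relevance, can be made concrete with a minimal sketch. It does not reproduce the authors' experiments: the inverse-frequency weighting formula `n / (k * count)` is one common heuristic assumed here, and the label vectors and the keys `code_A` and `code_B` are hypothetical toy data.

```python
def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count).
    This is an assumed heuristic, not necessarily the one used in the paper."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def f1_binary(y_true, y_pred):
    """F1 score for one binary label (1 = code assigned, 0 = not assigned)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(true_by_code, pred_by_code):
    """Binary relevance treats every code as its own binary problem;
    macro-F1 is the unweighted mean of the per-code F1 scores."""
    scores = [f1_binary(true_by_code[c], pred_by_code[c]) for c in true_by_code]
    return sum(scores) / len(scores)


# Hypothetical toy data: four discharge summaries, two diagnosis codes.
TRUE = {"code_A": [1, 1, 0, 0], "code_B": [0, 0, 0, 1]}
PRED = {"code_A": [1, 0, 0, 0], "code_B": [0, 0, 1, 1]}

if __name__ == "__main__":
    print("weights for code_B:", balanced_class_weights(TRUE["code_B"]))
    print("macro-F1:", round(macro_f1(TRUE, PRED), 4))
```

Because macro-F1 averages over codes without regard to their frequency, rare codes count as much as frequent ones, which is why class weighting can matter for this metric.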

### A Learning-Based Approach to Combine Medical Annotation Results

(Short Paper)
Abstract
There exist many tools to annotate mentions of medical entities in documents with concepts from biomedical ontologies. To improve the overall quality of the annotation process, we propose the use of machine learning to combine the results of different annotation tools. We comparatively evaluate the results of the machine-learning based approach with the results of the single tools and a simpler set-based result combination.
Victor Christen, Ying-Chi Lin, Anika Groß, Silvio Domingos Cardoso, Cédric Pruski, Marcos Da Silveira, Erhard Rahm

### Knowledge Graph Completion to Predict Polypharmacy Side Effects

Abstract
The polypharmacy side effect prediction problem considers cases in which two drugs taken individually do not result in a particular side effect; however, when the two drugs are taken in combination, the side effect manifests. In this work, we demonstrate that multi-relational knowledge graph completion achieves state-of-the-art results on the polypharmacy side effect prediction problem. Empirical results show that our approach is particularly effective when the protein targets of the drugs are well-characterized. In contrast to prior work, our approach provides more interpretable predictions and hypotheses for wet lab validation.
Brandon Malone, Alberto García-Durán, Mathias Niepert

### Lung Cancer Concept Annotation from Spanish Clinical Narratives

Abstract
The recent rapid increase in the generation of clinical data and the rapid development of computational science make it possible to extract new insights from massive datasets in the healthcare industry. Oncological Electronic Health Records (EHRs) are creating rich databases that document patients’ histories, and they potentially contain many patterns that can help in better management of the disease. However, these patterns are locked within the free-text (unstructured) portions of EHRs, which limits the ability of health professionals to extract useful information from them and, ultimately, to perform the Query and Answering (Q&A) process accurately. The Information Extraction (IE) process requires Natural Language Processing (NLP) techniques to assign semantics to these patterns. Therefore, in this paper, we analyze the design of annotators for specific lung cancer concepts that can be integrated into the Apache Unstructured Information Management Architecture (UIMA) framework. In addition, we explain the details of the generation and storage of annotation outcomes.
Marjan Najafabadipour, Juan Manuel Tuñas, Alejandro Rodríguez-González, Ernestina Menasalvas

### Linked Data Based Multi-omics Integration and Visualization for Cancer Decision Networks

Abstract
Visualization of Gene Expression (GE) is a challenging task since the number of genes and their associations are difficult to predict across various sets of biological studies. GE can be used to understand tissue-gene-protein relationships. Currently, the heatmap is the standard visualization technique for depicting GE data. However, heatmaps only cover clusters of highly dense regions. They do not provide interaction, functional annotation, or a pooled understanding from higher to lower expression. In the present paper, we propose a graph-based technique based on color encoding from higher to lower expression, along with functional annotation. This visualization technique is highly interactive, whereas heatmaps are mainly static maps. The visualization system explains the association between overlapping genes with and without tissue types. Traditional visualization techniques (namely heatmaps) generally explain each association in a distinct map. For example, overlapping genes and their interactions, based on co-expression and expression cut-off, require three distinct heatmaps. We demonstrate the usability using an ortholog study of GE and visualize GE using GExpressionMap. We further compare and benchmark our approach against existing visualization techniques. It also reduces the need to further cluster the expressed gene networks in order to understand over- and under-expression. Further, it provides interaction based on the co-expression network, which itself creates co-expression clusters. GExpressionMap provides a unique graph-based visualization for GE data with their functional annotation and the associated interactions among the DEGs (Differentially Expressed Genes).
Alokkumar Jha, Yasar Khan, Qaiser Mehmood, Dietrich Rebholz-Schuhmann, Ratnesh Sahay

### The Hannover Medical School Enterprise Clinical Research Data Warehouse: 5 Years of Experience

Abstract
The reuse of routine healthcare data for research purposes is challenging not only because of the volume of the data but also because of the variety of clinical information systems. A data-warehouse-based approach enables researchers to use heterogeneous data sets by consolidating and aggregating data from various sources. This paper presents the Enterprise Clinical Research Data Warehouse (ECRDW) of the Hannover Medical School (MHH). ECRDW has been under development since 2011 using the Microsoft SQL Server Data Warehouse and Business Intelligence technology and has operated since 2013 as an interdisciplinary platform for research-relevant questions at the MHH. ECRDW incrementally integrates heterogeneous data sources and currently contains (as of 8/2018) data of more than 2.1 million distinct patients with more than 500 million single data points (diagnoses, lab results, vital signs, medical records, as well as metadata on linked data, e.g. biospecimens or images).
Svetlana Gerbel, Hans Laser, Norman Schönfeld, Tobias Rassmann

### User-Driven Development of a Novel Molecular Tumor Board Support Tool

Abstract
Nowadays personalized medicine is of increasing importance, especially in the field of cancer therapy. More and more hospitals are conducting molecular tumor boards (MTBs), bringing together experts from various fields to discuss patient cases, taking into account genetic information from sequencing data. Yet there is still a lack of tools to support collaborative exploration and decision making. To fill this gap, we developed a novel user interface to support MTBs. A task analysis of MTBs currently held at German hospitals showed that there is less collaborative exploration during the meeting than expected, with a large part of the information search being done during the MTB preparation. Thus we designed our interface to support both situations: a single user preparing the MTB, and the presentation of information and group discussion during the meeting.
Marc Halfmann, Holger Stenzhorn, Peter Gerjets, Oliver Kohlbacher, Uwe Oestermeier

### Using Semantic Programming for Developing a Web Content Management System for Semantic Phenotype Data

Abstract
We present a prototype of a semantic version of Morph·D·Base that is currently in development. It is based on SOCCOMAS, a semantic web content management system that is controlled by a set of source code ontologies together with a Java-based middleware and our Semantic Programming Ontology (SPrO). The middleware interprets the descriptions contained in the source code ontologies and dynamically decodes and executes them to produce the prototype. The Morph·D·Base prototype in turn allows the generation of instance-based semantic morphological descriptions through the completion of input forms. User input to these forms generates data in the form of semantic graphs. We show with examples how the prototype has been described in the source code ontologies using SPrO and demonstrate live how the middleware interprets these descriptions and dynamically produces the application.
Lars Vogt, Roman Baum, Christian Köhler, Sandra Meid, Björn Quast, Peter Grobe
