
About this book

This book constitutes the refereed proceedings of the 10th International Conference on Data Integration in the Life Sciences, DILS 2014, held in Lisbon, Portugal, in July 2014. The 9 revised full papers and the 5 short papers included in this volume were carefully reviewed and selected from 20 submissions. The papers cover a range of important topics such as data integration platforms and applications; biodiversity data management; ontologies and visualization; linked data and query processing.



Data Integration Platforms and Applications

An Asset Management Approach to Continuous Integration of Heterogeneous Biomedical Data

Increasingly, advances in biomedical research are the result of combining and analyzing heterogeneous data types from different sources, spanning genomic, proteomic, imaging, and clinical data. Yet despite the proliferation of data-driven methods, tools to support the integration and management of large collections of data for purposes of data driven discovery are scarce, leaving scientists with ad hoc and inefficient processes. The scientific process could benefit significantly from lightweight methods for data integration that allow for exploratory, incrementally refined integration of heterogeneous data. In this paper, we address this problem by introducing a new asset management based approach designed to support continuous integration of biomedical data. We describe the system and our experiences using it in the context of several scientific applications.
Robert E. Schuler, Carl Kesselman, Karl Czajkowski

Mining Linked Open Data: A Case Study with Genes Responsible for Intellectual Disability

Linked Open Data (LOD) constitute a unique dataset that is in a standard format, partially integrated, and facilitates connections with domain knowledge represented within semantic web ontologies. Increasing amounts of biomedical data provided as LOD consequently offer novel opportunities for knowledge discovery in biomedicine. However, most data mining methods are neither adapted to LOD format, nor adapted to consider domain knowledge. We propose in this paper an approach for selecting, integrating, and mining LOD with the goal of discovering genes responsible for a disease. The selection step relies on a set of choices made by a domain expert to isolate relevant pieces of LOD. Because these pieces are potentially not linked, an integration step is required to connect unlinked pieces. The resulting graph is subsequently mined using Inductive Logic Programming (ILP), which presents two main advantages. First, the input format compliant with ILP is close to the format of LOD. Second, domain knowledge can be added to this input and considered by ILP. We have implemented and applied this approach to the characterization of genes responsible for intellectual disability. On the basis of this real-world use case, we present an evaluation of our mining approach and discuss its advantages and drawbacks for the mining of biomedical LOD.
Gabin Personeni, Simon Daget, Céline Bonnet, Philippe Jonveaux, Marie-Dominique Devignes, Malika Smaïl-Tabbone, Adrien Coulet

Data Integration between Swedish National Clinical Health Registries and Biobanks Using an Availability System

Linking biobank data, such as molecular profiles, with clinical phenotypes is of great importance in epidemiological and predictive studies. A comprehensive overview of various data sources that can be combined in order to power up a study is a key factor in the design. Clinical data stored in health registries and biobank data in research projects are commonly provisioned in different database systems and governed by separate organizations, making the integration process challenging and hampering biomedical investigations. We here describe the integration of data on prostate cancer from a clinical health registry with data from a biobank, and its provisioning in the SAIL availability system. We demonstrate the implications of using the actual raw data, data transformed to availability data, and availability data which has been subjected to anonymization techniques to reduce the risk of re-identification. Our results show that an availability system such as SAIL with integrated clinical and biobank data can be a valuable tool for planning new studies and finding interesting subsets to investigate further. We also show that an availability system can deliver useful insights even when the data has been subjected to anonymization techniques.
Ola Spjuth, Jani Heikkinen, Jan-Eric Litton, Juni Palmgren, Maria Krestyaninova

Biodiversity Data Management

Data Management Experiences and Best Practices from the Perspective of a Plant Research Institute

Research in life sciences faces increasing amounts of cross-domain data, also known as “big data”. This has notable effects on IT departments and the dry lab desk alike. In this paper, we report on experiences from a decade of data management in a plant research institute. We explain the switch from personally managed files and heterogeneous information systems towards a centrally organised storage management. In particular, we discuss lessons that were learned within the last decade of productive research, data generation and software development from the perspective of a modern plant research institute and present the results of a strategic realignment of the data management infrastructure. Finally, we summarise the challenges which were solved and the questions which are still open.
Daniel Arend, Christian Colmsee, Helmut Knüpffer, Markus Oppermann, Uwe Scholz, Danuta Schüler, Stephan Weise, Matthias Lange

A Semantic Web Faceted Search System for Facilitating Building of Biodiversity and Ecosystems Services

To address biodiversity issues in ecology and to assess the consequences of ecosystem changes, large quantities of long-term observational data from multiple datasets need to be integrated and characterized in a unified way. Linked open data initiatives in ecology aim at promoting and sharing such observational data at web scale. Here we present a web infrastructure, named Thesauform, that fully exploits the key principles of the semantic web and associated key data standards in order to guide the scientific community of experts to collectively construct, manage, visualize and query a SKOS thesaurus. The study of a thesaurus dedicated to plant functional traits demonstrates the potential of this approach. A point of great interest is to provide each expert with the opportunity to generate new knowledge and to draw novel plausible conclusions from linked data sources. Consequently, it is required to consider both the scientific topic and the objects of interest for a community of expertise. The goal is to enable users to deal with a small number of familiar and conceptual dimensions, or in other terms, facets. In this regard, a faceted search system, based on SKOS collections and enabling thesaurus browsing according to each end user's requirements, is expected to greatly enhance data discovery in the context of biodiversity studies.
Marie-Angélique Laporte, Isabelle Mougenot, Eric Garnier, Ulrike Stahl, Lutz Maicher, Jens Kattge

Handling Multiple Foci in Graph Databases

Scientific research has become data-intensive and data-dependent, with distributed, multidisciplinary teams creating and sharing their findings. Graph databases are being increasingly considered as a computational means to loosely integrate such data, in particular when relationships among data and the data itself are equally important. However, a problem to be faced in this context is that of multiple foci – where a focus, here, is a perspective on the data, for a particular research team and context. This paper describes a conceptual framework for the construction of arbitrary foci on graph databases, to help solve this problem. The framework, under construction, is illustrated using examples based on needs of teams involved in biodiversity research.
Jaudete Daltio, Claudia Bauzer Medeiros

Ontologies and Visualization

Completing the is-a Structure of Biomedical Ontologies

Ontologies in the biomedical domain are becoming a key element for data integration and search. The usefulness of the applications which use ontologies is often directly influenced by the quality of ontologies, as incorrect or incomplete ontologies might lead to wrong or incomplete results for the applications. Therefore, there is an increasing need for repairing defects in ontologies. In this paper we focus on completing ontologies. We provide an algorithm for completing the is-a structure in \(\mathcal{EL}\) ontologies which covers many biomedical ontologies. Further, we present an implemented system based on the algorithm as well as an evaluation using three biomedical ontologies.
Zlatan Dragisic, Patrick Lambrix, Fang Wei-Kleiner

Annotation-Based Feature Extraction from Sets of SBML Models

Model repositories such as BioModels Database provide computational models of biological systems for the scientific community. These models contain rich semantic annotations that link model entities to concepts in well-established bio-ontologies such as Gene Ontology. Consequently, thematically similar models are likely to share similar annotations. Based on this assumption, we argue that semantic annotations are a suitable tool to characterize sets of models. These characteristics can then help to classify models, to identify additional features for model retrieval tasks, or to enable the comparison of sets of models. In this paper, we present four methods for annotation-based feature extraction from model sets. All methods have been used with four different model sets in SBML format and taken from BioModels Database. To characterize each of these sets, we analyzed and extracted concepts from three frequently used ontologies for SBML models, namely Gene Ontology, ChEBI and SBO. We find that three of the four tested methods are suitable to determine characteristic features for model sets. The selected features vary depending on the underlying model set, and they are also specific to the chosen model set. We show that the identified features map on concepts that are higher up in the hierarchy of the ontologies than the concepts used for model annotations. Our analysis also reveals that the information content of concepts in ontologies and their usage for model annotation do not correlate.
Rebekka Alm, Dagmar Waltemath, Olaf Wolkenauer, Ron Henkel

REX – A Tool for Discovering Evolution Trends in Ontology Regions

A large number of life science ontologies have been developed to support different application scenarios such as gene annotation or functional analysis. The continuous accumulation of new insights and knowledge affects specific portions in ontologies and thus leads to their adaptation. Therefore, it is valuable to study which ontology parts have been extensively modified or remained unchanged. Users can monitor the evolution of an ontology to improve its further development or apply the knowledge in their applications. Here we present REX (Region Evolution Explorer), a web-based system for exploring the evolution of ontology parts (regions). REX provides an interactive and user-friendly interface to identify (un)stable regions in large life science ontologies and is available at .
Victor Christen, Anika Groß, Michael Hartung

Towards Visualizing the Alignment of Large Biomedical Ontologies

To successfully integrate biomedical data it is crucial to establish meaningful relationships between the ontologies used to annotate this data. Recent developments in ontology alignment techniques, including our AgreementMakerLight system, have been successful in matching very large biomedical ontologies. However, the visualization of these alignments is still a challenge.
We have developed a graphical user interface for AgreementMakerLight that follows its core focus on computational efficiency and the handling of very large ontologies. It allows non-expert users to easily align biomedical ontologies, offering a wide selection of matching strategies and algorithms, with a particular focus on the use of external background knowledge. The visualization of the resulting alignment is based on linked subgraphs which are generated according to search queries over the full graph composed by the matched ontologies and the mappings between them. This strategy decreases the need for computational resources and improves the visualization experience, by letting the user focus on selected areas of the alignment.
Catia Pesquita, Daniel Faria, Emanuel Santos, Jean-Marc Neefs, Francisco M. Couto

Linked Data and Query Processing

Discovering Relations between Indirectly Connected Biomedical Concepts

The complexity and scale of the knowledge in the biomedical domain have motivated research work towards mining heterogeneous data from structured and unstructured knowledge bases. To this end, it is necessary to combine facts in order to formulate hypotheses or draw conclusions about the domain concepts. In this work we attempt to address this problem by using indirect knowledge connecting two concepts in a graph to identify hidden relations between them. The graph represents concepts as vertices and relations as edges, stemming from structured (ontologies) and unstructured (text) data. In this graph we attempt to mine path patterns which potentially characterize a biomedical relation. For our experimental evaluation we focus on two frequent relations, namely “has target”, and “may treat”. Our results suggest that relation discovery using indirect knowledge is possible, with an AUC that can reach up to 0.8. Finally, analysis of the results indicates that the models can successfully learn expressive path patterns for the examined relations.
Dirk Weissenborn, Michael Schroeder, George Tsatsaronis
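
As a rough illustration of what a "path pattern" between two indirectly connected concepts might look like, the sketch below collects the sequences of relation labels along short paths in a small labelled graph. It is not the authors' method: the function name `path_patterns`, the triple format, the length limit `max_len`, and all concept and relation names in the toy data are assumptions made up for this example.

```python
from collections import defaultdict


def path_patterns(edges, start, goal, max_len=3):
    """Return the set of relation-label sequences on simple paths start -> goal.

    `edges` is a list of (source, relation, target) triples (hypothetical format).
    Only paths with at most `max_len` edges are considered.
    """
    graph = defaultdict(list)
    for source, relation, target in edges:
        graph[source].append((relation, target))

    patterns = set()

    def walk(node, visited, labels):
        # Reaching the goal closes a path; record its label sequence.
        if node == goal and labels:
            patterns.add(tuple(labels))
            return
        if len(labels) == max_len:
            return
        for relation, nxt in graph[node]:
            if nxt not in visited:  # keep paths simple (no repeated vertices)
                walk(nxt, visited | {nxt}, labels + [relation])

    walk(start, {start}, [])
    return patterns


# Toy data; every name here is invented for illustration.
edges = [
    ("drug_x", "binds", "protein_a"),
    ("protein_a", "associated_with", "disease_y"),
    ("drug_x", "mentioned_with", "disease_y"),
    ("protein_a", "part_of", "pathway_p"),
]
patterns = path_patterns(edges, "drug_x", "disease_y")
```

In a setting like the one the abstract describes, such label sequences could serve as features for a classifier that predicts whether a relation holds between the two end concepts; how the paper actually builds and scores its patterns is not stated here.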

Exploiting Semantics from Ontologies and Shared Annotations to Partition Linked Data

Linked Open Data initiatives have made available a diversity of collections that domain experts have annotated with controlled vocabulary terms from ontologies. We identify annotation signatures of linked data that associate semantically similar concepts, where similarity is measured in terms of shared annotations and ontological relatedness. Formally, an annotation signature is a partition or clustering of the links that represent the relationships between shared annotations. A clustering algorithm named AnnSigClustering is proposed to generate annotation signatures. Evaluation results over drug and disease datasets demonstrate the effectiveness of using annotation signatures to identify patterns among entities in the same cluster of a signature.
Guillermo Palma, Maria-Esther Vidal, Louiqa Raschid, Andreas Thor

ConQuR-Bio: Consensus Ranking with Query Reformulation for Biological Data

This paper introduces ConQuR-Bio which aims at assisting scientists when they query public biological databases. Various reformulations of the user query are generated using medical terminologies. Such alternative reformulations are then used to rank the query results using a new consensus ranking strategy. The originality of our approach thus lies in using consensus ranking techniques within the context of query reformulation. The ConQuR-Bio system is able to query the EntrezGene NCBI database. Our experiments demonstrate the benefit of using ConQuR-Bio compared to what is currently provided to users. ConQuR-Bio is available to the bioinformatics community at .
Bryan Brancotte, Bastien Rance, Alain Denise, Sarah Cohen-Boulakia
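
To make the idea of ranking by consensus over query reformulations more concrete, here is a minimal sketch that merges several result lists with a Borda count. The abstract does not say which consensus strategy ConQuR-Bio uses, so Borda is an assumed stand-in; the function name `borda_consensus` and the example result lists are hypothetical.

```python
from collections import defaultdict


def borda_consensus(rankings):
    """Merge ranked result lists into one ordering with a Borda count.

    Borda is an assumed stand-in, not the strategy described in the paper.
    Each list awards n points to its first item, n - 1 to its second, and so on,
    where n is the number of distinct items overall. Items absent from a list
    receive no points from it. Ties are broken alphabetically.
    """
    items = {item for ranking in rankings for item in ranking}
    n = len(items)
    scores = defaultdict(int)
    for ranking in rankings:
        for position, item in enumerate(ranking):
            scores[item] += n - position
    return sorted(items, key=lambda item: (-scores[item], item))


# One result list per reformulated query (illustrative values only).
rankings = [
    ["BRCA1", "BRCA2", "TP53"],
    ["BRCA2", "BRCA1"],
    ["BRCA1", "TP53", "BRCA2"],
]
consensus = borda_consensus(rankings)
```

With these toy lists the scores are 8, 6 and 3, so the merged order is BRCA1, BRCA2, TP53. Positional scoring is only one of several ways to aggregate rankings, and others treat ties and missing items differently.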

An Introduction to the Data Retrieval Facilities of the XQt Language for Scientific Data

Scientific data is stored in a wide variety of different formats. While much recent research and development have focused on specialized languages and tools to fulfill the requirements of specific domains or data structures, the need for more general technologies to enable data scientists to deal with various forms of data in a universal manner is growing. In this paper we describe data querying capabilities of the XQt language in order to show how it enables the users to author their processes in data source and format ignorant ways and to share and reuse their data, processes, and acquired skills. In addition, we describe the internals of the language, the execution pipeline, and the mapping between the domain level schemas and the physical structure of the data. The paper highlights the retrieval capabilities of XQt and illustrates some of its basic performance indicators.
Javad Chamanara, Birgitta König-Ries

