ONTOFUSION: Ontology-based integration of genomic and clinical databases

https://doi.org/10.1016/j.compbiomed.2005.02.004

Abstract

ONTOFUSION is an ontology-based system designed for biomedical database integration. It is based on two processes: mapping and unification. Mapping is a semi-automated process that uses ontologies to link a database schema with a conceptual framework, called a virtual schema. There are three methodologies for creating virtual schemas, according to the origin of the domain ontology used: (1) top-down (using an existing ontology, such as the UMLS or Gene Ontology), (2) bottom-up (building a new domain ontology) and (3) a hybrid combination of the two. Unification is an automated process for integrating ontologies and hence the databases to which they are linked. Using these methods, we employed ONTOFUSION to integrate a large number of public genomic and clinical databases, as well as biomedical ontologies.

Introduction

New technologies are being created to facilitate information search, access, retrieval and gathering from remote sources over the World Wide Web. In this scenario, developers are looking forward to the Semantic Web and related technologies that should facilitate information-related tasks in many areas. One such area is biomedicine, where collaborative efforts over the Web have led to significant scientific advances and accelerated efforts such as the Human Genome Project, among others. In this regard, research carried out during the last few decades has led to controlled vocabularies and taxonomies such as the UMLS [1], Gene Ontology [2], and others.

Ontologies provide the basis for the Semantic Web. Historically, the concept of ontology has a philosophical meaning, related to metaphysics. In informatics, ontologies provide a conceptual framework for modeling a knowledge domain. Considering medicine and biology, ontologies can contribute to bridging the gap between both fields by providing new conceptual frameworks. For instance, in the area of heterogeneous database integration, ontologies will provide the platform for sharing common vocabularies by modeling scientific domains. This exchange should prove fundamental in issues such as genomic medicine, where genomic and medical information will be jointly collected and analyzed to create new models of health care.

Biological and medical databases have traditionally been kept separate. Recent initiatives, such as the Iceland database [3], biobanks, and a clinical/genomic database under construction at the Mayo Clinic in Rochester, USA [4], aim to gather biological and medical information together. In this context, ontologies can be particularly helpful for providing integrated approaches to data collection and analysis.

In this paper, we describe a project carried out over the last few years with support from the European Commission. This project, called INFOGENMED, aimed to develop various methods and tools for database integration from remote sources, based on intelligent agents and ontologies. This paper focuses on the components of the system that are directly linked to ontologies. The system has been implemented and evaluated with biological and medical information. However, given its domain-independent features, the ONTOFUSION system can also be used in other application domains.

The paper is organized as follows. Section 2 gives background on existing database integration methods and ontologies, especially from the biomedical point of view. In Section 3, we present the ONTOFUSION approach to database integration. Section 4 describes the evaluation of the system and Section 5 provides some discussion. Finally, Section 6 gives some conclusions and directions for further research.

Background

Biomedical institutions are producing an increasing amount of data. Given this scenario, professionals are demanding new models and tools to search, store and analyze information. Since the development of the World Wide Web, collaborative efforts among remote institutions and researchers have increased the need for information exchange and distributed data processing. We provide below a description of recent research on database integration. Since the latest efforts on database integration

The ONTOFUSION approach to database integration

Database integration at a semantic level is a key issue for providing homogeneous access to clinical and genetic databases. The integration approach used in ONTOFUSION is based on two processes: mapping and unification. In the mapping process, the physical schema of each database is mapped to what we call a “virtual schema”. Virtual schemas are ontologies representing the structure of the information contained in a given database at a conceptual level. In the unification process, several
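The two processes described above can be illustrated with a minimal sketch. It is not the ONTOFUSION implementation: the class names (`Mapping`, `VirtualSchema`), the `unify` function, and the example table and column names are all hypothetical, and the sketch assumes that concepts with the same name in different virtual schemas denote the same concept.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Mapping:
    """Links one attribute of a virtual-schema concept to a physical column."""
    concept: str
    attribute: str
    table: str
    column: str


@dataclass
class VirtualSchema:
    """Conceptual view of a single database (hypothetical structure)."""
    name: str
    mappings: list = field(default_factory=list)


def unify(schemas):
    """Merge virtual schemas into one view.

    Assumption: concepts sharing a name are the same concept. The result
    maps concept -> attribute -> list of (schema name, table, column).
    """
    unified = {}
    for schema in schemas:
        for m in schema.mappings:
            sources = unified.setdefault(m.concept, {}).setdefault(m.attribute, [])
            sources.append((schema.name, m.table, m.column))
    return unified


# Mapping step: each physical schema is described by a virtual schema.
clinical = VirtualSchema("clinical_db", [
    Mapping("Patient", "id", "PATIENTS", "PAT_ID"),
    Mapping("Gene", "symbol", "TESTS", "GENE_SYM"),
])
genomic = VirtualSchema("genomic_db", [
    Mapping("Gene", "symbol", "genes", "hgnc_symbol"),
    Mapping("Gene", "chromosome", "genes", "chr"),
])

# Unification step: the two virtual schemas are merged into one.
unified = unify([clinical, genomic])
```

In this sketch, the unified view records that the `symbol` attribute of `Gene` can be retrieved from both databases, while `Patient` is only available from the clinical one.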

System evaluation

The system has been successfully tested with twenty databases:

  • Eight private databases containing biomedical information of various types and stored in database management systems, such as MySQL, PointBase, Access, and others.

  • Nine public databases: Ensembl, SwissProt, OMIM, Prosite, SNP, PDB, ENZYME, LocusLink, and InterPRO.

  • Three databases containing biomedical ontologies: UMLS, GO and HGNC.

Although ONTOFUSION is a research tool and needs additional refinement, results are promising. A large

Discussion

ONTOFUSION has been implemented using a multiagent architecture. User agents play the role of users in the system and virtual schema agents act as wrappers of physical or virtual (unified) databases. Fig. 6 shows an example of the agent messages involved when a query is received within the system. Virtual agents are connected following the unification hierarchy of the databases. When a user submits a query to the system, it is translated, divided and transferred by the virtual schema agents
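The way a query travels down a hierarchy of virtual schema agents can be sketched as follows. This is an illustrative simplification under stated assumptions, not the system's actual agent protocol: the `VirtualSchemaAgent` class, its `covers` and `query` methods, and the sample data are hypothetical, physical databases are modeled as in-memory dictionaries, and the query translation step is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualSchemaAgent:
    """Wrapper for a physical database or for a unified (virtual) one."""
    name: str
    # Leaf agents wrap a physical database, modeled here as concept -> rows.
    data: dict = field(default_factory=dict)
    # Unified agents delegate to the agents below them in the hierarchy.
    children: list = field(default_factory=list)

    def covers(self, concept):
        """Return True if this agent or any descendant holds the concept."""
        if concept in self.data:
            return True
        return any(child.covers(concept) for child in self.children)

    def query(self, concept):
        """Answer locally, forward to relevant children, and merge results."""
        results = [dict(row, source=self.name) for row in self.data.get(concept, [])]
        for child in self.children:
            if child.covers(concept):
                results.extend(child.query(concept))
        return results


# Two leaf agents wrapping (hypothetical) databases, unified under one agent.
ensembl = VirtualSchemaAgent("ensembl", data={"Gene": [{"symbol": "BRCA1"}]})
hospital = VirtualSchemaAgent(
    "hospital",
    data={"Patient": [{"id": 1}], "Gene": [{"symbol": "TP53"}]},
)
root = VirtualSchemaAgent("unified", children=[ensembl, hospital])

genes = root.query("Gene")        # forwarded to both leaf agents
patients = root.query("Patient")  # forwarded only to the hospital agent
```

Each result row is tagged with the agent that produced it, so the user agent can tell which source contributed which record.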

Conclusions

Modern database integration tools are moving towards ontology-based approaches. Our approach, ONTOFUSION, has followed this trend. Ontology-based systems offer the possibility of navigating through the ontology concepts and exploring their relationships. These approaches ease the understanding of these concepts and their underlying knowledge. This is especially important in fields such as biology and medicine, where the number of concepts is very large and new concepts are appearing all the

Summary

New technologies are being created to facilitate information search, access, retrieval and gathering from remote sources over the World Wide Web. In this paper, we describe ONTOFUSION, an approach to information integration that has been developed as part of a project carried out over the last few years with support from the European Commission. This project aimed to develop various methods and tools for integrating databases from heterogeneous sources, using intelligent agents and ontologies.

Acknowledgements

This research has been supported by funding from the EC INFOGENMED project, the INFOBIOMED Network of Excellence, the INBIOMED project, the Ministry of Health, Spain, and the Ministry of Science and Technology, Spain.

References (31)

  • T.R. Gruber, A translation approach to portable ontology specifications, Knowledge Acquisition (1993)
  • C. Lindberg, The Unified Medical Language System (UMLS) of the National Library of Medicine, J. Am. Med. Record Assoc. (1990)
  • The Gene Ontology Consortium, Gene ontology: tool for the unification of biology, Nat. Genet. 25 (2000)...
  • G.J. Annas, Rules for research on human genetic variation-lessons from Iceland, New England J. Med. (2000)
  • P.C. de Groen, A healthy database, IBM creating a system for millions of Mayo clinic patient files, in: Renee Berg...
  • R. Kimball et al., The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (2002)
  • A.S. Lopatenko, Information retrieval in current research information systems, Workshop on Knowledge Markup and...
  • G. Wiederhold, Mediators in the architecture of future information systems, IEEE Comput. (1992)
  • S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, J. Widom, The TSIMMIS project:...
  • Y. Arens et al., Query processing in the SIMS information mediator
  • C.A. Knoblock et al., The Ariadne approach to Web-based information integration, Int. J. Cooperative Inform. Syst. (2001)
  • M.C. Shan et al., Pegasus: a heterogeneous information management system
  • M.J. Carey, L.M. Haas, P.M. Schwarz, M. Arya, W.F. Cody, R. Fagin, M. Flickner, A.W. Luniewski, W. Niblack, D....
  • L.M. Haas et al., Discoverylink: a system for integrated access to life sciences data sources, IBM Syst. J. (2001)
  • P.G. Baker et al., TAMBIS: transparent access to multiple bioinformatics information sources, Bioinformatics (2000)

    David Pérez del Rey is a research assistant at the Biomedical Informatics Group at the Polytechnical University of Madrid (Spain). He received a B.S. in Computer Science from the Complutense University of Madrid, including a year at the University of Southampton as a visiting student. He is currently finishing his Ph.D. thesis on Ontology-based KDD process for biomedical information. His research interests include data integration, data mining, KDD and the Semantic Web. Contact him at the School of Computer Science, Universidad Politécnica de Madrid, 28660 Boadilla del Monte, Madrid (Spain); [email protected]

    Victor Maojo received his MD degree from the University of Oviedo (Spain) in 1985 and his Ph.D. in Computer Science from the Universidad Politécnica de Madrid (UPM) in 1990. At the UPM, he is currently an associate professor and associate director of the Artificial Intelligence Lab. Before joining the faculty of the UPM, he was a postdoctoral researcher and consultant at Georgia Tech (Atlanta, USA, 1990–1991), and a research fellow at the Decision Systems Group (Harvard University-MIT, Boston, USA, 1991–1993). He has been the principal investigator in more than 20 national and international projects and has authored more than 100 scientific papers and books. He has been a member of numerous committees at international conferences and journals and served as an expert for the IV and V Framework Programmes of the European Commission.

    Miguel García Remesal is a research assistant at the Biomedical Informatics Group at the Universidad Politécnica de Madrid (Spain). He received a B.S. in Computer Science from the Universidad Politécnica de Madrid. He is currently finishing his Ph.D. thesis on Ontology-based Information Retrieval for biomedical information resources. His research interests include information retrieval, text mining, and the Semantic Web. Contact him at the School of Computer Science, Polytechnical University of Madrid, 28660 Boadilla del Monte, Madrid (Spain); [email protected]

    Raúl Alonso Calvo is a research assistant at the Biomedical Informatics Group at the Universidad Politécnica de Madrid (Spain). He received a B.S. in Computer Science from the Universidad Politécnica de Madrid. He is currently finishing his Ph.D. thesis on Content-Based Image Retrieval and Ontology-based Information Retrieval for biomedical information resources. His research interests include image analysis, information retrieval, and mathematical morphology. Contact him at the School of Computer Science, Universidad Politécnica de Madrid, 28660 Boadilla del Monte, Madrid (Spain); [email protected]

    Holger Billhardt received his M.Sc. in Computer Science from the Technical University of Leipzig, Germany, in 1994. He worked from 1997 to 2001 as a research fellow at the Medical Informatics Group at the Universidad Politécnica de Madrid, Spain, where he received his Ph.D. in Computer Science in 2003. Dr. Billhardt is currently an Associate Professor at the Department of Informatics, Statistics and Telematics at the University Rey Juan Carlos of Madrid. His research interests include information retrieval, the use of multiagent systems for information access and retrieval, and their applications in the field of biomedicine.

    Fernando Martin Sanchez earned his bachelor's degree in Biochemistry and Molecular Biology in 1986 from the Autonomous University of Madrid and received an M.Sc. in Knowledge Engineering in 1987 and a Ph.D. in Computer Science in 1990 from the Polytechnic University of Madrid. He was a postdoctoral fellow at the Emory University Hospital-Georgia Institute of Technology Joint Research Program in Biomedical Informatics. Dr. Martin-Sanchez currently serves as Head of the Medical Bioinformatics Department of the National Institute of Health “Carlos III” of Spain, where he leads a multidisciplinary research team focused on Biomedical Informatics and microarray applications in genomic medicine. He regularly teaches on these subjects in public health schools, universities and hospitals.

    Antonio Sousa Pereira received a degree in Electrical Engineering from the University of Porto and a Ph.D. in Electrical Engineering from the University of Aveiro, where he is currently a full professor. He is Director of IEETA, an R&D institute, and Coordinator of the Information Systems and Telematics Lab. His main research interests are in telematics in healthcare and biomedical informatics.
