ONTOFUSION: Ontology-based integration of genomic and clinical databases

doi:10.1016/j.compbiomed.2005.02.004

Computers in Biology and Medicine

Volume 36, Issues 7–8, July–August 2006, Pages 712-730

https://doi.org/10.1016/j.compbiomed.2005.02.004

Abstract

ONTOFUSION is an ontology-based system designed for biomedical database integration. It is based on two processes: mapping and unification. Mapping is a semi-automated process that uses ontologies to link a database schema with a conceptual framework called a virtual schema. There are three methodologies for creating virtual schemas, according to the origin of the domain ontology used: (1) top-down (e.g. using an existing ontology, such as the UMLS or Gene Ontology), (2) bottom-up (building a new domain ontology) and (3) a hybrid combination of the two. Unification is an automated process for integrating ontologies and hence the databases to which they are linked. Using these methods, we employed ONTOFUSION to integrate a large number of public genomic and clinical databases, as well as biomedical ontologies.

Introduction

New technologies are being created to facilitate information search, access, retrieval and gathering from remote sources over the World Wide Web. In this scenario, developers are looking forward to the Semantic Web and related technologies that should facilitate information-related tasks in many areas. One such area is biomedicine, where collaborative efforts over the Web have led to significant scientific advances and accelerated efforts such as the Human Genome Project, among others. In this regard, research carried out during the last few decades has led to controlled vocabularies and taxonomies such as the UMLS [1], Gene Ontology [2], and others.

Ontologies provide the basis for the Semantic Web. Historically, the concept of ontology has a philosophical meaning, related to metaphysics. In informatics, ontologies provide a conceptual framework for modeling a knowledge domain. Considering medicine and biology, ontologies can contribute to bridging the gap between both fields by providing new conceptual frameworks. For instance, in the area of heterogeneous database integration, ontologies will provide the platform for sharing common vocabularies by modeling scientific domains. This exchange should prove fundamental in issues such as genomic medicine, where genomic and medical information will be jointly collected and analyzed to create new models of health care.

Biological and medical databases have traditionally been separate. Recent initiatives, such as the Iceland database [3], biobanks, and a clinical/genomic database under construction at the Mayo Clinic in Rochester, USA [4], aim to gather biological and medical information together. In this sense, ontologies can be particularly helpful for providing integrated approaches to data collection and analysis.

In this paper, we describe a project carried out over the last few years with support from the European Commission. This project, called INFOGENMED, aimed to develop various methods and tools for integrating databases from remote sources, based on intelligent agents and ontologies. This paper focuses on the components of the system that are directly linked to ontologies. The system has been implemented and evaluated with biological and medical information. However, given its domain-independent features, the ONTOFUSION system can also be used in other application domains.

The paper is organized as follows. Section 2 gives background on existing database integration methods and ontologies, especially from the biomedical point of view. In Section 3, we present the ONTOFUSION approach to database integration. Section 4 describes the evaluation of the system and Section 5 provides some discussion. Finally, Section 6 gives some conclusions and directions for further research.

Section snippets

Background

Biomedical institutions are producing an increasing amount of data. Given this scenario, professionals are demanding new models and tools to search, store and analyze information. Since the development of the World Wide Web, collaborative efforts among remote institutions and researchers have increased the need for information exchange and distributed data processing. We provide below a description of recent research on database integration. Since the latest efforts on database integration

The ONTOFUSION approach to database integration

Database integration at a semantic level is a key issue for providing homogeneous access to clinical and genetic databases. The integration approach used in ONTOFUSION is based on two processes: mapping and unification. In the mapping process, the physical schema of each database is mapped to what we call a “virtual schema”. Virtual schemas are ontologies representing the structure of the information contained in a given database at a conceptual level. In the unification process, several
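The two processes can be illustrated with a minimal Python sketch. It is not the ONTOFUSION implementation: the class names, the database, table and column names, and the rule that two concepts are the same when their names are equal are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ColumnRef:
    """Location of a value in a physical database (hypothetical layout)."""
    database: str
    table: str
    column: str


@dataclass
class VirtualSchema:
    """Conceptual view: ontology concept -> physical columns that hold it."""
    name: str
    concepts: Dict[str, List[ColumnRef]] = field(default_factory=dict)


def map_database(database: str, links: Dict[str, Tuple[str, str]]) -> VirtualSchema:
    """Mapping step: link each ontology concept to one (table, column) pair."""
    schema = VirtualSchema(name=database)
    for concept, (table, column) in links.items():
        schema.concepts[concept] = [ColumnRef(database, table, column)]
    return schema


def unify(name: str, schemas: List[VirtualSchema]) -> VirtualSchema:
    """Unification step: merge schemas; equal concept names are assumed to match."""
    unified = VirtualSchema(name=name)
    for schema in schemas:
        for concept, refs in schema.concepts.items():
            unified.concepts.setdefault(concept, []).extend(refs)
    return unified


# Made-up example databases, used only to exercise the sketch.
clinical = map_database("clinic_db", {"Patient": ("patients", "id"),
                                      "Disease": ("diagnoses", "code")})
genomic = map_database("gene_db", {"Gene": ("genes", "symbol"),
                                   "Disease": ("gene_disease", "disease_id")})
merged = unify("clinical_genomic", [clinical, genomic])
```

In this sketch the shared concept `Disease` ends up pointing at columns in both databases, which is what lets a single conceptual query reach both sources. In the approach described above the mapping step is semi-automated, whereas here the links are simply written by hand.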

System evaluation

The system has been successfully tested with twenty databases:

• Eight private databases containing biomedical information of various types and stored in database management systems, such as MySQL, PointBase, Access, and others.
• Nine public databases: Ensembl, SwissProt, OMIM, Prosite, SNP, PDB, ENZYME, LocusLink, and InterPRO.
• Three databases containing biomedical ontologies: UMLS, GO and HGNC.

Although ONTOFUSION is a research tool and needs additional refinement, results are promising. A large

Discussion

ONTOFUSION has been implemented using a multiagent architecture. User agents play the role of users in the system and virtual schema agents act as wrappers of physical or virtual (unified) databases. Fig. 6 shows an example of the agent messages involved when a query is received within the system. Virtual agents are connected following the unification hierarchy of the databases. When a user submits a query to the system, it is translated, divided and transferred by the virtual schema agents
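As a rough illustration of how a query might be divided and passed down a unification hierarchy, the following Python sketch models each virtual schema agent as a node in a tree. The class, its methods and the sample records are hypothetical. The sketch covers only the dividing and forwarding of a query, not the translation step or the agent messaging of the real system.

```python
class VirtualSchemaAgent:
    """Hypothetical wrapper: a leaf holds records, an inner node unifies children."""

    def __init__(self, name, rows=None, children=None):
        self.name = name
        self.rows = rows or []          # leaf data: list of {concept: value} records
        self.children = children or []  # agents of the schemas unified at this node

    def concepts(self):
        """Concepts this agent can answer, from its own rows and its children."""
        known = {concept for row in self.rows for concept in row}
        for child in self.children:
            known |= child.concepts()
        return known

    def query(self, wanted):
        """Answer locally, then forward to each child only the part it can answer."""
        results = [
            {"source": self.name, **{c: row[c] for c in wanted if c in row}}
            for row in self.rows
            if any(c in row for c in wanted)
        ]
        for child in self.children:
            known = child.concepts()
            sub_query = [c for c in wanted if c in known]
            if sub_query:
                results.extend(child.query(sub_query))
        return results


# Made-up records, used only to exercise the sketch.
clinic = VirtualSchemaAgent("clinic_db", rows=[{"Patient": "p1", "Disease": "d1"}])
genes = VirtualSchemaAgent("gene_db", rows=[{"Gene": "g1", "Disease": "d1"}])
root = VirtualSchemaAgent("unified", children=[clinic, genes])
answer = root.query(["Gene", "Disease"])
```

Here the root agent holds no data of its own: it sends `Disease` to the clinical agent and both `Gene` and `Disease` to the genomic agent, then concatenates the partial answers.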

Conclusions

Modern database integration tools are moving towards ontology-based approaches. Our approach, ONTOFUSION, has followed this trend. Ontology-based systems offer the possibility of navigating through the ontology concepts and exploring their relationships. These approaches ease the understanding of these concepts and their underlying knowledge. This is especially important in fields such as biology and medicine, where the number of concepts is very large and new concepts are appearing all the

Summary

New technologies are being created to facilitate information search, access, retrieval and gathering from remote sources over the World Wide Web. In this paper, we describe ONTOFUSION, an approach to information integration that has been developed as part of a project carried out over the last few years with support from the European Commission. This project aimed to develop various methods and tools for integrating databases from heterogeneous sources, using intelligent agents and ontologies.

Acknowledgements

This research has been supported by funding from the EC INFOGENMED project and the INFOBIOMED Network of Excellence, the INBIOMED project, Ministry of Health, Spain, and the Ministry of Science and Technology, Spain.

David Pérez del Rey is a research assistant at the Biomedical Informatics Group at the Polytechnical University of Madrid (Spain). He received a B.S. in Computer Science from the Complutense University of Madrid, including a year at the University of Southampton as a visiting student. He is currently finishing his Ph.D. thesis on Ontology-based KDD process for biomedical information. His research interests include data integration, data mining, KDD and the Semantic Web. Contact him at the School

References (31)

T.R. Gruber, A translation approach to portable ontology specifications, Knowledge Acquisition (1993)
C. Lindberg, The Unified Medical Language System (UMLS) of the National Library of Medicine, J. Am. Med. Record Assoc. (1990)
The Gene Ontology Consortium, Gene ontology: tool for the unification of biology, Nat. Genet. 25 (2000)...
G.J. Annas, Rules for research on human genetic variation-lessons from Iceland, New England J. Med. (2000)
P.C. de Groen, A healthy database, IBM creating a system for millions of Mayo clinic patient files, in: Renee Berg...
R. Kimball et al., The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (2002)
A.S. Lopatenko, Information retrieval in current research information systems, Workshop on Knowledge Markup and...
G. Wiederhold, Mediators in the architecture of future information systems, IEEE Comput. (1992)
S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, J. Widom, The TSIMMIS project:...
Y. Arens et al., Query processing in the SIMS information mediator
C.A. Knoblock et al., The Ariadne approach to Web-based information integration, Int. J. Cooperative Inform. Syst. (2001)
M.C. Shan et al., Pegasus: a heterogeneous information management system
M.J. Carey, L.M. Haas, P.M. Schwarz, M. Arya, W.F. Cody, R. Fagin, M. Flickner, A.W. Luniewski, W. Niblack, D....
L.M. Haas et al., Discoverylink: a system for integrated access to life sciences data sources, IBM Syst. J. (2001)
P.G. Baker et al., TAMBIS: transparent access to multiple bioinformatics information sources, Bioinformatics (2000)

Victor Maojo received his MD degree from the University of Oviedo (Spain) in 1985 and his Ph.D. in Computer Science from the Universidad Politécnica de Madrid (UPM) in 1990. At the UPM, he is currently an associate professor and associate director of the Artificial Intelligence Lab. Before joining the faculty of the UPM, he was a postdoctoral researcher and consultant at Georgia Tech (Atlanta, USA, 1990–1991), and a research fellow at the Decision Systems Group (Harvard University-MIT, Boston, USA, 1991–1993). He has been the principal investigator in more than 20 national and international projects and has authored more than 100 scientific papers and books. He has been a member of numerous committees at international conferences and journals and served as an expert for the IV and V Framework Programmes of the European Commission.

Miguel García Remesal is a research assistant at the Biomedical Informatics Group at the Universidad Politécnica de Madrid (Spain). He received a B.S. in Computer Science from the Universidad Politécnica de Madrid. He is currently finishing his Ph.D. thesis on Ontology-based Information Retrieval for biomedical information resources. His research interests include information retrieval, text mining, and the Semantic Web. Contact him at the School of Computer Science, Polytechnical University of Madrid, 28660 Boadilla del Monte, Madrid (Spain); [email protected]

Raúl Alonso Calvo is a research assistant at the Biomedical Informatics Group at the Universidad Politécnica de Madrid (Spain). He received a B.S. in Computer Science from the Universidad Politécnica de Madrid. He is currently finishing his Ph.D. thesis on Content-Based Image Retrieval and Ontology-based Information Retrieval for biomedical information resources. His research interests include image analysis, information retrieval, and mathematical morphology. Contact him at the School of Computer Science, Universidad Politécnica de Madrid, 28660 Boadilla del Monte, Madrid (Spain); [email protected]

Holger Billhardt received his M.Sc. in Computer Science from the Technical University of Leipzig, Germany, in 1994. From 1997 to 2001 he worked as a research fellow at the Medical Informatics Group at the Universidad Politécnica de Madrid, Spain, where he received his Ph.D. in Computer Science in 2003. Dr. Billhardt is currently an Associate Professor at the Department of Informatics, Statistics and Telematics at the University Rey Juan Carlos of Madrid. His research interests include information retrieval, the use of multiagent systems for information access and retrieval, and its applications in the field of biomedicine.

Fernando Martin Sanchez earned his bachelor's degree in Biochemistry and Molecular Biology in 1986 from the Autonomous University of Madrid and received an M.Sc. in Knowledge Engineering in 1987 and a Ph.D. in Computer Science in 1990 from the Polytechnic University of Madrid. He was a postdoctoral fellow at the Emory University Hospital-Georgia Institute of Technology Joint Research Program in Biomedical Informatics. Dr. Martin-Sanchez currently serves as Head of the Medical Bioinformatics Department of the National Institute of Health “Carlos III” of Spain, where he leads a multidisciplinary research team focused on Biomedical Informatics and microarray applications in genomic medicine. He regularly teaches on these subjects in public health schools, universities and hospitals.

Antonio Sousa Pereira received a degree in Electrical Engineering from the University of Porto and a Ph.D. in Electrical Engineering from the University of Aveiro, where he is currently a full professor. He is Director of IEETA, an R&D institute, and Coordinator of the Information Systems and Telematics Lab. His main research interests are telematics in healthcare and biomedical informatics.
