Using literature-based discovery to identify disease candidate genes

https://doi.org/10.1016/j.ijmedinf.2004.04.024

Summary

We present BITOLA, an interactive literature-based biomedical discovery support system. The goal of this system is to discover new, potentially meaningful relations between a given starting concept of interest and other concepts, by mining the bibliographic database MEDLINE®. To make the system more suitable for disease candidate gene discovery and to decrease the number of candidate relations, we integrate background knowledge about the chromosomal location of the starting disease as well as the chromosomal location of the candidate genes from resources such as LocusLink and Human Genome Organization (HUGO). BITOLA can also be used as an alternative way of searching the MEDLINE database. The system is available at http://www.mf.uni-lj.si/bitola/.

Introduction

With the rapidly growing body of scientific knowledge and increasing specialization, it is possible that the research of one group might solve an important problem of another, without the two groups being aware of each other's work. The goal of literature-based discovery is to address this situation by uncovering new, potentially meaningful relations between a starting concept of interest and other concepts. In the field of biomedicine, a great deal of knowledge is recorded at least in secondary form in bibliographic databases such as MEDLINE as well as various specialized molecular biology databases. These resources provide both an opportunity and a need for developing advanced methods and tools for computer-supported knowledge discovery. For example, we might devise a system that looks for genes that cause a particular disease or for drugs that treat that disease.

In this paper we present BITOLA, an interactive discovery support system for biomedicine, aimed in particular at discovering candidate genes that stand in etiological relationships with diseases. The system can be used as a research idea generator or as an alternative method of searching MEDLINE (presenting a summary first, then the details).

The number of genes known to be causal agents of human diseases is growing rapidly as a by-product of the Human Genome Project. Currently, a query to LocusLink reveals over 1600 genes that cause over 2000 diseases, with literature references that document the evidence for each disease–gene association. Yet for a significant number of diseases known to have a genetic cause, the exact link to a specific gene has not been made. A query to LocusLink shows over 765 genes known for phenotype only. In these cases, the genes are known because of a specific disease phenotype, usually with a chromosomal region narrowed by linkage studies in families with the disease. These genes represent diseases that are potential candidates for the techniques described in this paper and for the BITOLA system.

Section snippets

Background

The idea of discovering new relations from a bibliographic database was introduced by Swanson [1] who, together with Smalheiser, made seven medical discoveries that have been published in relevant medical journals. The main idea is first to find all the concepts Y related to the starting concept X (e.g. if X is a disease then Y might be pathological functions, symptoms, etc.). Then all the concepts Z related to Y are found (e.g. if Y is a pathological function, Z might be a molecule,
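The X, Y, Z search described above can be sketched in a few lines of Python. This is a minimal illustration, not the BITOLA implementation: it assumes that "related" simply means co-occurring in the same citation, that candidates Z already directly related to X are discarded as known, and the function names and the toy citations are invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(citations):
    """Map each concept to the set of concepts it co-occurs with in any citation."""
    related = defaultdict(set)
    for concepts in citations:
        for a, b in combinations(sorted(set(concepts)), 2):
            related[a].add(b)
            related[b].add(a)
    return related


def open_discovery(related, x):
    """Return {Z: sorted linking concepts Y} for Z reachable via Y but not directly related to X."""
    candidates = defaultdict(set)
    for y in related[x]:
        for z in related[y]:
            # Assumption: a Z already co-occurring with X is treated as known and skipped.
            if z != x and z not in related[x]:
                candidates[z].add(y)
    return {z: sorted(ys) for z, ys in candidates.items()}


# Toy citations (illustrative only), each given as its set of indexed concepts.
citations = [
    {"Raynaud Disease", "Blood Viscosity"},
    {"Raynaud Disease", "Platelet Aggregation"},
    {"Blood Viscosity", "Fish Oils"},
    {"Platelet Aggregation", "Fish Oils"},
]
related = build_cooccurrence(citations)
hypotheses = open_discovery(related, "Raynaud Disease")
print(hypotheses)
```

In this toy data "Fish Oils" never appears together with "Raynaud Disease", so it surfaces as a candidate Z, supported by the two intermediate concepts Y that link it to the starting concept.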

Materials

The major database in our approach is MEDLINE, created at the National Library of Medicine® (NLM®). Each citation is associated with a set of MeSH terms that describe the content of the associated article. MeSH is a controlled vocabulary and thesaurus used for indexing articles and for searching MeSH-indexed databases, in particular, MEDLINE.

The Unified Medical Language System® (UMLS®) project that NLM began in 1986 was undertaken in order to provide a mechanism for linking diverse medical

System overview

In Fig. 1 we can see a conceptual overview of the BITOLA system and its components. From a set of input databases, using a knowledge extraction process, we build the knowledge base of the system. The input databases represent the known biomedical knowledge. We represent this knowledge in the knowledge base in a formal form as a set of biomedical concepts, association rules between these concepts and additional background knowledge about the concepts. The association rules (based on concept
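To make the notion of association rules between concepts more concrete, the sketch below derives rules X → Y from citations annotated with concept sets. It is an assumption-laden illustration rather than the system's actual knowledge extraction process: it takes support to be the number of citations containing both concepts and confidence to be that count divided by the number of citations containing X, and the function name, threshold parameter and sample concepts are hypothetical.

```python
from collections import Counter
from itertools import permutations


def mine_rules(citations, min_support=1):
    """Return {(x, y): {"support": n, "confidence": c}} for co-occurring concept pairs."""
    concept_count = Counter()
    pair_count = Counter()
    for concepts in citations:
        unique = set(concepts)
        concept_count.update(unique)
        # Ordered pairs, because the rule X -> Y differs from Y -> X in confidence.
        pair_count.update(permutations(unique, 2))
    rules = {}
    for (x, y), support in pair_count.items():
        if support >= min_support:
            rules[(x, y)] = {
                "support": support,
                "confidence": support / concept_count[x],
            }
    return rules


# Hypothetical concept annotations for three citations.
citations = [
    {"Disease A", "Function B"},
    {"Disease A", "Function B", "Gene C"},
    {"Disease A"},
]
rules = mine_rules(citations, min_support=2)
for (x, y), measures in sorted(rules.items()):
    print(x, "->", y, measures)
```

Raising `min_support` is one simple way to prune weak rules before they are stored in a knowledge base.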

Results and an example

Although BITOLA can be used for biomedical discovery support in general, we think it is especially useful for finding new relations between diseases and genes. This is a consequence of the integration of background knowledge and is a unique feature not present in other literature-based discovery support systems. There are several new possible application scenarios of our system.

In one scenario, we might start with a genetic disease for which the global chromosomal region is known, but not the

Discussion

We would like to highlight in this section some of the terminology problems we faced during the development of BITOLA. MeSH is the primary source of concepts in our literature-based discovery approach. However, MeSH does not contain many specific genetic diseases and is struggling to add specific MeSH terms for the human genes. Consequently, we were forced to detect and extract the gene symbols from the title and abstract MEDLINE fields. We also propose a solution for one of the major problems

Conclusion

We present an interactive literature-based biomedical discovery support system (BITOLA). The system can be used as a research idea generator or as an alternative method of searching MEDLINE. To decrease the number of candidate relations and to make the system more suitable for disease candidate gene discovery, we include genetic knowledge about the chromosomal location of the starting disease as well as the chromosomal location of the candidate genes.
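The effect of the chromosomal location constraint can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the matching logic used in BITOLA: locations are assumed to be single cytogenetic positions such as `4q35` or `17q21.31` (ranges like `4q21-q25` are not handled), a gene is kept when its position lies within the disease region by hierarchical prefix match, and the gene symbols and function names are hypothetical.

```python
import re

# Chromosome (1-22, X or Y), optional arm (p or q), optional band such as "21.31".
_LOCATION = re.compile(r"^(\d{1,2}|X|Y)([pq])?(\d+(?:\.\d+)?)?$")


def parse_location(text):
    """Split a cytogenetic location into (chromosome, arm, band); missing parts are ''."""
    match = _LOCATION.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported location format: {text!r}")
    chrom, arm, band = match.groups()
    return chrom, arm or "", band or ""


def in_region(gene_location, disease_region):
    """True if the gene location falls inside the (possibly coarser) disease region."""
    g_chrom, g_arm, g_band = parse_location(gene_location)
    r_chrom, r_arm, r_band = parse_location(disease_region)
    if g_chrom != r_chrom:
        return False
    if r_arm and g_arm != r_arm:
        return False
    # Band notation is hierarchical, so a prefix match means "within this band".
    return g_band.startswith(r_band)


def filter_candidates(candidates, disease_region):
    """Keep only the gene symbols whose location matches the disease region."""
    return [symbol for symbol, loc in candidates if in_region(loc, disease_region)]


# Hypothetical candidate genes with their chromosomal locations.
candidates = [
    ("GENE_A", "4q35"),
    ("GENE_B", "4p16.3"),
    ("GENE_C", "17q21.31"),
    ("GENE_D", "4q35.1"),
]
kept = filter_candidates(candidates, "4q")
print(kept)
```

Parsing the chromosome separately, instead of comparing raw strings, avoids false matches such as a region on chromosome 1 matching a gene on chromosome 17.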

Acknowledgments

Part of this research was conducted while Dimitar Hristovski was a postdoctoral fellow at the National Library of Medicine (NLM). Dimitar Hristovski thanks NLM and the ORISE program for support. In addition, we would like to thank Tom Rindflesch and Neil Smalheiser for providing valuable comments and insights.

References (25)

  • H.M. Wain et al. Guidelines for human gene nomenclature. Genomics (2002)
  • D.R. Swanson. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect. Biol. Med. (1986)
  • M.D. Gordon et al. Toward discovery support systems: a replication, re-examination, and extension of Swanson's work on literature-based discovery of a connection between Raynaud's and fish oil. J. Am. Soc. Inf. Sci. (1996)
  • M. Weeber et al. Using concepts in literature-based discovery: simulating Swanson's Raynaud-fish oil and migraine-magnesium discoveries. J. Am. Soc. Inf. Sci. Technol. (2001)
  • M. Weeber et al. Generating hypotheses by discovering implicit associations in the literature: a case report of a search for new potential therapeutic uses for thalidomide. J. Am. Med. Inform. Assoc. (2003)
  • P. Srinivasan. Text mining: generating hypotheses from MEDLINE. J. Am. Soc. Inf. Sci. Technol. (2004)
  • D. Hristovski et al. Supporting discovery in medicine by association rule mining in MEDLINE and UMLS. Medinfo (2001)
  • T.K. Jenssen et al. A literature network of human genes for high-throughput analysis of gene expression. Nat. Genet. (2001)
  • B.J. Stapley et al. Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in MEDLINE abstracts. Pac. Symp. Biocomput. (2000)
  • M. Stephens et al. Detecting gene relations from MEDLINE abstracts. Pac. Symp. Biocomput. (2001)
  • T. Sekimizu et al. Identifying the interaction between genes and gene products based on frequently seen verbs in MEDLINE abstracts
  • T.C. Rindflesch et al. Semantic relations asserting the etiology of genetic diseases
