Using literature-based discovery to identify disease candidate genes
Introduction
With the rapidly growing body of scientific knowledge and increasing specialization, it is possible that the research of one group might solve an important problem of another, without the two groups being aware of each other's work. The goal of literature-based discovery is to address this situation by uncovering new, potentially meaningful relations between a starting concept of interest and other concepts. In the field of biomedicine, a great deal of knowledge is recorded at least in secondary form in bibliographic databases such as MEDLINE as well as various specialized molecular biology databases. These resources provide both an opportunity and a need for developing advanced methods and tools for computer-supported knowledge discovery. For example, we might devise a system that looks for genes that cause a particular disease or for drugs that treat that disease.
In this paper we present BITOLA, an interactive discovery support system for biomedicine, aimed in particular at discovering candidate genes that may be etiologically related to diseases. The system can be used as a research idea generator or as an alternative way of searching MEDLINE (first a summary, then the details).
The number of genes known to be causal agents of human diseases is growing rapidly as a by-product of the Human Genome Project. Currently, a query to LocusLink returns over 1600 genes that cause over 2000 diseases, together with literature references that document the evidence for each disease–gene association. Yet for a significant number of diseases that are known to have a genetic cause, the link to a specific gene has not been established. A query to LocusLink shows over 765 genes known by phenotype only. In these cases, the gene is inferred from a specific disease phenotype, usually with a chromosomal region narrowed down by linkage studies in families with the disease. These phenotype-only entries represent diseases that are potential candidates for the techniques described in this paper and for the BITOLA system.
Background
The idea of discovering new relations from a bibliographic database was introduced by Swanson [1] who, together with Smalheiser, made seven medical discoveries that have been published in relevant medical journals. The main idea is first to find all the concepts Y related to the starting concept X (e.g. if X is a disease then Y might be pathological functions, symptoms, etc.). Then all the concepts Z related to Y are found (e.g. if Y is a pathological function, Z might be a molecule,
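The two-step search described above (from X to the related concepts Y, then from each Y to the concepts Z) can be illustrated with a minimal sketch. The function name `open_discovery`, the `related` mapping, and the toy data are assumptions made for illustration only; they are not taken from BITOLA, and the concept names merely echo the Raynaud and fish oil example cited in the references.

```python
from collections import defaultdict


def open_discovery(x, related):
    """Return candidate concepts Z for a starting concept X.

    `related` maps each concept to the set of concepts it is already
    known to be related to (hypothetical input format).
    The result maps each candidate Z to the set of intermediate
    concepts Y that connect it to X.
    """
    known = related.get(x, set())          # step 1: all Y related to X
    candidates = defaultdict(set)
    for y in known:
        for z in related.get(y, set()):    # step 2: all Z related to Y
            # Keep only Z that is not X itself and not already linked to X.
            if z != x and z not in known:
                candidates[z].add(y)
    return dict(candidates)


# Toy data, invented for illustration.
related = {
    "Raynaud disease": {"blood viscosity", "platelet aggregation"},
    "blood viscosity": {"Raynaud disease", "fish oil"},
    "platelet aggregation": {"Raynaud disease", "fish oil", "aspirin"},
}

result = open_discovery("Raynaud disease", related)
# Candidates supported by more intermediate concepts are listed first.
ranked = sorted(result, key=lambda z: len(result[z]), reverse=True)
```

Ranking by the number of linking concepts is only one possible ordering; any other relevance measure could be substituted.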
Materials
The major database in our approach is MEDLINE, created at the National Library of Medicine® (NLM®). Each citation is associated with a set of MeSH terms that describe the content of the associated article. MeSH is a controlled vocabulary and thesaurus used for indexing articles and for searching MeSH-indexed databases, in particular, MEDLINE.
The Unified Medical Language System® (UMLS®) project that NLM began in 1986 was undertaken in order to provide a mechanism for linking diverse medical
System overview
In Fig. 1 we can see a conceptual overview of the BITOLA system and its components. From a set of input databases, using a knowledge extraction process, we build the knowledge base of the system. The input databases represent the known biomedical knowledge. We represent this knowledge in the knowledge base in a formal form as a set of biomedical concepts, association rules between these concepts and additional background knowledge about the concepts. The association rules (based on concept
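To make the notion of association rules between concepts more concrete, the following sketch derives rules from sets of concepts attached to citations and scores them with the standard support and confidence measures. It assumes, purely for illustration, that a rule X → Y is counted whenever both concepts are attached to the same citation; the function `mine_rules`, its thresholds, and the toy citations are hypothetical and do not describe the measures actually used in BITOLA.

```python
from collections import Counter
from itertools import combinations


def mine_rules(citations, min_support=2, min_confidence=0.5):
    """Derive rules (x, y, support, confidence) from citation concept sets.

    support    = number of citations containing both x and y
    confidence = support / number of citations containing x
    """
    term_count = Counter()
    pair_count = Counter()
    for terms in citations:
        unique = sorted(set(terms))
        term_count.update(unique)
        pair_count.update(combinations(unique, 2))

    rules = []
    for (a, b), support in pair_count.items():
        if support < min_support:
            continue
        # A co-occurring pair yields a rule in each direction.
        for x, y in ((a, b), (b, a)):
            confidence = support / term_count[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return rules


# Toy citations, invented for illustration.
citations = [
    {"Disease A", "Symptom S"},
    {"Disease A", "Symptom S", "Gene G"},
    {"Symptom S", "Gene G"},
    {"Disease A"},
]

rules = {(x, y): (s, c) for x, y, s, c in mine_rules(citations)}
```

In this toy set, "Disease A" and "Gene G" co-occur only once and therefore fall below the support threshold, whereas both are linked to "Symptom S".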
Results and an example
Although BITOLA can be used for biomedical discovery support in general, we think it is especially useful for finding new relations between diseases and genes. This is a consequence of the integration of background knowledge and is a unique feature not present in other literature-based discovery support systems. There are several new possible application scenarios of our system.
In one scenario, we might start with a genetic disease for which the global chromosomal region is known, but not the
Discussion
We would like to highlight in this section some of the terminology problems we faced during the development of BITOLA. MeSH is the primary source of concepts in our literature-based discovery approach. However, MeSH does not contain many specific genetic diseases and is struggling to add specific MeSH terms for the human genes. Consequently, we were forced to detect and extract the gene symbols from the title and abstract MEDLINE fields. We also propose a solution for one of the major problems
Conclusion
We present an interactive literature-based biomedical discovery support system (BITOLA). The system can be used as a research idea generator or as an alternative method of searching MEDLINE. To decrease the number of candidate relations and to make the system more suitable for disease candidate gene discovery, we include genetic knowledge about the chromosomal location of the starting disease as well as the chromosomal location of the candidate genes.
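The reduction of candidate relations by chromosomal location can be sketched as a simple filter. The example assumes that locations are given as single cytogenetic designations such as `4q21` or `4q21.3`; ranges (for example `4q21-q23`) are not handled and are treated as non-matching. The helper names `parse_location`, `same_region`, and `filter_candidates`, as well as the gene symbols, are hypothetical and not part of BITOLA.

```python
import re

# Chromosome, optionally followed by an arm (p or q) and a band designation.
_LOCATION = re.compile(r"^(\d{1,2}|X|Y)(?:([pq])([\d.]*))?$")


def parse_location(location):
    """Split a location such as '4q21.3' into (chromosome, arm, band)."""
    match = _LOCATION.match(location.strip())
    if match is None:
        return None
    return match.group(1), match.group(2) or "", match.group(3) or ""


def same_region(a, b):
    """True if one location lies within the other (coarse prefix test)."""
    pa, pb = parse_location(a), parse_location(b)
    if pa is None or pb is None:
        return False
    (chrom_a, arm_a, band_a), (chrom_b, arm_b, band_b) = pa, pb
    if chrom_a != chrom_b:
        return False
    if not arm_a or not arm_b:
        return True  # only the chromosome is known for one of them
    if arm_a != arm_b:
        return False
    return band_a.startswith(band_b) or band_b.startswith(band_a)


def filter_candidates(disease_region, gene_locations):
    """Keep candidate genes whose location matches the disease region."""
    return [gene for gene, location in gene_locations.items()
            if same_region(disease_region, location)]


# Toy candidate genes, invented for illustration.
gene_locations = {
    "GENE1": "4q21.3",
    "GENE2": "4p16",
    "GENE3": "14q21",
    "GENE4": "4q2",
}
matching = filter_candidates("4q21", gene_locations)
```

The chromosome is parsed separately from the band so that, for instance, a region on chromosome 4 is never confused with one on chromosome 14.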
Acknowledgments
Part of this research was conducted while Dimitar Hristovski was a postdoctoral fellow at the National Library of Medicine (NLM). Dimitar Hristovski thanks NLM and the ORISE program for support. In addition, we would like to thank Tom Rindflesch and Neil Smalheiser for providing valuable comments and insights.
References (25)
- et al. Guidelines for human gene nomenclature. Genomics (2002)
- Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect. Biol. Med. (1986)
- et al. Toward discovery support systems: a replication, re-examination, and extension of Swanson's work on literature-based discovery of a connection between Raynaud's and fish oil. J. Am. Soc. Inf. Sci. (1996)
- et al. Using concepts in literature-based discovery: simulating Swanson's Raynaud-fish oil and migraine-magnesium discoveries. J. Am. Soc. Inf. Sci. Technol. (2001)
- et al. Generating hypotheses by discovering implicit associations in the literature: a case report of a search for new potential therapeutic uses for thalidomide. J. Am. Med. Inform. Assoc. (2003)
- Text mining: generating hypotheses from MEDLINE. J. Am. Soc. Inf. Sci. Technol. (2004)
- et al. Supporting discovery in medicine by association rule mining in MEDLINE and UMLS. Medinformation (2001)
- et al. A literature network of human genes for high-throughput analysis of gene expression. Nat. Genet. (2001)
- et al. Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in MEDLINE abstracts. Pac. Symp. Biocomput. (2000)
- et al. Detecting gene relations from MEDLINE abstracts. Pac. Symp. Biocomput. (2001)